Efficient Triton Kernels for LLM Training
FlagGems is an operator library for large language models implemented in the Triton Language.
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
A lightweight LLaMA-like LLM inference framework built on Triton kernels.
Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels.
A "standard library" of Triton kernels.
Manifold-Constrained Hyper-Connections with fused Triton kernels for efficient training
Educational resource demonstrating common GPU programming pitfalls and solutions using Triton kernels.
Official code for the paper "ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces" (ICML 2025).
KernelHeim – a development ground for custom Triton and CUDA kernels designed to optimize and accelerate machine learning workloads on NVIDIA GPUs. Inspired by the mythical stronghold of the gods, KernelHeim is a forge where high-performance kernels are crafted to unlock the full potential of the hardware.
A container of various PyTorch neural network modules written in Triton.
Repository for learning Triton GPU programming
FlashAttention2 Analysis in Triton
💥 Optimize linear attention models with efficient Triton-based implementations in PyTorch, compatible across NVIDIA, AMD, and Intel platforms.
Yandex LLM Scaling Week 2025
A memory-efficient and CUDA-independent Triton implementation of Sparse Convolution, optimized for high-performance 3D Perception.
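Most repositories above build on the same Triton programming model: a kernel is written once and launched as a grid of program instances, each handling one fixed-size tile of the data. A minimal pure-Python emulation of that tiling idea (illustrative only; real Triton code uses `@triton.jit`, `tl.program_id`, and masked `tl.load`/`tl.store`, and the block size below is an arbitrary choice):

```python
BLOCK_SIZE = 4  # tile width; real kernels tune this per GPU

def vector_add_kernel(x, y, out, pid, block_size):
    """Emulates one Triton program instance: computes one tile of x + y."""
    start = pid * block_size
    # Clamp to the array length, as masked tl.load/tl.store would do
    # for out-of-bounds lanes in the final partial tile.
    for i in range(start, min(start + block_size, len(x))):
        out[i] = x[i] + y[i]

def vector_add(x, y):
    """Launches a 1D 'grid' of program instances, one per tile."""
    out = [0.0] * len(x)
    num_programs = (len(x) + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil division
    for pid in range(num_programs):
        vector_add_kernel(x, y, out, pid, BLOCK_SIZE)
    return out

print(vector_add([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
# → [11.0, 22.0, 33.0, 44.0, 55.0]
```

On a GPU the loop over `pid` disappears: every program instance runs in parallel, which is why block size and masking are the central tuning and correctness concerns in the kernel libraries listed here.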