Welcome to FlashInfer’s documentation!
Blog | Discussion Forum | GitHub
FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
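As a quick taste of the PyTorch API, the sketch below runs single-request decode attention. It is a minimal, illustrative example: shapes follow the default NHD layout ([seq_len, num_heads, head_dim] for the KV cache), the sizes are arbitrary, and using fewer KV heads than query heads exercises grouped-query attention.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 2048

# single decode step: the query has shape [num_qo_heads, head_dim]
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
# KV cache in the default NHD layout: [kv_len, num_kv_heads, head_dim]
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# fused decode attention over the KV cache; returns [num_qo_heads, head_dim]
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```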
Get Started
Tutorials
PyTorch API Reference
- FlashInfer Attention Kernels
- flashinfer.gemm
- flashinfer.fused_moe
- flashinfer.cascade
- flashinfer.comm
- flashinfer.sparse
- flashinfer.page
- flashinfer.sampling (usage sketch after this list)
  - flashinfer.sampling.sampling_from_probs
  - flashinfer.sampling.top_p_sampling_from_probs
  - flashinfer.sampling.top_k_sampling_from_probs
  - flashinfer.sampling.min_p_sampling_from_probs
  - flashinfer.sampling.top_k_top_p_sampling_from_logits
  - flashinfer.sampling.top_k_top_p_sampling_from_probs
  - flashinfer.sampling.top_p_renorm_probs
  - flashinfer.sampling.top_k_renorm_probs
  - flashinfer.sampling.top_k_mask_logits
  - flashinfer.sampling.chain_speculative_sampling
- flashinfer.logits_processor
- flashinfer.norm
- flashinfer.rope (usage sketch after this list)
  - flashinfer.rope.apply_rope_inplace
  - flashinfer.rope.apply_llama31_rope_inplace
  - flashinfer.rope.apply_rope
  - flashinfer.rope.apply_llama31_rope
  - flashinfer.rope.apply_rope_pos_ids
  - flashinfer.rope.apply_rope_pos_ids_inplace
  - flashinfer.rope.apply_llama31_rope_pos_ids
  - flashinfer.rope.apply_llama31_rope_pos_ids_inplace
  - flashinfer.rope.apply_rope_with_cos_sin_cache
  - flashinfer.rope.apply_rope_with_cos_sin_cache_inplace
- flashinfer.activation
- flashinfer.quantization
- flashinfer.green_ctx
- flashinfer.fp4_quantization
- flashinfer.testing
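The sketch below illustrates the flashinfer.sampling entries listed above with nucleus (top-p) sampling from a probability tensor. It assumes a release in which top_p_sampling_from_probs draws its uniform noise internally; earlier releases required an explicit uniform_samples argument and returned a success mask, so check the signature for your installed version.

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# nucleus (top-p) sampling from the probability tensor; top_p may be a
# Python float or a per-request tensor of shape [batch_size]
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
```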
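Similarly, a minimal sketch for the flashinfer.rope entries, using the out-of-place apply_rope_pos_ids variant. It assumes the ragged [num_tokens, num_heads, head_dim] layout with one int32 position id per token, and leaves the rotary parameters (rotary_dim, interleave, rope_theta) at their defaults, which may differ across releases.

```python
import torch
import flashinfer

num_tokens, num_heads, head_dim = 16, 32, 128
q = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
# absolute position of each token within its sequence
pos_ids = torch.arange(num_tokens, dtype=torch.int32, device="cuda")

# out-of-place rotary embedding; the *_inplace variants mutate q and k instead
q_rot, k_rot = flashinfer.rope.apply_rope_pos_ids(q, k, pos_ids)
```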