Welcome to FlashInfer’s documentation!
Blog | Discussion Forum | GitHub
FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention, and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
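As a quick taste of the API, below is a minimal sketch of single-request decode attention in the style of FlashInfer's quickstart. The tensor shapes and dtypes are illustrative only, and keyword arguments may differ between releases:

```python
import torch
import flashinfer

# Illustrative shapes for one decode step of a grouped-query attention model.
num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 2048

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Single-request decode attention over the KV cache; returns [num_qo_heads, head_dim].
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```

For batched serving workloads, the paged-KV-cache wrappers in flashinfer.decode and flashinfer.prefill (listed below) are the intended entry points.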
- flashinfer.decode
- flashinfer.prefill
- flashinfer.cascade
- flashinfer.sparse
- flashinfer.page
- flashinfer.sampling
  - flashinfer.sampling.sampling_from_probs
  - flashinfer.sampling.top_p_sampling_from_probs
  - flashinfer.sampling.top_k_sampling_from_probs
  - flashinfer.sampling.min_p_sampling_from_probs
  - flashinfer.sampling.top_k_top_p_sampling_from_logits
  - flashinfer.sampling.top_k_top_p_sampling_from_probs
  - flashinfer.sampling.top_p_renorm_probs
  - flashinfer.sampling.top_k_renorm_probs
  - flashinfer.sampling.top_k_mask_logits
  - flashinfer.sampling.chain_speculative_sampling
- flashinfer.gemm
- flashinfer.norm
- flashinfer.rope
  - flashinfer.rope.apply_rope_inplace
  - flashinfer.rope.apply_llama31_rope_inplace
  - flashinfer.rope.apply_rope
  - flashinfer.rope.apply_llama31_rope
  - flashinfer.rope.apply_rope_pos_ids
  - flashinfer.rope.apply_rope_pos_ids_inplace
  - flashinfer.rope.apply_llama31_rope_pos_ids
  - flashinfer.rope.apply_llama31_rope_pos_ids_inplace
  - flashinfer.rope.apply_rope_with_cos_sin_cache
  - flashinfer.rope.apply_rope_with_cos_sin_cache_inplace
- flashinfer.activation
- flashinfer.quantization
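As one example of the sampling operators listed above, the sketch below draws tokens with nucleus (top-p) sampling from a probability distribution. It assumes a recent FlashInfer release; older versions required an explicit tensor of uniform random samples and returned a (samples, success) pair, so check the flashinfer.sampling reference for your installed version:

```python
import torch
import flashinfer

torch.manual_seed(0)
batch_size, vocab_size = 4, 32000

# Normalized probabilities, e.g. the softmax of the model's last-token logits.
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# Nucleus (top-p) sampling; recent releases draw the uniform noise internally.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)  # [batch_size]
```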