Welcome to FlashInfer’s documentation!
FlashInfer is a library for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention, and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
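For example, a single decode step (one query token attending to a full KV cache) can be dispatched with flashinfer.single_decode_with_kv_cache. The sketch below assumes the tensor shapes used in the flashinfer.decode reference; check that page for the authoritative signature.

```python
import torch
import flashinfer

kv_len, num_qo_heads, num_kv_heads, head_dim = 4096, 32, 32, 128

# One query token per request; K/V hold the cached context.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Fused decode attention kernel; returns a [num_qo_heads, head_dim] output.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```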
- flashinfer.decode
- flashinfer.prefill
- flashinfer.cascade
- flashinfer.sparse
- flashinfer.page
- flashinfer.sampling (a usage sketch follows this list)
  - flashinfer.sampling.sampling_from_probs
  - flashinfer.sampling.top_p_sampling_from_probs
  - flashinfer.sampling.top_k_sampling_from_probs
  - flashinfer.sampling.min_p_sampling_from_probs
  - flashinfer.sampling.top_k_top_p_sampling_from_logits
  - flashinfer.sampling.top_k_top_p_sampling_from_probs
  - flashinfer.sampling.top_p_renorm_probs
  - flashinfer.sampling.top_k_renorm_probs
  - flashinfer.sampling.top_k_mask_logits
  - flashinfer.sampling.chain_speculative_sampling
- flashinfer.gemm
- flashinfer.norm
- flashinfer.rope
- flashinfer.quantization
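The flashinfer.sampling entries above expose fused GPU sampling kernels that avoid a full vocabulary sort. Below is a minimal top-p (nucleus) sampling sketch; the exact signature has varied between releases (older versions also took a uniform_samples tensor), so treat the call as an assumption and consult the flashinfer.sampling reference.

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000

# Row-normalized next-token probabilities, e.g. softmax over LM-head logits.
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# Fused nucleus (top-p) sampling; returns one sampled token id per batch row.
# Signature assumed from recent releases.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
```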