Welcome to FlashInfer’s documentation!
FlashInfer is a library for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention, and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
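For example, a single decode step (one query token attending to a full KV cache) can be dispatched with flashinfer.single_decode_with_kv_cache. The sketch below assumes the tensor shapes used in the flashinfer.decode reference; check that page for the authoritative signature.

```python
import torch
import flashinfer

kv_len, num_qo_heads, num_kv_heads, head_dim = 4096, 32, 32, 128

# One query token per request; K/V hold the cached context.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.half, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.half, device="cuda")

# Fused decode attention kernel; returns a [num_qo_heads, head_dim] output.
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```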
- flashinfer.decode
- flashinfer.prefill
- flashinfer.cascade
- flashinfer.sparse
- flashinfer.page
- flashinfer.sampling (a usage sketch follows this list)
  - flashinfer.sampling.sampling_from_probs
  - flashinfer.sampling.top_p_sampling_from_probs
  - flashinfer.sampling.top_k_sampling_from_probs
  - flashinfer.sampling.min_p_sampling_from_probs
  - flashinfer.sampling.top_k_top_p_sampling_from_logits
  - flashinfer.sampling.top_k_top_p_sampling_from_probs
  - flashinfer.sampling.top_p_renorm_probs
  - flashinfer.sampling.top_k_renorm_probs
  - flashinfer.sampling.top_k_mask_logits
  - flashinfer.sampling.chain_speculative_sampling
- flashinfer.gemm
- flashinfer.norm
- flashinfer.rope
- flashinfer.quantization
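The flashinfer.sampling entries above expose fused GPU sampling kernels that avoid a full vocabulary sort. Below is a minimal top-p (nucleus) sampling sketch; the exact signature has varied between releases (older versions also took a uniform_samples tensor), so treat the call as an assumption and consult the flashinfer.sampling reference.

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000

# Row-normalized next-token probabilities, e.g. softmax over LM-head logits.
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# Fused nucleus (top-p) sampling; returns one sampled token id per batch row.
# Signature assumed from recent releases.
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
```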