Welcome to FlashInfer’s documentation!
Blog | Discussion Forum | GitHub
FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
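As a quick taste of the PyTorch API, the sketch below runs single-request decode attention. It is a minimal, illustrative example: shapes follow the default NHD layout ([seq_len, num_heads, head_dim] for the KV cache), the sizes are arbitrary, and using fewer KV heads than query heads exercises grouped-query attention.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 2048

# single decode step: the query has shape [num_qo_heads, head_dim]
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
# KV cache in the default NHD layout: [kv_len, num_kv_heads, head_dim]
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# fused decode attention over the KV cache; returns [num_qo_heads, head_dim]
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```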
Get Started
Tutorials
PyTorch API Reference
- FlashInfer Attention Kernels
- flashinfer.gemm
- flashinfer.fused_moe
- flashinfer.cascade
- flashinfer.comm
- flashinfer.sparse
- flashinfer.page
- flashinfer.sampling (usage sketch after this list)
  - flashinfer.sampling.sampling_from_probs
  - flashinfer.sampling.top_p_sampling_from_probs
  - flashinfer.sampling.top_k_sampling_from_probs
  - flashinfer.sampling.min_p_sampling_from_probs
  - flashinfer.sampling.top_k_top_p_sampling_from_logits
  - flashinfer.sampling.top_k_top_p_sampling_from_probs
  - flashinfer.sampling.top_p_renorm_probs
  - flashinfer.sampling.top_k_renorm_probs
  - flashinfer.sampling.top_k_mask_logits
  - flashinfer.sampling.chain_speculative_sampling
- flashinfer.logits_processor
- flashinfer.norm
- flashinfer.rope (usage sketch after this list)
  - flashinfer.rope.apply_rope_inplace
  - flashinfer.rope.apply_llama31_rope_inplace
  - flashinfer.rope.apply_rope
  - flashinfer.rope.apply_llama31_rope
  - flashinfer.rope.apply_rope_pos_ids
  - flashinfer.rope.apply_rope_pos_ids_inplace
  - flashinfer.rope.apply_llama31_rope_pos_ids
  - flashinfer.rope.apply_llama31_rope_pos_ids_inplace
  - flashinfer.rope.apply_rope_with_cos_sin_cache
  - flashinfer.rope.apply_rope_with_cos_sin_cache_inplace
- flashinfer.activation
- flashinfer.quantization
- flashinfer.green_ctx
- flashinfer.fp4_quantization
- flashinfer.testing
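The sketch below illustrates the flashinfer.sampling entries listed above with nucleus (top-p) sampling from a probability tensor. It assumes a release in which top_p_sampling_from_probs draws its uniform noise internally; earlier releases required an explicit uniform_samples argument and returned a success mask, so check the signature for your installed version.

```python
import torch
import flashinfer

batch_size, vocab_size = 4, 32000
logits = torch.randn(batch_size, vocab_size, device="cuda")
probs = torch.softmax(logits, dim=-1)

# nucleus (top-p) sampling from the probability tensor; top_p may be a
# Python float or a per-request tensor of shape [batch_size]
samples = flashinfer.sampling.top_p_sampling_from_probs(probs, top_p=0.9)
```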
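Similarly, a minimal sketch for the flashinfer.rope entries, using the out-of-place apply_rope_pos_ids variant. It assumes the ragged [num_tokens, num_heads, head_dim] layout with one int32 position id per token, and leaves the rotary parameters (rotary_dim, interleave, rope_theta) at their defaults, which may differ across releases.

```python
import torch
import flashinfer

num_tokens, num_heads, head_dim = 16, 32, 128
q = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
# absolute position of each token within its sequence
pos_ids = torch.arange(num_tokens, dtype=torch.int32, device="cuda")

# out-of-place rotary embedding; the *_inplace variants mutate q and k instead
q_rot, k_rot = flashinfer.rope.apply_rope_pos_ids(q, k, pos_ids)
```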