flashinfer.quantization

Quantization kernels for FP4 and FP8, plus packbits utilities.

Types and Enums

`SfLayout`(value[, names, module, qualname, ...])	Layout of scale factors for quantization.

Packbits Utilities

`packbits`(x[, bitorder])	Pack the elements of a binary-valued array into bits in a uint8 array.
`segment_packbits`(x, indptr[, bitorder])	Pack a batch of binary-valued segments into bits in a uint8 array.
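
The packing semantics summarized above can be illustrated with a minimal pure-Python reference. This is a sketch, not the library's kernel: `packbits_ref` and `segment_packbits_ref` are hypothetical names, the zero-padding of a partial final byte is assumed to follow the convention of `numpy.packbits`, and returning a new `indptr` over the packed bytes is likewise an assumption.

```python
from typing import List, Sequence, Tuple


def packbits_ref(bits: Sequence[int], bitorder: str = "big") -> List[int]:
    """Pack 0/1 values into bytes (hypothetical reference, not the GPU kernel)."""
    if bitorder not in ("big", "little"):
        raise ValueError("bitorder must be 'big' or 'little'")
    out = []
    for start in range(0, len(bits), 8):
        chunk = list(bits[start:start + 8])
        chunk += [0] * (8 - len(chunk))  # assumed: zero-pad the final partial byte
        byte = 0
        for i, b in enumerate(chunk):
            # "big": first element lands in the most significant bit.
            shift = 7 - i if bitorder == "big" else i
            byte |= (1 if b else 0) << shift
        out.append(byte)
    return out


def segment_packbits_ref(
    bits: Sequence[int], indptr: Sequence[int], bitorder: str = "big"
) -> Tuple[List[int], List[int]]:
    """Pack each segment bits[indptr[i]:indptr[i+1]] independently."""
    packed: List[int] = []
    new_indptr = [0]
    for lo, hi in zip(indptr[:-1], indptr[1:]):
        packed.extend(packbits_ref(bits[lo:hi], bitorder))
        new_indptr.append(len(packed))  # assumed: offsets into the packed bytes
    return packed, new_indptr
```

Because each segment is padded separately, a batch of segments generally packs to more bytes than the same bits packed as one flat array.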

FP4 Quantization

Core kernels for NVFP4 / MXFP4 (de)quantization and the scale-factor layout helpers used by the FP4 GEMM/MoE pipelines.

`fp4_quantize`(input[, global_scale, ...])	Quantize input tensor to FP4 format.
`nvfp4_quantize`(a, a_global_sf[, sfLayout, ...])	Quantize input tensor to NVFP4 format.
`nvfp4_batched_quantize`(a, a_global_sf[, ...])	Quantize batched input tensor to NVFP4 format.
`mxfp4_quantize`(a[, backend, enable_pdl, ...])	Quantize input tensor to MXFP4 format.
`mxfp4_dequantize`(a_fp4, a_sf[, sfLayout])	Dequantize MXFP4 packed weights back to float32.
`mxfp4_dequantize_host`(weight, scale[, ...])	Host-side MXFP4 dequantization.
`block_scale_interleave`(unswizzled_sf)	Swizzle a block-scale tensor for FP4 layouts.
`e2m1_and_ufp8sf_scale_to_float`(e2m1_tensor, ...)	Dequantize an E2M1 tensor with UFP8 scales back to float32.
`scaled_fp4_grouped_quantize`(a, mask, a_global_sf)	Quantize a batched input tensor to NVFP4 with a per-row mask.
`silu_and_mul_nvfp4_quantize`(input, global_scale)	Apply SwiGLU and NVFP4 quantization in one CuTe-DSL kernel.
`shuffle_matrix_a`(input_tensor, epilogue_tile_m)	PyTorch equivalent of TRT-LLM-gen `shuffleMatrixA`.
`shuffle_matrix_sf_a`(input_tensor, ...[, ...])	CUDA implementation of TRT-LLM-gen `shuffleMatrixSfA` for linear-layout SF.

Note

flashinfer.quantization.nvfp4_block_scale_interleave is an alias for block_scale_interleave() (same Python object), so either name can be used. Only the canonical block_scale_interleave is documented here, which avoids Sphinx duplicate object description warnings under -W.
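
To make the idea of block-scaled FP4 concrete, the following pure-Python sketch quantizes values onto the E2M1 magnitude grid with one scale per block. It is an illustration under stated assumptions, not the kernels listed above: all function names are hypothetical, the scale is kept as a plain Python float (NVFP4 is commonly described as using 16-element blocks with an FP8 scale plus a global scale, MXFP4 as using 32-element blocks with a power-of-two scale), and the rounding and scale-selection rules are simplified.

```python
import math
from typing import List, Sequence, Tuple

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(
    block: Sequence[float], power_of_two_scale: bool = False
) -> Tuple[float, List[float]]:
    """Return (scale, grid values) for one block. Hypothetical reference only."""
    amax = max((abs(v) for v in block), default=0.0)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / 6.0  # map the block's largest magnitude onto the grid maximum
    if power_of_two_scale:
        # Assumed rule for an MX-style scale: round up to the next power of two.
        scale = 2.0 ** math.ceil(math.log2(scale))
    codes = []
    for v in block:
        # Nearest grid point; tie-breaking here is simplified, not hardware-exact.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def quantize(
    values: Sequence[float], block_size: int, power_of_two_scale: bool = False
) -> Tuple[List[float], List[float]]:
    """Quantize a flat sequence block by block."""
    scales: List[float] = []
    codes: List[float] = []
    for start in range(0, len(values), block_size):
        s, c = quantize_block(values[start:start + block_size], power_of_two_scale)
        scales.append(s)
        codes.extend(c)
    return scales, codes


def dequantize(scales: Sequence[float], codes: Sequence[float], block_size: int) -> List[float]:
    """Multiply each grid value by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]
```

Values that already sit on the scaled grid round-trip exactly; everything else picks up an error bounded by the grid spacing times the block scale, which is why smaller blocks tend to preserve more precision.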

FP4 KV Cache Quantization

GPU-accelerated quantization / dequantization for KV-cache data using the linear (non-swizzled) block-scale layout.

nvfp4_kv_dequantize(): SM80+ (Ampere and later)
nvfp4_kv_dequantize_paged(): SM80+ (Ampere and later)
nvfp4_kv_quantize(): SM100+ (Blackwell and later)
nvfp4_quantize_paged_kv_cache()

`nvfp4_kv_quantize`(input, global_scale)	GPU quantization to the NVFP4 KV-cache format with linear block-scale layout.
`nvfp4_kv_dequantize`(fp4_data, block_scales, ...)	GPU dequantization of an NVFP4 KV cache with linear block-scale layout.
`nvfp4_kv_dequantize_paged`(paged_kv_cache, ...)	Dequantize a paged NVFP4 KV cache into caller-owned contiguous outputs.
`nvfp4_quantize_paged_kv_cache`(k_cache, v_cache)	Quantize a paged KV cache to NVFP4 for the trtllm-gen MHA kernel.
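
The architecture requirements listed in this section can be encoded as a small lookup for guarding calls. This is a hypothetical helper, not part of the library: the mapping only restates the SM80+ and SM100+ requirements given above, and it assumes that SM80 and SM100 correspond to compute capabilities (8, 0) and (10, 0). No requirement is stated for nvfp4_quantize_paged_kv_cache, so the helper reports it as unknown.

```python
from typing import Optional, Tuple

# Minimum compute capability per function, as stated in this section.
MIN_CAPABILITY = {
    "nvfp4_kv_dequantize": (8, 0),        # SM80+
    "nvfp4_kv_dequantize_paged": (8, 0),  # SM80+
    "nvfp4_kv_quantize": (10, 0),         # SM100+
}


def is_supported(fn_name: str, capability: Tuple[int, int]) -> Optional[bool]:
    """Return True/False if a requirement is documented, else None (unknown)."""
    required = MIN_CAPABILITY.get(fn_name)
    if required is None:
        return None
    # Tuples compare lexicographically: major first, then minor.
    return tuple(capability) >= required
```

With PyTorch, the `capability` argument can be obtained from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.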

FP8 Quantization

`mxfp8_quantize`(input[, ...])	Quantize input tensor to MXFP8 format.
`mxfp8_grouped_quantize`(a, mask)	Quantize grouped inputs to MXFP8 with UE8M0 block scales.
`mxfp8_dequantize_host`(input, scale_tensor[, ...])	Host-side dequantization of an MXFP8 tensor back to float32.

Note

mxfp8_grouped_quantize uses a cuTile backend and requires SM100+ and cuda.tile (a requirements.txt dependency). K must be divisible by 32 and is padded internally to 128-column tiles.
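
The shape constraint in the note above can be checked before calling the kernel. The helper below is a hypothetical sketch: it validates the divisibility-by-32 rule and computes the width after rounding K up to a multiple of 128, and it assumes one scale per 32 elements along K, which the note implies but does not state.

```python
from typing import Dict


def mxfp8_grouped_shapes(k: int, block: int = 32, tile: int = 128) -> Dict[str, int]:
    """Validate K and report derived sizes (hypothetical helper, not library API)."""
    if k <= 0 or k % block != 0:
        raise ValueError(f"K={k} must be a positive multiple of {block}")
    padded_k = -(-k // tile) * tile  # ceiling division, then back to columns
    return {
        "scale_blocks": k // block,  # assumed: one block scale per 32 elements
        "padded_k": padded_k,
    }
```

For example, K = 160 passes validation (160 = 5 × 32) and is padded to 256 columns, while K = 48 is rejected.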

CuTe-DSL Quantization Kernels (experimental)

The CuTe-DSL backends are available only when the nvidia-cutlass-dsl package is installed. When available, they are also re-exported at runtime as flashinfer.quantization.{nvfp4,mxfp4,mxfp8}_quantize_cute_dsl. They are documented here under their canonical submodule path so that the docs build does not depend on the CuTe-DSL stack being importable.

`nvfp4_quantize_cute_dsl`(input, global_scale)	Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.
`nvfp4_quantize_per_token_cute_dsl`(input, ...)	Per-token NVFP4 activation quantization using the CuTe-DSL kernel.

`mxfp4_quantize_cute_dsl`(input[, sf_layout, ...])	Quantize input tensor to MXFP4 format using the CuTe-DSL kernel.
`mxfp8_quantize_cute_dsl`(input[, ...])	Quantize input tensor to MXFP8 format using the CuTe-DSL kernel.
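
Because these kernels are only conditionally re-exported, calling code may want to probe for them instead of importing them unconditionally. The sketch below shows one way to do that with the standard library; `optional_attr` is a hypothetical helper, and it assumes that a missing CuTe-DSL stack shows up as either a failed import or a missing attribute.

```python
import importlib
from typing import Any, Optional


def optional_attr(module_name: str, attr: str) -> Optional[Any]:
    """Return module.attr if both exist, else None (hypothetical helper)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # module (or one of its dependencies) is not installed
    return getattr(module, attr, None)


# None when flashinfer or the CuTe-DSL re-export is unavailable.
nvfp4_cute = optional_attr("flashinfer.quantization", "nvfp4_quantize_cute_dsl")
```

A caller can then branch on `nvfp4_cute is None` and fall back to another entry point such as `nvfp4_quantize`.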