flashinfer.quantization

Quantization kernels for FP4 and FP8, plus packbits utilities.

Types and Enums

SfLayout(value[, names, module, qualname, ...])

Layout of scale factors for quantization.

Packbits Utilities

packbits(x[, bitorder])

Pack the elements of a binary-valued array into bits in a uint8 array.
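The summary above matches the wording of numpy.packbits, so a pure-Python reference of that convention may help when checking results. This is a minimal sketch, assuming packbits follows the NumPy semantics (zero-padding of the last byte, bitorder of "big" or "little"); packbits_reference is a hypothetical helper, not part of flashinfer.

```python
def packbits_reference(bits, bitorder="big"):
    """Pack a flat sequence of 0/1 values into a list of byte values (0..255)."""
    if bitorder not in ("big", "little"):
        raise ValueError("bitorder must be 'big' or 'little'")
    packed = []
    for start in range(0, len(bits), 8):
        chunk = list(bits[start:start + 8])
        chunk += [0] * (8 - len(chunk))  # zero-pad the final partial byte
        byte = 0
        for position, bit in enumerate(chunk):
            # "big": first element lands in the most significant bit.
            shift = 7 - position if bitorder == "big" else position
            byte |= (1 if bit else 0) << shift
        packed.append(byte)
    return packed


print(packbits_reference([1, 0, 1, 1, 0, 0, 0, 0, 1]))            # [176, 128]
print(packbits_reference([1, 0, 1, 1, 0, 0, 0, 0, 1], "little"))  # [13, 1]
```

Nine input bits produce two bytes because the trailing bit is padded out to a full byte.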

segment_packbits(x, indptr[, bitorder])

Pack a batch of binary-valued segments into bits in a uint8 array.
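The segmented variant can be pictured as packing each segment on its own, so that every segment starts on a byte boundary. The sketch below is a pure-Python reference under that assumption; segment_packbits_reference and its returned new_indptr are illustrative names, and the real kernel's return values should be checked against its own documentation.

```python
def segment_packbits_reference(bits, indptr, bitorder="big"):
    """Pack each segment bits[indptr[i]:indptr[i+1]] separately.

    Returns (packed_bytes, new_indptr), where new_indptr gives the byte
    offsets of each packed segment.
    """
    if bitorder not in ("big", "little"):
        raise ValueError("bitorder must be 'big' or 'little'")
    packed = []
    new_indptr = [0]
    for begin, end in zip(indptr[:-1], indptr[1:]):
        segment = list(bits[begin:end])
        for start in range(0, len(segment), 8):
            chunk = segment[start:start + 8]
            chunk += [0] * (8 - len(chunk))  # pad within the segment only
            byte = 0
            for position, bit in enumerate(chunk):
                shift = 7 - position if bitorder == "big" else position
                byte |= (1 if bit else 0) << shift
            packed.append(byte)
        new_indptr.append(len(packed))
    return packed, new_indptr


bits = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
print(segment_packbits_reference(bits, [0, 4, 7, 11]))
# ([192, 128, 224], [0, 1, 2, 3])
```

Packing the same 11 bits without segmentation would give two bytes; here the three segments each occupy their own byte.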

FP4 Quantization

Core kernels for NVFP4 / MXFP4 (de)quantization and the scale-factor layout helpers used by the FP4 GEMM/MoE pipelines.

fp4_quantize(input[, global_scale, ...])

Quantize input tensor to FP4 format.

nvfp4_quantize(a, a_global_sf[, sfLayout, ...])

Quantize input tensor to NVFP4 format.

nvfp4_batched_quantize(a, a_global_sf[, ...])

Quantize batched input tensor to NVFP4 format.

mxfp4_quantize(a[, backend, enable_pdl])

Quantize input tensor to MXFP4 format.
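To make the target format concrete, here is a pure-Python sketch of MXFP4 quantization for one block, following the OCP Microscaling convention: 32 elements share one power-of-two scale (stored as E8M0, that is, exponent + 127), and each element is an E2M1 code. The function name, the tie-breaking rule, and the scale selection are assumptions for illustration; the GPU kernel may round differently.

```python
import math

# Magnitudes representable in E2M1, indexed by the low 3 bits of the code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32       # elements per shared scale in MXFP4
E2M1_MAX_EXPONENT = 2  # largest E2M1 value is 1.5 * 2**2 = 6.0


def quantize_block_mxfp4(block):
    """Return (codes, exponent) for one block of BLOCK_SIZE floats.

    Each code is a 4-bit integer: sign bit (bit 3) plus E2M1 magnitude index.
    The shared scale is 2.0 ** exponent.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} elements")
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        exponent = 0
    else:
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
        exponent = math.frexp(amax)[1] - 1 - E2M1_MAX_EXPONENT
    scale = 2.0 ** exponent
    codes = []
    for v in block:
        magnitude = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties go to the smaller value here.
        index = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - magnitude))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | index)
    return codes, exponent


block = [0.0, 0.5, -1.0, 6.0] + [0.0] * 28
codes, exponent = quantize_block_mxfp4(block)
print(codes[:4], exponent)  # [0, 1, 10, 7] 0
```

A real tensor would be split into consecutive 32-element blocks along the last dimension, with two codes packed per byte.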

mxfp4_dequantize(a_fp4, a_sf)

Dequantize MXFP4 packed weights back to float32.
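Dequantization reverses those steps: unpack two 4-bit codes per byte, look up the E2M1 magnitude, and multiply by the block's power-of-two scale. The sketch below assumes the even-indexed element sits in the low nibble and that scales are E8M0 bytes with bias 127; the nibble order in particular is an assumption and should be verified against the actual tensors. dequantize_mxfp4_reference is a hypothetical name.

```python
# Magnitudes representable in E2M1, indexed by the low 3 bits of the code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def dequantize_mxfp4_reference(packed, scale_bytes, block_size=32):
    """Decode packed E2M1 bytes into floats.

    packed      -- byte values, two 4-bit codes each (low nibble first, assumed)
    scale_bytes -- one E8M0 byte per block_size elements (scale = 2**(b - 127))
    """
    values = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            block_index = len(values) // block_size
            scale = 2.0 ** (scale_bytes[block_index] - 127)
            magnitude = E2M1_VALUES[code & 0x7]
            sign = -1.0 if code & 0x8 else 1.0
            values.append(sign * magnitude * scale)
    return values


# Codes 1, 2, 7, 15 with a shared scale of 2**(128 - 127) = 2.
print(dequantize_mxfp4_reference([0x21, 0xF7], [128]))
# [1.0, 2.0, 12.0, -12.0]
```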

mxfp4_dequantize_host(weight, scale[, ...])

Host-side MXFP4 dequantization.

block_scale_interleave(unswizzled_sf)

Swizzle a block-scale tensor for FP4 layouts.

e2m1_and_ufp8sf_scale_to_float(e2m1_tensor, ...)

Dequantize an E2M1 tensor with UFP8 scales back to float32.

scaled_fp4_grouped_quantize(a, mask, a_global_sf)

Quantize a batched input tensor to NVFP4 with a per-row mask.

shuffle_matrix_a(input_tensor, epilogue_tile_m)

PyTorch equivalent of TRT-LLM-gen shuffleMatrixA.

shuffle_matrix_sf_a(input_tensor, ...[, ...])

CUDA implementation of TRT-LLM-gen shuffleMatrixSfA for linear-layout SF.

Note

flashinfer.quantization.nvfp4_block_scale_interleave is an alias for block_scale_interleave() (the same Python object), so either name can be used. Only the canonical block_scale_interleave is documented here, which avoids Sphinx duplicate-object-description warnings when building with -W.

FP4 KV Cache Quantization

GPU-accelerated quantization / dequantization for KV-cache data using the linear (non-swizzled) block-scale layout.
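In a linear layout, the scale factor for an element can be located with plain row-major arithmetic, with no tile interleaving. The sketch below illustrates that indexing under two assumptions: NVFP4's 16-element block size, and blocks running along the last dimension. linear_scale_index is a hypothetical helper, and the actual tensor shapes used by the KV-cache kernels are not shown here.

```python
NVFP4_BLOCK_SIZE = 16  # elements sharing one block scale in NVFP4


def linear_scale_index(row, col, num_cols, block_size=NVFP4_BLOCK_SIZE):
    """Flat index of the block scale covering element (row, col).

    Assumes a row-major [num_rows, num_cols // block_size] scale tensor.
    """
    if num_cols % block_size != 0:
        raise ValueError("num_cols must be a multiple of block_size")
    if not 0 <= col < num_cols:
        raise IndexError("col out of range")
    blocks_per_row = num_cols // block_size
    return row * blocks_per_row + col // block_size


print(linear_scale_index(0, 17, num_cols=64))  # 1
print(linear_scale_index(2, 63, num_cols=64))  # 11
```

A swizzled layout, by contrast, permutes these indices into tiles, which is why the two layouts are not interchangeable without a conversion step.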

nvfp4_kv_quantize(input, global_scale)

GPU quantization to the NVFP4 KV-cache format with linear block-scale layout.

nvfp4_kv_dequantize(fp4_data, block_scales, ...)

GPU dequantization of an NVFP4 KV cache with linear block-scale layout.

nvfp4_quantize_paged_kv_cache(k_cache, v_cache)

Quantize a paged KV cache to NVFP4 for the trtllm-gen MHA kernel.

FP8 Quantization

mxfp8_quantize(input[, ...])

Quantize input tensor to MXFP8 format.

mxfp8_dequantize_host(input, scale_tensor[, ...])

Host-side dequantization of an MXFP8 tensor back to float32.

CuTe-DSL Quantization Kernels (experimental)

The CuTe-DSL backends are available only when the nvidia-cutlass-dsl package is installed. When available, they are also re-exported at runtime as flashinfer.quantization.{nvfp4,mxfp4,mxfp8}_quantize_cute_dsl; documenting them here via their canonical submodule path keeps the docs build from depending on the CuTe-DSL stack being importable.
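Because these names may be absent at runtime, calling code may want to probe for them instead of importing them unconditionally. The following is a minimal sketch of one way to do that; resolve_optional_kernel is a hypothetical helper, and it assumes a missing backend shows up either as an ImportError or as a missing attribute on the module.

```python
import importlib


def resolve_optional_kernel(name, module_name="flashinfer.quantization"):
    """Return the attribute `name` from `module_name`, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the package itself is not importable
    return getattr(module, name, None)  # None if the re-export is absent


quantize = resolve_optional_kernel("nvfp4_quantize_cute_dsl")
if quantize is None:
    print("CuTe-DSL NVFP4 kernel unavailable; use another backend")
```

Returning None keeps the availability check in one place, so callers can fall back to the non-CuTe-DSL functions listed earlier on this page.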

nvfp4_quantize_cute_dsl(input, global_scale)

Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.

nvfp4_quantize_per_token_cute_dsl(input, ...)

Per-token NVFP4 activation quantization using the CuTe-DSL kernel.

mxfp4_quantize_cute_dsl(input[, sf_layout, ...])

Quantize input tensor to MXFP4 format using the CuTe-DSL kernel.

mxfp8_quantize_cute_dsl(input[, ...])

Quantize input tensor to MXFP8 format using the CuTe-DSL kernel.