flashinfer.quantization
Quantization-related kernels for FP4, FP8, and packbits utilities.
Types and Enums
- Layout of scale factors for quantization.
Packbits Utilities
- Pack the elements of a binary-valued array into bits in a uint8 array.
- Pack a batch of binary-valued segments into bits in a uint8 array.
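The packing behaviour described above can be illustrated with a minimal pure-Python sketch. It is not the FlashInfer kernel: the function names, the `bitorder` argument (borrowed from `numpy.packbits`), the `indptr` segment convention, and the `(packed, new_indptr)` return shape are all assumptions made for illustration.

```python
def packbits(bits, bitorder="big"):
    """Pack a sequence of 0/1 values into a list of byte values (0..255)."""
    if bitorder not in ("big", "little"):
        raise ValueError("bitorder must be 'big' or 'little'")
    out = []
    for start in range(0, len(bits), 8):
        chunk = list(bits[start:start + 8])
        chunk += [0] * (8 - len(chunk))  # zero-pad the final partial byte
        byte = 0
        for i, b in enumerate(chunk):
            # "big": first element lands in the most significant bit
            shift = 7 - i if bitorder == "big" else i
            byte |= (1 if b else 0) << shift
        out.append(byte)
    return out


def segment_packbits(bits, indptr, bitorder="big"):
    """Pack each segment bits[indptr[i]:indptr[i+1]] independently.

    Every segment starts on a fresh byte, so the returned indptr
    (measured in bytes) differs from the input indptr (measured in bits).
    """
    packed, new_indptr = [], [0]
    for lo, hi in zip(indptr[:-1], indptr[1:]):
        packed.extend(packbits(bits[lo:hi], bitorder))
        new_indptr.append(len(packed))
    return packed, new_indptr
```

For example, `packbits([1, 0, 1, 1, 0, 0, 0, 0, 1])` yields `[176, 128]`: the ninth bit starts a second byte that is zero-padded on the right. Starting each segment on a byte boundary keeps segments independently addressable at the cost of up to seven padding bits per segment.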
FP4 Quantization
Core kernels for NVFP4 / MXFP4 (de)quantization and the scale-factor layout helpers used by the FP4 GEMM/MoE pipelines.
- Quantize input tensor to FP4 format.
- Quantize input tensor to NVFP4 format.
- Quantize batched input tensor to NVFP4 format.
- Quantize input tensor to MXFP4 format.
- Dequantize MXFP4 packed weights back to float32.
- Host-side MXFP4 dequantization.
- Swizzle a block-scale tensor for FP4 layouts.
- Dequantize an E2M1 tensor with UFP8 scales back to float32.
- Quantize a batched input tensor to NVFP4 with a per-row mask.
- PyTorch equivalent of TRT-LLM-gen
- CUDA implementation of TRT-LLM-gen
Note
flashinfer.quantization.nvfp4_block_scale_interleave is an alias
for block_scale_interleave() (same Python object). Use either
name; we document the canonical block_scale_interleave to avoid
Sphinx duplicate object description warnings under -W.
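To make the block-scaled FP4 idea concrete, here is a minimal pure-Python sketch of quantizing one block to E2M1 codes with a shared scale. It is a reference illustration only, not the FlashInfer kernels: the helper names are hypothetical, the scale is kept as a plain Python float rather than encoded as FP8 (NVFP4, typically 16-element blocks) or E8M0 (MXFP4, typically 32-element blocks), ties are broken toward the smaller magnitude rather than with the hardware rounding mode, and the low-nibble-first packing order is an assumption.

```python
# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_e2m1(block):
    """Return (4-bit codes, scale) so that value ~= decode(code) * scale."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto 6.0, the E2M1 maximum.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Nearest representable magnitude; min() keeps the first (smaller)
        # candidate on ties, which is a simplification.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | idx)
    return codes, scale


def dequantize_block_e2m1(codes, scale):
    """Inverse of quantize_block_e2m1, up to rounding error."""
    out = []
    for c in codes:
        mag = E2M1_VALUES[c & 0x7] * scale
        out.append(-mag if c & 0x8 else mag)
    return out


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed ordering)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return [codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2)]
```

A block such as `[12.0, 1.0]` gets scale `2.0` and codes `[7, 1]`, which decode back to `[12.0, 1.0]` exactly; values that fall between representable magnitudes are where the quantization error appears.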
FP4 KV Cache Quantization
GPU-accelerated quantization / dequantization for KV-cache data using the linear (non-swizzled) block-scale layout.
- nvfp4_kv_dequantize(): SM80+ (Ampere and later)
- nvfp4_kv_quantize(): SM100+ (Blackwell and later)
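The architecture requirements listed above can be checked before dispatching to either function. The sketch below is a hypothetical helper, not part of FlashInfer: it only encodes the two minimums stated here and compares them against a `(major, minor)` compute-capability tuple.

```python
# Minimum compute capability per function, taken from the list above.
MIN_COMPUTE_CAPABILITY = {
    "nvfp4_kv_dequantize": (8, 0),   # SM80+
    "nvfp4_kv_quantize": (10, 0),    # SM100+
}


def supported_kv_ops(capability):
    """Return the sorted names of KV-cache ops usable on this capability.

    `capability` is a (major, minor) pair; tuples compare lexicographically,
    so (9, 0) >= (8, 0) but (9, 0) < (10, 0).
    """
    cap = tuple(capability)
    return sorted(
        name for name, required in MIN_COMPUTE_CAPABILITY.items()
        if cap >= required
    )
```

With PyTorch installed, the tuple can be obtained from `torch.cuda.get_device_capability()`; on an SM90 device the helper would report only `nvfp4_kv_dequantize`.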
- GPU quantization to the NVFP4 KV-cache format with linear block-scale layout.
- GPU dequantization of an NVFP4 KV cache with linear block-scale layout.
- Quantize a paged KV cache to NVFP4 for the trtllm-gen MHA kernel.
FP8 Quantization
- Quantize input tensor to MxFP8 format.
- Host-side dequantization of an MxFP8 tensor back to float32.
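As a rough illustration of microscaling FP8, the sketch below computes a power-of-two block scale in the style of the OCP Microscaling convention (shared exponent = floor(log2(amax)) minus the element format's maximum exponent). This is an assumption about the general technique, not a description of the FlashInfer kernel: the helper names are hypothetical, the E4M3 element rounding is omitted, and the kernel's exact scale selection may differ.

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 magnitude (1.75 * 2**8)
E4M3_EMAX = 8      # exponent of that largest value


def mx_block_scale_exponent(block):
    """Power-of-two exponent shared by every element of the block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    return (e - 1) - E4M3_EMAX


def mxfp8_scale_block(block):
    """Divide the block by its shared scale and clamp to the E4M3 range.

    Returns (scaled_values, exponent). The scaled values would still need
    rounding to E4M3; an E8M0 scale byte would store exponent + 127.
    """
    exp = mx_block_scale_exponent(block)
    scale = 2.0 ** exp
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
    return scaled, exp
```

Because the scale is restricted to powers of two, a block maximum of 1000.0 maps to 500.0 after scaling and is clamped to 448.0, which is one source of saturation error in this scheme.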
CuTe-DSL Quantization Kernels (experimental)
The CuTe-DSL backends are conditionally available when the
nvidia-cutlass-dsl package is installed. At runtime they are also
re-exported as flashinfer.quantization.{nvfp4,mxfp4,mxfp8}_quantize_cute_dsl
when available; documenting them here via their canonical submodule
path keeps the docs build from depending on the CuTe-DSL stack being
importable.
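Because these backends exist only when the optional dependency is installed, calling code may want to probe for them rather than import them unconditionally. The following is a minimal sketch of that pattern using only the standard library; `module_available` and `optional_attr` are hypothetical helpers, not FlashInfer APIs.

```python
import importlib
import importlib.util


def module_available(name):
    """True if `name` can be located without raising on a missing parent."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is not installed.
        return False


def optional_attr(module_name, attr):
    """Return module.attr if both exist, otherwise None."""
    if not module_available(module_name):
        return None
    module = importlib.import_module(module_name)
    return getattr(module, attr, None)
```

For instance, `optional_attr("flashinfer.quantization", "nvfp4_quantize_cute_dsl")` would return the re-exported function when the CuTe-DSL stack is present and `None` otherwise, letting the caller fall back to a non-CuTe-DSL quantization path.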
- Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.
- Per-token NVFP4 activation quantization using the CuTe-DSL kernel.
- Quantize input tensor to MXFP4 format using the CuTe-DSL kernel.
- Quantize input tensor to MXFP8 format using the CuTe-DSL kernel.