flashinfer.quantization.nvfp4_quantize
- flashinfer.quantization.nvfp4_quantize(a, a_global_sf, sfLayout=SfLayout.layout_128x4, do_shuffle=False, sf_vec_size=16, enable_pdl=None, backend: str = 'cuda', per_token_activation: bool = False, expanded_idx_to_permuted_idx: Tensor | None = None)
Quantize input tensor to NVFP4 format.
- Parameters:
  - a (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/float8_e4m3fn.
  - a_global_sf (torch.Tensor) – Global scale factor of shape [1] with dtype float32.
  - sfLayout (SfLayout) – Scale-factor layout. Defaults to SfLayout.layout_128x4.
  - do_shuffle (bool) – Whether to shuffle the scale factors. Only the TRT-LLM backend needs to shuffle the tensor-B scale factors. Defaults to False.
  - sf_vec_size (int) – Scale-factor vector size. Defaults to 16.
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability when None.
  - backend (str) – Backend to use for quantization:
    - "cuda": stable CUDA kernel (default).
    - "cute-dsl": CuTe-DSL kernel (SM100+, experimental); supports all sfLayout values (layout_128x4 / layout_8x4 / layout_linear) and input dtypes fp16/bf16/float8_e4m3fn, but only sf_vec_size == 16.
  - per_token_activation (bool) – Whether to use per-token NVFP4 activation scaling. In this mode a_global_sf is the inverse base scale multiplier (typically 1 / (448 * 6)) and the function also returns per-token FP32 scales.
  - expanded_idx_to_permuted_idx (torch.Tensor, optional) – Optional row-remapping buffer for per-token activation quantization.
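The parameters above can be exercised with a call along the following lines. This is a minimal sketch, not an official example: the helper names `global_scale_from_amax` and `quantize_example` are hypothetical, and deriving the global scale as `(448 * 6) / amax` is an assumed convention (448 and 6 being the largest finite magnitudes of FP8 E4M3 and FP4 E2M1) that this page does not itself specify for the default mode. Running `quantize_example` requires a CUDA device with flashinfer and PyTorch installed.

```python
FLOAT8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
FLOAT4_E2M1_MAX = 6.0    # largest finite FP4 E2M1 magnitude


def global_scale_from_amax(amax: float) -> float:
    """Assumed convention: map the tensor's absolute maximum to 448 * 6."""
    if amax <= 0:
        raise ValueError("amax must be positive")
    return FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / amax


def quantize_example(m: int = 128, k: int = 64):
    """Hypothetical helper: quantize a random [m, k] bf16 tensor with defaults."""
    # Imported lazily so the pure-Python helper above works without a GPU stack.
    import torch
    from flashinfer.quantization import nvfp4_quantize

    a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    amax = a.abs().max().float().item()
    # a_global_sf must have shape [1] and dtype float32.
    a_global_sf = torch.tensor(
        [global_scale_from_amax(amax)], dtype=torch.float32, device="cuda"
    )
    # Defaults: sfLayout=SfLayout.layout_128x4, sf_vec_size=16, backend="cuda".
    x_q, sf = nvfp4_quantize(a, a_global_sf)
    return x_q, sf
```

On a suitable machine, `x_q, sf = quantize_example()` should return the packed tensor and its scale factors. Per-token mode is left out here because it changes both the meaning of a_global_sf and the number of returned tensors.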
- Returns:
(x_q, sf) where x_q has shape [M, K/2] with dtype FLOAT4_E2M1X2 and sf is the scale-factor tensor (shape depends on the layout and sf_vec_size). When per_token_activation=True, a third tensor containing per-token FP32 scales is also returned.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
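The return shapes can be worked out by hand, which is handy for preallocating or checking buffers. The sketch below uses a hypothetical helper, `nvfp4_output_shapes`, that computes the packed shape [M, K/2] and the logical number of scale factors (one per sf_vec_size elements of each row). It assumes K is even and divisible by sf_vec_size, and it deliberately ignores any padding or reordering that a particular sfLayout may apply, since this page does not spell those out.

```python
def nvfp4_output_shapes(m: int, k: int, sf_vec_size: int = 16):
    """Return ((M, K/2), logical scale-factor count) for an [M, K] input.

    Hypothetical helper for illustration; layout-specific padding of the
    scale-factor tensor is not modeled.
    """
    if k % 2 != 0:
        raise ValueError("K must be even: two FP4 values are packed per byte")
    if k % sf_vec_size != 0:
        raise ValueError("K is assumed to be a multiple of sf_vec_size")
    packed_shape = (m, k // 2)                  # two E2M1 values per element
    num_scale_factors = m * (k // sf_vec_size)  # one scale per vector of K
    return packed_shape, num_scale_factors


packed_shape, num_sf = nvfp4_output_shapes(128, 64)
print(packed_shape, num_sf)  # (128, 32) 512
```

The actual sf tensor may hold more elements than this logical count when the chosen layout pads rows or columns.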
Warning
The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.