flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_per_token_cute_dsl
- flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_per_token_cute_dsl(input: Tensor, global_scale_inv: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor, Tensor]
Per-token NVFP4 activation quantization using the CuTe-DSL kernel.
Unlike nvfp4_quantize_cute_dsl(), which applies a single global scale, this variant computes one quantization scale per row (token) of the activation. Each row is scaled independently so that its largest magnitude maps to the NVFP4 dynamic range, and the resulting per-token scale is returned alongside the packed FP4 output and the E4M3 block scale factors.
- E4M3 block scale factors (FP8), sf_vec_size = 16
- E2M1 output format (4-bit, 2 values per byte)
- Supports 128x4, 8x4, and linear scale-factor layouts
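The per-token scaling described above can be sketched in plain Python for a single row. This is a simplified illustration, not the kernel's arithmetic: the per-token scale formula (row maximum divided by 6 × 448), the low-nibble-first packing order, and the names encode_e2m1 and quantize_token are all assumptions made for this example. It also keeps block scales as Python floats instead of rounding them to E4M3, breaks rounding ties toward the smaller magnitude, and ignores global_scale_inv.

```python
# Illustrative sketch only; NOT the FlashInfer kernel's exact arithmetic.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 3-bit codes 0..7
E2M1_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0    # largest finite E4M3 value
SF_VEC_SIZE = 16    # elements sharing one block scale factor


def encode_e2m1(x):
    """Return the 4-bit E2M1 code (sign bit + 3-bit magnitude index) nearest to x."""
    mag = min(abs(x), E2M1_MAX)  # saturate instead of overflowing
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return (8 if x < 0 else 0) | idx


def quantize_token(row):
    """Quantize one row; returns (packed bytes, block scales, per-token scale)."""
    if len(row) % SF_VEC_SIZE != 0:
        raise ValueError("K must be divisible by 16")
    row_amax = max(abs(v) for v in row)
    # Assumed formula: the row's largest block scale lands on E4M3_MAX.
    token_scale = row_amax / (E2M1_MAX * E4M3_MAX) if row_amax > 0 else 1.0

    codes, block_scales = [], []
    for start in range(0, len(row), SF_VEC_SIZE):
        block = row[start:start + SF_VEC_SIZE]
        block_amax = max(abs(v) for v in block)
        # Kept as a float here; the real format stores this value as E4M3.
        block_scale = block_amax / (E2M1_MAX * token_scale)
        block_scales.append(block_scale)
        step = block_scale * token_scale  # real value of one E2M1 unit
        codes.extend(encode_e2m1(v / step) if step > 0 else 0 for v in block)

    # Two 4-bit codes per byte; low nibble first is an assumption.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, block_scales, token_scale


row = [0.0] * 15 + [6.0] + [1.0] * 16   # K = 32, so two blocks of 16
packed, block_scales, token_scale = quantize_token(row)
print(len(packed), len(block_scales))
```

Because the second block's maximum (1.0) is much smaller than the row maximum (6.0), it gets its own smaller block scale and still uses the full E2M1 range, which is the purpose of block scale factors on top of the per-token scale.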
The kernel is compiled once per (K, dtype, sf_layout, pdl) tuple and handles varying M (number of tokens) at runtime without recompilation.
- Parameters:
  - input (torch.Tensor) – 2-D activation tensor of shape [M, K] with dtype fp16/bf16. K must be divisible by NVFP4_SF_VEC_SIZE (16).
  - global_scale_inv (torch.Tensor) – Scalar tensor (float32) holding the inverse global scale applied on top of the per-token scale. A Python float is also accepted and wrapped into a tensor internally.
  - sf_layout (int) – Scale-factor layout (0 = 128x4, 1 = 8x4, 2 = linear).
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None; pass False to force it off.
- Returns:
(fp4_output, scale_output, per_token_scale) where:
  - fp4_output is the packed quantized tensor of shape [M, K/2] with dtype uint8 (two E2M1 values per byte).
  - scale_output holds the E4M3 block scale factors (uint8) reshaped to [padded_rows, padded_sf_cols]. The padding depends on sf_layout: linear keeps M rows, while 128x4/8x4 pad rows and columns up to the layout tile.
  - per_token_scale is the per-row quantization scale of shape [M] with dtype float32.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
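The output shapes described under Returns can be estimated ahead of time with a small helper. The sketch below is an assumption-laden illustration: expected_output_shapes is a hypothetical name, not part of FlashInfer, and the tile sizes are inferred from the layout names (128 or 8 rows by 4 scale-factor columns). It also assumes the linear layout leaves the scale-factor columns unpadded, which the text above does not state.

```python
NVFP4_SF_VEC_SIZE = 16  # block size stated in the parameter description

# Assumed (row_tile, col_tile) per sf_layout value, inferred from layout names.
_LAYOUT_TILES = {
    0: (128, 4),  # 128x4
    1: (8, 4),    # 8x4
    2: (1, 1),    # linear: no padding assumed
}


def _round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def expected_output_shapes(M, K, sf_layout=0):
    """Hypothetical helper: shapes of the three returned tensors."""
    if K % NVFP4_SF_VEC_SIZE != 0:
        raise ValueError("K must be divisible by NVFP4_SF_VEC_SIZE (16)")
    if sf_layout not in _LAYOUT_TILES:
        raise ValueError("sf_layout must be 0 (128x4), 1 (8x4), or 2 (linear)")
    row_tile, col_tile = _LAYOUT_TILES[sf_layout]
    sf_cols = K // NVFP4_SF_VEC_SIZE  # one scale factor per 16 elements
    return {
        "fp4_output": (M, K // 2),  # two E2M1 values per byte
        "scale_output": (_round_up(M, row_tile), _round_up(sf_cols, col_tile)),
        "per_token_scale": (M,),
    }


shapes = expected_output_shapes(M=100, K=4096, sf_layout=0)
print(shapes)
```

Comparing these values against the shapes of the tensors actually returned for a given input is a quick way to check whether the padding assumptions hold for the installed version.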