flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl
- flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl(input: Tensor, global_scale: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]
Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.
GPU implementation matching flashinfer.quantization.nvfp4_quantize():
- E4M3 scale factors (FP8)
- E2M1 output format (4-bit, 2 values per byte)
- Supports 128x4, 8x4, and linear scale-factor layouts
- sf_vec_size = 16
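The format properties listed above can be illustrated with a simplified, pure-Python sketch of quantizing a single 16-element block. This is not the CuTe-DSL kernel. It keeps the block scale as a plain float instead of rounding it to E4M3, and it ignores the global scale. The helper names (encode_e2m1, decode_e2m1, quantize_block, dequantize_block) are hypothetical, and the low-nibble-first packing order is an assumption that this page does not confirm.

```python
# E2M1 magnitudes indexed by the 3-bit (exponent, mantissa) code; bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
SF_VEC_SIZE = 16  # number of elements that share one scale factor


def encode_e2m1(x):
    """Return the 4-bit code of the E2M1 value nearest to x, clamped to +/-6."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)
    # Ties resolve to the smaller magnitude here; real hardware rounding may differ.
    code = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
    return sign | code


def decode_e2m1(code):
    """Map a 4-bit E2M1 code back to its float value."""
    value = E2M1_VALUES[code & 0x7]
    return -value if code & 0x8 else value


def quantize_block(block):
    """Quantize one 16-element block; return (8 packed bytes, float scale)."""
    assert len(block) == SF_VEC_SIZE
    amax = max(abs(v) for v in block)
    # Simplification: the scale stays a Python float (no E4M3 rounding, no global scale).
    scale = amax / 6.0 if amax > 0 else 1.0  # maps the block maximum to 6.0
    codes = [encode_e2m1(v / scale) for v in block]
    # Assumption: the even-indexed element goes in the low nibble of each byte.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, SF_VEC_SIZE, 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Unpack two E2M1 values per byte and rescale them."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

Sixteen input values become eight bytes plus one scale, which is why the packed output has K/2 columns and the scale tensor has K/16 columns.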
The kernel is compiled once per (K, dtype, sf_layout, pdl) tuple and handles varying M (batch size) at runtime without recompilation.
- Parameters:
  - input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/float8_e4m3fn.
  - global_scale (torch.Tensor) – Scalar tensor (float32) for the NVFP4 global scale factor.
  - sf_layout (int) – Scale-factor layout (0 = 128x4, 1 = 8x4, 2 = linear).
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.
- Returns:
  (fp4_tensor, scale_tensor), where fp4_tensor is the quantized tensor of shape [M, K/2] with dtype uint8 and scale_tensor holds the E4M3 scale factors (uint8) reshaped to [padded_rows, K/16].
- Return type:
  Tuple[torch.Tensor, torch.Tensor]
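
A minimal usage sketch, based only on the signature and return shapes documented above, might look like the following. Several details are assumptions not stated on this page: the helper names expected_shapes and quantize_example are hypothetical, padded_rows is assumed to round M up to the layout's row tile (128, 8, or no padding for linear), K is assumed to be a multiple of 16, and the global scale is computed with the common NVFP4 convention (448 * 6) / amax. The call itself needs a CUDA device with FlashInfer installed, so it is wrapped in a function with deferred imports.

```python
def expected_shapes(m, k, sf_layout=0):
    """Return (fp4_shape, scale_shape) as described in the Returns section.

    Assumption: padded_rows rounds M up to the row tile of the chosen layout.
    """
    if sf_layout not in (0, 1, 2):
        raise ValueError("sf_layout must be 0 (128x4), 1 (8x4), or 2 (linear)")
    if k % 16 != 0:
        raise ValueError("K is assumed to be a multiple of sf_vec_size = 16")
    row_tile = {0: 128, 1: 8, 2: 1}[sf_layout]
    padded_rows = -(-m // row_tile) * row_tile  # ceiling to a multiple of row_tile
    return (m, k // 2), (padded_rows, k // 16)


def quantize_example(m=256, k=4096, sf_layout=0):
    """Quantize a random bf16 tensor; requires a CUDA GPU and FlashInfer."""
    import torch
    from flashinfer.quantization.kernels.nvfp4_quantize import (
        nvfp4_quantize_cute_dsl,
    )

    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    # Assumption: 448 is the E4M3 maximum and 6 is the E2M1 maximum, so this
    # scale maps the tensor-wide amax onto the largest representable product.
    global_scale = ((448.0 * 6.0) / x.float().abs().max()).to(torch.float32)

    # enable_pdl is left at None so it is auto-detected from the device.
    fp4_tensor, scale_tensor = nvfp4_quantize_cute_dsl(
        x, global_scale, sf_layout=sf_layout
    )
    return fp4_tensor, scale_tensor
```

Comparing the shapes returned by quantize_example against expected_shapes(m, k, sf_layout) is a quick way to check whether the padding assumption holds for a given FlashInfer version.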