flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl

flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl(input: Tensor, global_scale: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]

Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.

GPU implementation matching flashinfer.quantization.nvfp4_quantize():

E4M3 scale factors (FP8)
E2M1 output format (4-bit, 2 values per byte)
Supports 128x4, 8x4, and linear scale-factor layouts
sf_vec_size = 16
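
To make the "4-bit, 2 values per byte" packing concrete, here is a minimal pure-Python sketch of quantizing one block of 16 values to E2M1 codes. It is an illustration under stated assumptions, not the kernel's implementation: the helper names `encode_e2m1` and `pack_block` are hypothetical, the block scale is kept as a plain float rather than being rounded to E4M3 and combined with the global scale, ties round toward the smaller magnitude, and the first value of each pair is assumed to occupy the low nibble.

```python
# E2M1 magnitudes; the list index equals the 3-bit exponent/mantissa pattern.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
SF_VEC_SIZE = 16  # values that share one scale factor


def encode_e2m1(x: float) -> int:
    """Return a 4-bit code: sign in bit 3, nearest E2M1 magnitude in bits 0-2."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), E2M1_VALUES[-1])  # saturate at the largest magnitude, 6.0
    # Assumption: ties resolve to the smaller magnitude (hardware may differ).
    idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
    return sign | idx


def pack_block(values):
    """Quantize one block of 16 floats; return (packed_bytes, block_scale)."""
    if len(values) != SF_VEC_SIZE:
        raise ValueError("expected exactly 16 values per block")
    amax = max(abs(v) for v in values)
    # Map the block's largest magnitude onto 6.0; an all-zero block uses scale 1.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumption: the even-indexed value goes in the low nibble of each byte.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, SF_VEC_SIZE, 2))
    return packed, scale


packed, scale = pack_block([6.0, -6.0, 3.0, 0.0] + [0.0] * 12)
print(packed.hex(), scale)
```

Each block of 16 inputs yields 8 packed bytes plus one scale factor, which is where the K/2 and K/16 column counts in the return shapes come from.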

The kernel is compiled once per (K, dtype, sf_layout, pdl) tuple and handles varying M (batch size) at runtime without recompilation.
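
The caching behavior described above can be pictured with a small sketch. This is not FlashInfer's actual cache; `get_kernel` and the returned `launch` callable are hypothetical stand-ins that only show the idea: the cache key excludes M, so calls with different batch sizes reuse the same compiled object.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_kernel(k: int, dtype: str, sf_layout: int, enable_pdl: bool):
    """Hypothetical stand-in for 'compile a kernel specialized on this key'."""

    def launch(m: int):
        # M is a runtime argument: it never enters the cache key.
        # This stand-in just reports the packed output shape [M, K/2].
        return (m, k // 2)

    return launch


small = get_kernel(4096, "bf16", 0, True)
large = get_kernel(4096, "bf16", 0, True)  # cache hit: same object returned
print(small is large, small(1), large(128))
print(get_kernel.cache_info())
```

Changing any element of the key (for example a different K or `sf_layout`) would produce a new cache entry, which corresponds to a fresh compilation in the real kernel.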

Parameters:

input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/float8_e4m3fn.
global_scale (torch.Tensor) – Scalar tensor (float32) for the NVFP4 global scale factor.
sf_layout (int) – Scale-factor layout (0=128x4, 1=8x4, 2=linear).
enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. When None, this is auto-detected from the device's compute capability and enabled for SM >= 9.0.
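
The `enable_pdl` default can be read as the following rule. This is a sketch of the documented behavior only; `resolve_enable_pdl` is a hypothetical helper, not a FlashInfer function, and it takes the compute capability as a plain `(major, minor)` tuple so it runs without a GPU.

```python
from typing import Optional, Tuple


def resolve_enable_pdl(enable_pdl: Optional[bool], capability: Tuple[int, int]) -> bool:
    """Return the explicit setting if given, else enable PDL on SM >= 9.0."""
    if enable_pdl is not None:
        return enable_pdl  # an explicit True/False always wins
    major, _minor = capability
    return major >= 9


print(resolve_enable_pdl(None, (9, 0)))   # auto-detected on an SM 9.0 device
print(resolve_enable_pdl(None, (8, 9)))   # auto-detected on an SM 8.9 device
print(resolve_enable_pdl(False, (10, 0))) # explicit override
```

In practice the tuple could come from `torch.cuda.get_device_capability(device)`, which returns `(major, minor)` for a CUDA device.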

Returns:

(fp4_tensor, scale_tensor), where fp4_tensor is the quantized tensor of shape [M, K/2] with dtype uint8 (two E2M1 values packed per byte) and scale_tensor holds the E4M3 scale factors, stored as uint8 and reshaped to [padded_rows, K/16].

Return type:

Tuple[torch.Tensor, torch.Tensor]
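
As a quick sanity check on the documented return shapes, the sketch below computes them from M, K, and `sf_layout`. The helper `expected_shapes` is hypothetical, and the row padding rule is an assumption inferred from the layout names (rows rounded up to a multiple of 128 for the 128x4 layout, 8 for 8x4, and no padding for linear); the docstring itself only says `padded_rows`.

```python
def expected_shapes(m: int, k: int, sf_layout: int = 0, sf_vec_size: int = 16):
    """Return (fp4_shape, scale_shape) following the documented layout."""
    if k % sf_vec_size != 0:
        raise ValueError("K must be a multiple of sf_vec_size (16)")
    # Assumption: row padding granularity inferred from the layout names.
    row_multiple = {0: 128, 1: 8, 2: 1}[sf_layout]
    padded_rows = -(-m // row_multiple) * row_multiple  # ceiling to a multiple
    fp4_shape = (m, k // 2)                      # two 4-bit values per uint8
    scale_shape = (padded_rows, k // sf_vec_size)  # one scale per 16 values
    return fp4_shape, scale_shape


for layout in (0, 1, 2):
    print(layout, expected_shapes(100, 4096, layout))
```

On a CUDA device, the shapes of the tensors returned by `nvfp4_quantize_cute_dsl(input, global_scale, sf_layout)` could be compared against this helper to confirm the padding assumption for a given layout.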