flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl

flashinfer.quantization.kernels.nvfp4_quantize.nvfp4_quantize_cute_dsl(input: Tensor, global_scale: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]

Quantize input tensor to NVFP4 format using the CuTe-DSL kernel.

GPU implementation matching flashinfer.quantization.nvfp4_quantize():

  • E4M3 scale factors (FP8)

  • E2M1 output format (4-bit, 2 values per byte)

  • Supports 128x4, 8x4, and linear scale-factor layouts

  • sf_vec_size = 16
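
To make "4-bit, 2 values per byte" concrete, here is a minimal pure-Python sketch of E2M1 encoding and packing. It is illustrative only and is not the kernel's code: the helper names `encode_e2m1` and `pack_pairs` are hypothetical, the nibble order (first value in the low nibble) is an assumption, and ties are resolved toward the smaller magnitude rather than by the hardware rounding mode.

```python
# The eight magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
# The list index equals the 3-bit magnitude code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
SF_VEC_SIZE = 16  # elements that share one scale factor


def encode_e2m1(x: float) -> int:
    """Return the 4-bit code (sign bit | 3-bit magnitude) closest to x."""
    mag = min(abs(x), 6.0)  # saturate at the largest representable magnitude
    index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    sign = 0x8 if x < 0 else 0x0
    return sign | index


def pack_pairs(values) -> bytes:
    """Pack an even-length sequence of floats, two 4-bit codes per byte.

    Assumption: the first value of each pair goes in the low nibble.
    """
    codes = [encode_e2m1(v) for v in values]
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


# One 16-element block (already divided by its scale) packs into 8 bytes.
block = [0.5, -6.0] + [0.0] * (SF_VEC_SIZE - 2)
packed = pack_pairs(block)
```

Each block of 16 values therefore occupies 8 bytes of FP4 data plus one E4M3 scale byte.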

The kernel is compiled once per (K, dtype, sf_layout, pdl) tuple and handles varying M (batch size) at runtime without recompilation.
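
The caching behaviour described above can be sketched with `functools.lru_cache`. This is a hypothetical stand-in, not the library's actual caching code: `get_kernel`, `compile_log`, and the returned closure are invented for illustration. The point is only that M is passed at call time and never enters the cache key.

```python
from functools import lru_cache

compile_log = []  # records each simulated compilation, for illustration


@lru_cache(maxsize=None)
def get_kernel(k: int, dtype: str, sf_layout: int, pdl: bool):
    """Hypothetical stand-in for compilation; runs once per unique key."""
    compile_log.append((k, dtype, sf_layout, pdl))

    def kernel(m: int):
        # M is a runtime argument; here we just return the FP4 output shape.
        return (m, k // 2)

    return kernel


# Three different batch sizes reuse one compiled kernel for K = 4096.
for m in (1, 32, 4096):
    get_kernel(4096, "bfloat16", 0, True)(m)

# Changing K (or dtype, sf_layout, pdl) triggers a second compilation.
get_kernel(8192, "bfloat16", 0, True)(16)
```

Under this model, serving workloads with a fixed hidden size K and a varying batch size M pay the compilation cost only once.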

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype float16, bfloat16, or float8_e4m3fn.

  • global_scale (torch.Tensor) – Scalar tensor (float32) for the NVFP4 global scale factor.

  • sf_layout (int) – Scale-factor layout (0=128x4, 1=8x4, 2=linear).

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch (PDL). When None, it is auto-detected from the device compute capability (SM >= 9.0).

Returns:

A tuple (fp4_tensor, scale_tensor). fp4_tensor is the quantized tensor of shape [M, K/2] with dtype uint8 (two E2M1 values per byte). scale_tensor holds the E4M3 scale factors, stored as uint8 and reshaped to [padded_rows, K/16].

Return type:

Tuple[torch.Tensor, torch.Tensor]
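
A minimal usage sketch follows, assuming a CUDA device with torch and flashinfer installed. The helpers `expected_shapes` and `quantize_example` are hypothetical names introduced here. The `global_scale` formula, (448 * 6) / amax, is a common NVFP4 convention and an assumption, not something this page specifies. The scale tensor's row count is not checked because the padding depends on `sf_layout`.

```python
def expected_shapes(m: int, k: int, sf_vec_size: int = 16):
    """Return the documented FP4 shape and scale-column count for [M, K]."""
    if k % sf_vec_size != 0:
        # Assumption: K must be a whole number of 16-element blocks.
        raise ValueError("K is assumed to be a multiple of sf_vec_size")
    return (m, k // 2), k // sf_vec_size


def quantize_example(m: int = 128, k: int = 4096):
    """Quantize a random bf16 tensor; needs a CUDA GPU, torch, and flashinfer."""
    import torch
    from flashinfer.quantization.kernels.nvfp4_quantize import (
        nvfp4_quantize_cute_dsl,
    )

    x = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    # Assumed convention: map the tensor-wide amax to the top of the
    # E4M3 (448) x E2M1 (6) range. Result is a float32 scalar tensor.
    global_scale = (448.0 * 6.0) / x.float().abs().max()

    fp4, sf = nvfp4_quantize_cute_dsl(x, global_scale, sf_layout=0)

    fp4_shape, _sf_cols = expected_shapes(m, k)
    assert tuple(fp4.shape) == fp4_shape and fp4.dtype == torch.uint8
    return fp4, sf
```

Calling `quantize_example()` on a supported GPU returns the packed FP4 tensor and its scale factors; leaving `enable_pdl` unset lets the function auto-detect PDL support.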