flashinfer.fp4_quantization.nvfp4_quantize

flashinfer.fp4_quantization.nvfp4_quantize(a, a_global_sf, sfLayout=SfLayout.layout_128x4, do_shuffle=False, sf_vec_size=16, enable_pdl=None, backend: str = 'cuda')

Quantize input tensor to NVFP4 format.

Parameters:
  • a (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/float8_e4m3fn.

  • a_global_sf (torch.Tensor) – Global scale factor of shape [1] with dtype float32.

  • sfLayout (SfLayout, optional) – Scale factor layout. Defaults to SfLayout.layout_128x4.

  • do_shuffle (bool, optional) – Whether to shuffle the scale factors. Defaults to False. Shuffling is only required for the tensor-B scale factors when using the TRTLLM backend.

  • sf_vec_size (int, optional) – Scale factor vector size. Defaults to 16.

  • enable_pdl (Optional[bool], optional) – Whether to enable PDL (Programmatic Dependent Launch). If None, automatically detects based on device capability. Defaults to None.

  • backend (str, optional) – Backend to use for quantization. Defaults to “cuda”.

    - “cuda”: CUDA kernel (default, stable).

    - “cute-dsl”: CuTe-DSL kernel (requires SM100+, experimental). Supports all sfLayout values (layout_128x4, layout_8x4, layout_linear) and input dtypes fp16, bf16, and float8_e4m3fn, but only sf_vec_size=16.

Returns:

A tuple containing:
  • Quantized tensor of shape [M, K/2] with dtype FLOAT4_E2M1X2 (two 4-bit values packed per byte)

  • Scale factors tensor with shape determined by layout and sf_vec_size

Return type:

Tuple[torch.Tensor, torch.Tensor]

Warning

The “cute-dsl” backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.
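The [M, K/2] packed output and per-block scale factors can be illustrated with a small NumPy reference. This is an illustrative sketch of NVFP4-style block quantization, not the flashinfer kernel: it stores per-block scales as float32 rather than float8_e4m3fn, ignores a_global_sf, and emits scales in linear layout rather than the swizzled 128x4 layout. The helper names (nvfp4_quantize_ref, nvfp4_dequantize_ref) are hypothetical.

```python
# Illustrative sketch of NVFP4-style block quantization (NOT the flashinfer
# kernel). Each 16-element block gets one scale; values map to the E2M1 grid
# {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit; two 4-bit codes pack per byte.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_ref(a: np.ndarray, sf_vec_size: int = 16):
    """Quantize [M, K] floats to packed 4-bit codes [M, K//2] + per-block scales."""
    M, K = a.shape
    assert K % sf_vec_size == 0 and K % 2 == 0
    blocks = a.reshape(M, K // sf_vec_size, sf_vec_size)
    # Per-block scale maps the block amax onto the largest E2M1 magnitude (6.0).
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / 6.0, 1.0)
    scaled = blocks / scale
    # Round each magnitude to the nearest E2M1 grid point; keep the sign bit.
    mags = np.abs(scaled)
    idx = np.abs(mags[..., None] - E2M1_GRID).argmin(axis=-1)
    sign = (scaled < 0).astype(np.uint8)
    codes = ((sign << 3) | idx.astype(np.uint8)).reshape(M, K)
    # Pack two 4-bit codes per byte -> [M, K//2], mirroring FLOAT4_E2M1X2.
    packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale.reshape(M, K // sf_vec_size).astype(np.float32)

def nvfp4_dequantize_ref(packed: np.ndarray, scales: np.ndarray,
                         sf_vec_size: int = 16) -> np.ndarray:
    """Reverse of nvfp4_quantize_ref: unpack nibbles and rescale per block."""
    M, K = packed.shape[0], packed.shape[1] * 2
    codes = np.empty((M, K), dtype=np.uint8)
    codes[:, 0::2] = packed & 0xF
    codes[:, 1::2] = packed >> 4
    mag = E2M1_GRID[codes & 0x7]
    sgn = np.where(codes >> 3, -1.0, 1.0)
    vals = (sgn * mag).reshape(M, K // sf_vec_size, sf_vec_size)
    return (vals * scales[..., None]).reshape(M, K)
```

Values that land exactly on the scaled E2M1 grid round-trip losslessly; everything else is rounded to the nearest grid point, which is the source of FP4 quantization error.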