flashinfer.quantization.fp4_quantize

flashinfer.quantization.fp4_quantize(input: Tensor, global_scale: Tensor | None = None, sf_vec_size: int = 16, sf_use_ue8m0: bool = False, is_sf_swizzled_layout: bool = True, is_sf_8x4_layout: bool = False, is_global_scale_inversed: bool = False, enable_pdl: bool | None = None, backend: str = 'cuda') → Tuple[Tensor, Tensor]

Quantize input tensor to FP4 format.

Converts the input tensor to packed FP4 (E2M1) values plus the associated scale factors, with one scale factor per group of sf_vec_size elements. Supports several input data types and scale-factor layouts, covering both the NVFP4 and MXFP4 quantization recipes.
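
As a rough illustration of what quantizing with per-group scale factors means, the following pure-Python sketch quantizes a row of floats in groups of sf_vec_size. It is a conceptual model only and does not reproduce FlashInfer's kernels: the names quantize_block and quantize_row are hypothetical, the scale is kept as a plain float (the real recipes store it in a narrow format and may also apply global_scale), and ties are broken toward the smaller magnitude for simplicity.

```python
# Conceptual sketch only; not FlashInfer's implementation.
# Magnitudes representable by an FP4 E2M1 value (sign is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Quantize one group of floats to E2M1 values sharing a single scale."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the group onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in block:
        target = abs(v) / scale
        # Nearest representable magnitude; on a tie, min() keeps the smaller one.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - target))
        quantized.append(nearest if v >= 0 else -nearest)
    return quantized, scale


def quantize_row(row, sf_vec_size=16):
    """Quantize a row in consecutive groups of sf_vec_size elements."""
    assert len(row) % sf_vec_size == 0, "row length must be a multiple of sf_vec_size"
    values, scales = [], []
    for start in range(0, len(row), sf_vec_size):
        q, s = quantize_block(row[start:start + sf_vec_size])
        values.extend(q)
        scales.append(s)
    return values, scales
```

Multiplying each quantized value by the scale of its group gives the dequantized approximation of the original row.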

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/fp8_quantized.

  • global_scale (torch.Tensor, optional) – Global scale factor of shape [1] and dtype float32.

  • sf_vec_size (int) – Scale factor vector size, i.e. the number of elements that share one scale factor (16 for NVFP4, 32 for MXFP4). Defaults to 16.

  • sf_use_ue8m0 (bool) – Whether to use UE8M0 format for scale factors. Defaults to False.

  • is_sf_swizzled_layout (bool) – Whether to use the swizzled layout for scale factors. Defaults to True.

  • is_sf_8x4_layout (bool) – Use the 8x4 swizzled layout instead of 128x4. Defaults to False.

  • is_global_scale_inversed (bool) – When True, global_scale is interpreted as the inverse scale. Defaults to False.

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability when None.

  • backend (str) –

    Backend to use for quantization:

    • "cuda": stable CUDA kernel (default).

    • "cute-dsl": CuTe-DSL kernel (SM100+, experimental). Supported combinations:

      • sf_vec_size=16, sf_use_ue8m0=False: all layouts, fp16/bf16/fp8 (NVFP4).

      • sf_vec_size=32, sf_use_ue8m0=True: all layouts, fp16/bf16 (MXFP4).
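
The "cute-dsl" combinations listed above can be checked before calling the function. The helper below is a hypothetical sketch, not part of FlashInfer: the name cute_dsl_supports and the dtype strings are assumptions chosen for illustration, and it does not check the SM100+ device requirement.

```python
# Hypothetical helper mirroring the documented "cute-dsl" combinations.
# Keys are (sf_vec_size, sf_use_ue8m0); values are the accepted input dtypes.
CUTE_DSL_COMBINATIONS = {
    (16, False): {"fp16", "bf16", "fp8"},  # NVFP4
    (32, True): {"fp16", "bf16"},          # MXFP4
}


def cute_dsl_supports(sf_vec_size, sf_use_ue8m0, input_dtype):
    """Return True if the documented "cute-dsl" table lists this combination."""
    allowed = CUTE_DSL_COMBINATIONS.get((sf_vec_size, sf_use_ue8m0))
    return allowed is not None and input_dtype in allowed
```

A caller could use such a check to choose between backend="cute-dsl" and backend="cuda" up front instead of relying on the ValueError.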

Returns:

(x_q, sf) where x_q has shape [M, K/2] with dtype FLOAT4_E2M1X2 and sf is the scale-factor tensor whose shape depends on the layout and sf_vec_size.
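
The shape arithmetic can be sketched in a few lines. The x_q shape follows directly from the description above; the scale-factor counts are an illustration only. In particular, the padded count assumes that a swizzled layout stores whole tiles (128x4 or 8x4 scale factors), so both dimensions are rounded up to a tile multiple. That padding rule is inferred from the layout names and is not stated in this reference. All function names are hypothetical.

```python
def fp4_output_shape(m, k):
    """Shape of x_q: two FP4 values are packed into each output element."""
    assert k % 2 == 0, "K must be even to pack two FP4 values per element"
    return (m, k // 2)


def logical_scale_factor_count(m, k, sf_vec_size=16):
    """Scale factors needed before any layout padding: one per group."""
    assert k % sf_vec_size == 0, "K must be a multiple of sf_vec_size"
    return m * (k // sf_vec_size)


def _round_up(value, multiple):
    return (value + multiple - 1) // multiple * multiple


def padded_scale_factor_count(m, k, sf_vec_size=16, row_tile=128, col_tile=4):
    """ASSUMPTION: swizzled layouts pad rows and columns to whole tiles.

    Use row_tile=128 for the 128x4 layout and row_tile=8 for the 8x4 layout.
    """
    assert k % sf_vec_size == 0, "K must be a multiple of sf_vec_size"
    rows = _round_up(m, row_tile)
    cols = _round_up(k // sf_vec_size, col_tile)
    return rows * cols
```

In practice, the shape of the returned sf tensor is the authoritative answer; the sketch is only meant to show why it varies with the layout and sf_vec_size.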

Return type:

Tuple[torch.Tensor, torch.Tensor]
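
A minimal usage sketch with the default "cuda" backend and NVFP4 settings follows. The call itself uses only the parameters documented above. The way global_scale is derived, (448 * 6) / amax (the FP8 E4M3 maximum times the FP4 E2M1 maximum, divided by the tensor's largest magnitude), is an assumed convention and is not specified in this reference. The helper names are hypothetical, and the imports are kept inside the function so the file can be loaded on a machine without PyTorch, FlashInfer, or a supported GPU.

```python
def global_scale_from_amax(amax):
    """ASSUMED convention: (FP8 E4M3 max * FP4 E2M1 max) / amax. Requires amax > 0."""
    return (448.0 * 6.0) / amax


def quantize_nvfp4(x):
    """Quantize a CUDA tensor of shape [M, K] (fp16/bf16) with NVFP4 settings."""
    # Local imports: both packages are only needed when the function is called.
    import torch
    from flashinfer.quantization import fp4_quantize

    amax = x.float().abs().max().item()
    # global_scale is documented as shape [1], dtype float32.
    global_scale = torch.tensor(
        [global_scale_from_amax(amax)], dtype=torch.float32, device=x.device
    )
    x_q, sf = fp4_quantize(
        x,
        global_scale,
        sf_vec_size=16,
        sf_use_ue8m0=False,
        is_sf_swizzled_layout=True,
    )
    return x_q, sf
```

On a supported GPU, calling quantize_nvfp4(torch.randn(128, 256, dtype=torch.bfloat16, device="cuda")) should return x_q with shape [128, 128], per the Returns description.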

Raises:
  • NotImplementedError – If the requested feature is not implemented (e.g. BFloat16 input when BFloat16 is not enabled, FP8 input when FP8 is not enabled, or sf_vec_size other than 16 or 32).

  • ValueError – If the "cute-dsl" backend is requested for an unsupported parameter combination.
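
Because the "cute-dsl" backend raises ValueError for unsupported parameter combinations, a caller may want to fall back to the stable "cuda" backend. The wrapper below is a hypothetical sketch of that pattern. It takes the quantization function as an argument (for example fp4_quantize), so it can be exercised without a GPU, and it lets NotImplementedError propagate unchanged.

```python
def quantize_with_fallback(quantize_fn, x, **kwargs):
    """Try the experimental backend first, then fall back to the stable one.

    quantize_fn is expected to accept a `backend` keyword, as fp4_quantize does.
    """
    # The wrapper controls `backend`, so drop any caller-supplied value.
    kwargs.pop("backend", None)
    try:
        return quantize_fn(x, backend="cute-dsl", **kwargs)
    except ValueError:
        # Documented failure mode for unsupported "cute-dsl" combinations.
        return quantize_fn(x, backend="cuda", **kwargs)
```

With FlashInfer installed, this would be called as quantize_with_fallback(fp4_quantize, x, global_scale=global_scale, sf_vec_size=16).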

Warning

The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.