flashinfer.fp4_quantization.fp4_quantize

flashinfer.fp4_quantization.fp4_quantize(input: Tensor, global_scale: Tensor | None = None, sf_vec_size: int = 16, sf_use_ue8m0: bool = False, is_sf_swizzled_layout: bool = True, is_sf_8x4_layout: bool = False, enable_pdl: bool | None = None, backend: str = 'cuda') → Tuple[Tensor, Tensor]

Quantize input tensor to FP4 format.

This function converts the input tensor to a packed FP4 representation and returns it together with the associated scale factors. It supports several input data types and scale factor layouts.
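The idea behind block-scaled FP4 quantization can be sketched in plain Python. This is a conceptual illustration only, not the kernel used by the library: the scale is kept as an ordinary float rather than an FP8 or UE8M0 value, no global scale is applied, and the nibble order used when packing is an assumption. The helper names `quantize_block`, `pack_codes`, and `dequantize_block` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
# The list index equals the 3-bit magnitude code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block of floats to (scale, list of 4-bit codes)."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the lower index here,
        # which may differ from the rounding mode of a real kernel.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | idx)
    return scale, codes


def pack_codes(codes):
    """Pack pairs of 4-bit codes into bytes.

    Assumption: the first code of each pair goes into the low nibble.
    """
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and 4-bit codes."""
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * scale for c in codes]
```

Packing two 4-bit codes per byte is why a block of 16 values occupies 8 bytes, which matches the halved last dimension of the quantized output.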

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/fp8_quantized.

  • global_scale (torch.Tensor, optional) – Global scale factor of shape [1] and dtype float32.

  • sf_vec_size (int, optional) – Scale factor vector size. Defaults to 16.

  • sf_use_ue8m0 (bool, optional) – Whether to use UE8M0 format for scale factors. Defaults to False.

  • is_sf_swizzled_layout (bool, optional) – Whether to use swizzled layout for scale factors. Defaults to True.

  • is_sf_8x4_layout (bool, optional) – Whether to use the 8x4 layout (True) or the 128x4 layout (False) for scale factors. Defaults to False.

  • enable_pdl (Optional[bool], optional) – Whether to enable PDL (Programmatic Dependent Launch). If None, automatically detects based on device capability. Defaults to None.

  • backend (str, optional) –

    Backend to use for quantization.
      • “cuda”: Use CUDA kernel (default, stable).
      • “cute-dsl”: Use CuTe-DSL kernel (requires SM100+, experimental).

    Supported combinations:
      • sf_vec_size=16, sf_use_ue8m0=False: all layouts, fp16/bf16/fp8 (NVFP4)
      • sf_vec_size=32, sf_use_ue8m0=True: 128x4 swizzled and linear, fp16/bf16 (MXFP4)
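The supported combinations listed for the backend parameter can be written down as a small lookup table, which is handy for validating arguments before launching a kernel. The sketch below is a hypothetical helper, not part of the library. It assumes that "all layouts" means the 128x4 swizzled, 8x4, and linear layouts, and that `is_sf_8x4_layout` only matters when `is_sf_swizzled_layout` is True; the layout and dtype labels are names chosen for this example.

```python
# Layout labels are names chosen for this sketch, not library constants.
ALL_LAYOUTS = {"128x4", "8x4", "linear"}

# (sf_vec_size, sf_use_ue8m0) -> (layouts, input dtypes, format name),
# transcribed from the documented supported combinations.
SUPPORTED = {
    (16, False): (ALL_LAYOUTS, {"fp16", "bf16", "fp8"}, "NVFP4"),
    (32, True): ({"128x4", "linear"}, {"fp16", "bf16"}, "MXFP4"),
}


def layout_name(is_sf_swizzled_layout=True, is_sf_8x4_layout=False):
    """Map the two layout flags to a label (assumed interpretation)."""
    if not is_sf_swizzled_layout:
        return "linear"
    return "8x4" if is_sf_8x4_layout else "128x4"


def documented_format(sf_vec_size, sf_use_ue8m0,
                      is_sf_swizzled_layout=True,
                      is_sf_8x4_layout=False,
                      dtype="bf16"):
    """Return "NVFP4" or "MXFP4" for a documented combination, else None."""
    entry = SUPPORTED.get((sf_vec_size, sf_use_ue8m0))
    if entry is None:
        return None
    layouts, dtypes, name = entry
    layout = layout_name(is_sf_swizzled_layout, is_sf_8x4_layout)
    if layout in layouts and dtype in dtypes:
        return name
    return None
```

A return value of None only means the combination is not in the documented list; it does not predict which exception, if any, the library raises.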

Returns:

A tuple containing:
  • Quantized tensor of shape [M, K/2] with dtype FLOAT4_E2M1X2

  • Scale factors tensor with shape determined by layout and sf_vec_size
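The documented output shape can be checked with a little arithmetic. The sketch below is a hypothetical helper that computes the packed shape [M, K/2] and the logical number of scale factors per row, assuming one scale factor per block of `sf_vec_size` elements and K divisible by `sf_vec_size`. It deliberately does not compute the shape of the scale factor tensor, since that depends on the chosen layout.

```python
def fp4_output_sizes(m, k, sf_vec_size=16):
    """Return ((M, K/2), scale factors per row) for an [M, K] input.

    The divisibility checks are requirements of this sketch, not
    documented library behavior.
    """
    if k % 2 != 0:
        raise ValueError("K must be even to pack two FP4 values per element")
    if k % sf_vec_size != 0:
        raise ValueError("K must be a multiple of sf_vec_size in this sketch")
    packed_shape = (m, k // 2)        # two 4-bit values per packed element
    sf_per_row = k // sf_vec_size     # one scale factor per block
    return packed_shape, sf_per_row
```

For example, a [128, 64] input with `sf_vec_size=16` gives a packed shape of (128, 32) and 4 logical scale factors per row, before any layout-specific arrangement.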

Return type:

Tuple[torch.Tensor, torch.Tensor]
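A minimal call might look like the sketch below. It uses only the parameters from the signature above and needs a CUDA device with FlashInfer installed, so the imports are placed inside the function. The global scale of 1.0 is a placeholder assumption; choosing a suitable value for real data is outside the scope of this example. The function name `quantize_example` is hypothetical.

```python
def quantize_example(m=128, k=64):
    """Quantize a random bf16 matrix to FP4 (requires a CUDA GPU)."""
    # Local imports so this module can be loaded without torch or flashinfer.
    import torch
    from flashinfer.fp4_quantization import fp4_quantize

    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")

    # Documented as shape [1], dtype float32. The value 1.0 is a
    # placeholder chosen for this sketch.
    global_scale = torch.ones(1, dtype=torch.float32, device="cuda")

    x_fp4, x_sf = fp4_quantize(
        x,
        global_scale,
        sf_vec_size=16,
        sf_use_ue8m0=False,
        is_sf_swizzled_layout=True,
    )

    # The quantized tensor is documented to have shape [M, K/2].
    assert x_fp4.shape == (m, k // 2)
    return x_fp4, x_sf
```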

Raises:
  • NotImplementedError – If any of the following features are requested but not implemented:
      • BFloat16 input when BFloat16 is not enabled
      • FP8 input when FP8 is not enabled
      • sf_vec_size other than 16 or 32

  • ValueError – If the “cute-dsl” backend is requested for an unsupported parameter combination.

Warning

The “cute-dsl” backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.
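Because the “cute-dsl” backend is experimental and raises ValueError for unsupported parameter combinations, callers may want to fall back to the stable “cuda” backend. The sketch below shows one way to do that. The wrapper `quantize_with_fallback` is hypothetical, and it takes the quantization function as an argument so the pattern can be exercised without a GPU. It handles only the documented ValueError; other failures, such as running on hardware older than SM100, are not covered.

```python
def quantize_with_fallback(quantize_fn, x, **kwargs):
    """Try the "cute-dsl" backend first, then fall back to "cuda".

    quantize_fn is expected to accept a `backend` keyword, as
    fp4_quantize does according to its documented signature.
    """
    # Drop any caller-supplied backend so it is not passed twice.
    kwargs.pop("backend", None)
    try:
        return quantize_fn(x, backend="cute-dsl", **kwargs)
    except ValueError:
        # Documented error for unsupported cute-dsl parameter combinations.
        return quantize_fn(x, backend="cuda", **kwargs)
```

With FlashInfer available, this could be called as `quantize_with_fallback(fp4_quantize, x, global_scale=global_scale)`.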