flashinfer.quantization.mxfp8_quantize

flashinfer.quantization.mxfp8_quantize(input: Tensor, is_sf_swizzled_layout: bool = True, alignment: int = 32, enable_pdl: bool | None = None, backend: Literal['cuda', 'cute-dsl'] = 'cuda', sf_swizzle_layout: SfLayout | None = None) → Tuple[Tensor, Tensor]

Quantize input tensor to MxFP8 format.

Quantizes the input tensor to FP8 (FLOAT8_E4M3) values and returns them together with a tensor of scale factors. The input may be any dtype listed under Parameters, and the scale factors can be emitted in either a swizzled or a non-swizzled layout.
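To make the block-scaling idea concrete, the following is a minimal pure-Python sketch for a single row. It assumes the convention described in the OCP Microscaling (MX) specification: each group of 32 values shares one power-of-two scale, chosen so that the group's largest magnitude falls in the top E4M3 binade. It is not FlashInfer's kernel. The helper names mx_block_exponents and mx_scale_row are hypothetical, rounding to FP8 precision is omitted, and the kernel's exact scale-rounding rule may differ.

```python
import math

BLOCK = 32          # values per shared scale (sf_vec_size)
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8       # exponent of the top E4M3 binade (448 = 1.75 * 2**8)


def mx_block_exponents(row):
    """Return one power-of-two scale exponent per 32-element block (hypothetical helper)."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of 32")
    exps = []
    for start in range(0, len(row), BLOCK):
        amax = max(abs(v) for v in row[start:start + BLOCK])
        if amax == 0.0:
            exps.append(0)  # all-zero block: exponent is arbitrary in this sketch
        else:
            _, e = math.frexp(amax)            # amax = m * 2**e with 0.5 <= m < 1
            exps.append((e - 1) - E4M3_EMAX)   # floor(log2(amax)) - emax
    return exps


def mx_scale_row(row):
    """Divide each value by its block scale and saturate to the E4M3 range."""
    exps = mx_block_exponents(row)
    scaled = []
    for i, v in enumerate(row):
        s = math.ldexp(v, -exps[i // BLOCK])   # v / 2**exp, exact for finite floats
        scaled.append(max(-E4M3_MAX, min(E4M3_MAX, s)))
    return scaled, exps


# Two blocks: the first peaks at 1.0, the second at 1000.0.
row = [1.0] * 32 + [0.5] * 31 + [1000.0]
scaled, exps = mx_scale_row(row)
```

Multiplying each scaled value by 2 raised to its block exponent recovers the original value, except where saturation clipped it (here, 1000.0 becomes 448 * 2 = 896).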

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/fp8_quantized.

  • is_sf_swizzled_layout (bool) – Whether to use the swizzled layout for scale factors. Defaults to True.

  • alignment (int) – Scale-factor vector size (sfVecSize). Defaults to 32.

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.

  • backend ({"cuda", "cute-dsl"}) –

    Backend to use:

    • "cuda": stable JIT-compiled CUDA kernel (default).

    • "cute-dsl": CuTe-DSL kernel (SM100+, experimental).

  • sf_swizzle_layout (SfLayout, optional) – Swizzle layout for scale factors; when supplied this overrides is_sf_swizzled_layout.
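The auto-detection rule described for enable_pdl can be sketched as a small pure function. This is an illustration of the documented behavior only, not the library's internal code. The name resolve_enable_pdl is hypothetical, and the capability tuple is assumed to be in the (major, minor) form that torch.cuda.get_device_capability() returns.

```python
def resolve_enable_pdl(enable_pdl, capability):
    """Pick the effective PDL setting (hypothetical helper).

    enable_pdl: True, False, or None (None means auto-detect).
    capability: (major, minor) compute capability of the device.
    """
    if enable_pdl is not None:
        return enable_pdl          # an explicit choice always wins
    major, _minor = capability
    return major >= 9              # SM >= 9.0 per the parameter description


auto_hopper = resolve_enable_pdl(None, (9, 0))
auto_ada = resolve_enable_pdl(None, (8, 9))
forced_off = resolve_enable_pdl(False, (10, 0))
```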

Returns:

A tuple (x_q, sf): x_q is the quantized tensor of shape [M, K] with dtype FLOAT8_E4M3, and sf is the scale-factor tensor, whose shape depends on the chosen layout and on sf_vec_size (fixed at 32 here).

Return type:

Tuple[torch.Tensor, torch.Tensor]
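Example

A minimal usage sketch based only on the signature above. It assumes a CUDA-capable GPU with torch and flashinfer installed. The wrapper name quantize_demo and the matrix size are illustrative choices, and the imports are placed inside the function so that defining it has no side effects.

```python
def quantize_demo(m=128, k=4096):
    """Quantize a random bf16 matrix to MxFP8 (illustrative wrapper)."""
    import torch
    from flashinfer.quantization import mxfp8_quantize

    # Input of shape [M, K] in one of the supported dtypes.
    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")

    # Default settings spelled out: swizzled scale factors, stable CUDA backend.
    x_q, sf = mxfp8_quantize(x, is_sf_swizzled_layout=True, backend="cuda")
    return x_q, sf
```

Calling x_q, sf = quantize_demo() should give an x_q of shape [128, 4096], per the Returns section. The shape of sf depends on the scale-factor layout, so inspect sf.shape rather than assuming a value.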

Warning

The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.