flashinfer.quantization.kernels.mxfp8_quantize.mxfp8_quantize_cute_dsl

flashinfer.quantization.kernels.mxfp8_quantize.mxfp8_quantize_cute_dsl(input: Tensor, is_sf_swizzled_layout: bool = True, alignment: int = 32, enable_pdl: bool | None = None, is_sf_8x4_layout: bool = False) → Tuple[Tensor, Tensor]

Quantize input tensor to MXFP8 format using the CuTe-DSL kernel.

The GPU implementation takes one of two optimized paths, selected by the scale-factor (SF) layout:

  • LINEAR layout: SF-block based iteration (fast).

  • SWIZZLED layout: row-based iteration with a padding fast path.

The kernel is compiled once per (K, dtype, pdl, sf_layout) tuple and handles varying M (batch size) at runtime without recompilation.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.

  • is_sf_swizzled_layout (bool) – Whether the scale factors use a swizzled layout (True) or a linear layout (False). When True, the layout is 128x4 by default; pass is_sf_8x4_layout=True for 8x4.

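  • alignment (int) – Alignment for the K dimension; the output's padded_K is K rounded up to a multiple of this value. Defaults to 32 and must be a multiple of SF_VEC_SIZE (the number of elements that share one scale factor).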

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.

  • is_sf_8x4_layout (bool) – When is_sf_swizzled_layout is True, selects 8x4 swizzling instead of the default 128x4. Ignored for the linear layout.
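How the parameters above interact can be sketched as a shape calculation. The helper below is hypothetical and rests on assumptions not stated in this page: SF_VEC_SIZE is taken to be 32 (the MX block size), and the swizzled layouts are assumed to pad the scale-factor grid to whole 128x4 or 8x4 tiles. Treat the scale-factor counts as an estimate to check against the real output, not as the library's contract.

```python
SF_VEC_SIZE = 32  # assumption: elements sharing one scale factor (MX block size)


def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    return (x + multiple - 1) // multiple * multiple


def expected_shapes(m: int, k: int, is_sf_swizzled_layout: bool = True,
                    alignment: int = 32, is_sf_8x4_layout: bool = False):
    """Return ((M, padded_K), estimated number of scale factors)."""
    if alignment % SF_VEC_SIZE != 0:
        raise ValueError("alignment must be a multiple of SF_VEC_SIZE")

    padded_k = round_up(k, alignment)
    sf_cols = padded_k // SF_VEC_SIZE  # one scale factor per block along K

    if is_sf_swizzled_layout:
        # Assumption: rows and columns are padded to full swizzle tiles.
        row_tile = 8 if is_sf_8x4_layout else 128
        sf_rows = round_up(m, row_tile)
        sf_cols = round_up(sf_cols, 4)
    else:
        # is_sf_8x4_layout is ignored for the linear layout.
        sf_rows = m

    return (m, padded_k), sf_rows * sf_cols


fp8_shape, num_sf = expected_shapes(100, 4100)
```

For M = 100 and K = 4100, this gives padded_K = 4128 and 129 scale factors per row before any swizzle padding.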

Returns:

(fp8_tensor, scale_tensor), where fp8_tensor is the quantized tensor of shape [M, padded_K] with dtype float8_e4m3fn, and scale_tensor is the UE8M0 scale-factor tensor with dtype uint8.

Return type:

Tuple[torch.Tensor, torch.Tensor]
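To interpret the returned scale factors, the sketch below decodes a UE8M0 byte into its power-of-two scale and applies it to one block. It follows the E8M0 encoding from the OCP Microscaling specification (exponent bias 127, 0xFF reserved for NaN). The helper names are hypothetical, and the block values are assumed to be already-decoded e4m3 numbers held as Python floats; a swizzled scale tensor would also need to be un-swizzled first, which is not shown.

```python
import math

E8M0_BIAS = 127   # exponent bias of the E8M0 encoding
E8M0_NAN = 0xFF   # reserved NaN encoding


def ue8m0_to_float(byte: int) -> float:
    """Decode one UE8M0 scale byte into the power of two it represents."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("scale byte must be in [0, 255]")
    if byte == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, byte - E8M0_BIAS)  # 2 ** (byte - 127)


def dequantize_block(fp8_values, scale_byte: int):
    """Multiply already-decoded e4m3 values by their shared block scale."""
    scale = ue8m0_to_float(scale_byte)
    return [v * scale for v in fp8_values]


restored = dequantize_block([1.5, -2.0], 129)  # 129 encodes a scale of 2 ** 2
```

Because every scale is an exact power of two, applying it only shifts the exponent of each element and introduces no additional rounding in this sketch.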