flashinfer.quantization.kernels.mxfp8_quantize.mxfp8_quantize_cute_dsl
- flashinfer.quantization.kernels.mxfp8_quantize.mxfp8_quantize_cute_dsl(input: Tensor, is_sf_swizzled_layout: bool = True, alignment: int = 32, enable_pdl: bool | None = None, is_sf_8x4_layout: bool = False) → Tuple[Tensor, Tensor]
Quantize input tensor to MXFP8 format using the CuTe-DSL kernel.
GPU implementation with dual-path optimization:
LINEAR layout: SF-block based iteration (fast).
SWIZZLED layout: row-based iteration with a padding fast path.
The kernel is compiled once per (K, dtype, pdl, sf_layout) tuple and handles varying M (batch size) at runtime without recompilation.
- Parameters:
input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.
is_sf_swizzled_layout (bool) – Whether to use a swizzled layout (True) or linear (False). When True, the layout is 128x4 by default; pass is_sf_8x4_layout=True for 8x4.
alignment (int) – Alignment for the K dimension (default 32; must be a multiple of SF_VEC_SIZE).
enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.
is_sf_8x4_layout (bool) – When is_sf_swizzled_layout is True, selects 8x4 swizzling instead of the default 128x4. Ignored for the linear layout.
- Returns:
(fp8_tensor, scale_tensor), where fp8_tensor is the quantized tensor of shape [M, padded_K] with dtype float8_e4m3fn and scale_tensor is the UE8M0 scale-factor tensor (uint8).
- Return type:
Tuple[torch.Tensor, torch.Tensor]
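The following is a minimal usage sketch, not taken from the library's own documentation. It relies on three assumptions that the reference above does not confirm: that SF_VEC_SIZE is 32 (the MX block size), that padded_K is K rounded up to the next multiple of alignment, and that UE8M0 bytes decode as a power of two with bias 127. The helpers padded_k, ue8m0_to_scale and quantize_example are hypothetical names introduced only for illustration; quantize_example needs torch, flashinfer and a CUDA device.

```python
SF_VEC_SIZE = 32  # assumption: one shared scale factor per 32 elements (MX block size)


def padded_k(k: int, alignment: int = 32) -> int:
    """Assumed padding rule: round K up to the next multiple of `alignment`."""
    if alignment <= 0 or alignment % SF_VEC_SIZE != 0:
        raise ValueError("alignment must be a positive multiple of SF_VEC_SIZE")
    return (k + alignment - 1) // alignment * alignment


def ue8m0_to_scale(byte: int) -> float:
    """Decode one UE8M0 byte: an unsigned power-of-two exponent with bias 127."""
    if not 0 <= byte < 255:  # 255 encodes NaN in E8M0
        raise ValueError("expected a UE8M0 byte in [0, 254]")
    return 2.0 ** (byte - 127)


def quantize_example(m: int = 128, k: int = 4096):
    """Call the kernel on a CUDA device; requires torch and flashinfer."""
    import torch
    from flashinfer.quantization.kernels.mxfp8_quantize import (
        mxfp8_quantize_cute_dsl,
    )

    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    # Linear scale-factor layout; enable_pdl is left as None (auto-detected).
    fp8, sf = mxfp8_quantize_cute_dsl(x, is_sf_swizzled_layout=False)
    assert fp8.dtype == torch.float8_e4m3fn and sf.dtype == torch.uint8
    # This shape check depends on the assumed padding rule in padded_k.
    assert tuple(fp8.shape) == (m, padded_k(k))
    return fp8, sf
```

Because the kernel is compiled per (K, dtype, pdl, sf_layout) tuple, calling quantize_example with different m but the same k should reuse the compiled kernel, while changing k or the layout flags would be expected to trigger a new compilation.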