flashinfer.quantization.mxfp8_quantize
- flashinfer.quantization.mxfp8_quantize(input: Tensor, is_sf_swizzled_layout: bool = True, alignment: int = 32, enable_pdl: bool | None = None, backend: Literal['cuda', 'cute-dsl'] = 'cuda', sf_swizzle_layout: SfLayout | None = None) → Tuple[Tensor, Tensor]
Quantize input tensor to MxFP8 format.
Converts the input tensor to MxFP8: FP8 (E4M3) values plus associated scale factors, one per group of sf_vec_size = 32 elements. Accepts fp16, bf16, and fp8_quantized inputs, and can emit the scale factors in either a swizzled or a non-swizzled layout.
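The block-scaling idea behind this format can be sketched in plain Python. This is an illustration only, not FlashInfer's kernel: the helper names block_scale_exponents and scale_row are hypothetical, the scale is assumed to be a power of two shared by 32 consecutive elements, and the exponent rounding rule (round up so the block maximum fits into the E4M3 range) is one common choice among several. The final rounding of each value to FP8 is omitted.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32        # elements that share one scale factor (sf_vec_size)


def block_scale_exponents(row, block=BLOCK):
    """Return one power-of-two exponent per `block` consecutive elements."""
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    exps = []
    for start in range(0, len(row), block):
        amax = max(abs(v) for v in row[start:start + block])
        if amax == 0.0:
            # Assumption for this sketch: all-zero blocks get exponent 0.
            exps.append(0)
        else:
            # Smallest e such that amax / 2**e <= E4M3_MAX.
            exps.append(math.ceil(math.log2(amax / E4M3_MAX)))
    return exps


def scale_row(row, exps, block=BLOCK):
    """Divide each element by its block's scale 2**e (FP8 rounding omitted)."""
    return [v / 2.0 ** exps[i // block] for i, v in enumerate(row)]


row = [1.0] * 32 + [1000.0] * 32
exps = block_scale_exponents(row)
scaled = scale_row(row, exps)
print(exps)
```

Each block ends up with its own exponent, so a block of small values and a block of large values both use most of the E4M3 range after scaling.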
- Parameters:
  - input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16/fp8_quantized.
  - is_sf_swizzled_layout (bool) – Whether to use the swizzled layout for scale factors. Defaults to True.
  - alignment (int) – sfVecSize. Defaults to 32.
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.
  - backend ({"cuda", "cute-dsl"}) – Backend to use:
    - "cuda": stable JIT-compiled CUDA kernel (default).
    - "cute-dsl": CuTe-DSL kernel (SM100+, experimental).
  - sf_swizzle_layout (SfLayout, optional) – Swizzle layout for scale factors; when supplied, this overrides is_sf_swizzled_layout.
- Returns:
  (x_q, sf), where x_q has shape [M, K] with dtype FLOAT8_E4M3 and sf is the scale-factor tensor whose shape depends on the chosen layout and sf_vec_size (fixed at 32 here).
- Return type:
  Tuple[torch.Tensor, torch.Tensor]
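To make the layout dependence of the scale-factor shape concrete, the sketch below counts scale-factor elements under an assumed padding rule. The helper sf_element_count is hypothetical and not part of FlashInfer. The sketch assumes that a non-swizzled layout stores exactly one scale factor per 32-element group of each row, and that a swizzled layout pads the scale-factor matrix up to whole tiles; the 128 x 4 tile used in the example is an assumption that should be checked against the SfLayout in use.

```python
def ceil_to(value, multiple):
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def sf_element_count(m, k, row_tile=1, col_tile=1, sf_vec_size=32):
    """Number of scale factors for an [m, k] input under tile padding.

    row_tile and col_tile describe the assumed padding granularity of the
    scale-factor matrix; 1 x 1 means no padding.
    """
    if k % sf_vec_size != 0:
        raise ValueError("k must be a multiple of sf_vec_size")
    sf_cols = k // sf_vec_size  # one scale factor per group of 32 elements
    return ceil_to(m, row_tile) * ceil_to(sf_cols, col_tile)


# No padding: 100 rows x 3 scale factors per row.
unpadded = sf_element_count(100, 96)
# Assumed 128 x 4 tile padding: 128 rows x 4 columns.
padded = sf_element_count(100, 96, row_tile=128, col_tile=4)
print(unpadded, padded)
```

Under these assumptions a padded layout can hold noticeably more elements than M * K / 32 for small or oddly sized inputs, which is why the returned sf tensor should be sized by querying its shape rather than by computing it by hand.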
Warning
The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice.
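A minimal usage sketch, based only on the signature documented above and sticking to the default "cuda" backend. The wrapper quantize_demo is hypothetical; it returns None when torch, flashinfer, or a CUDA device is unavailable, so the snippet is a no-op on machines without the GPU stack. The input sizes are arbitrary, and whether a given GPU is supported is not checked here.

```python
def quantize_demo(m=128, k=4096):
    """Quantize a random bf16 matrix; return None if the GPU stack is missing."""
    try:
        import torch
        from flashinfer.quantization import mxfp8_quantize
    except ImportError:
        return None  # torch or flashinfer is not installed
    if not torch.cuda.is_available():
        return None  # the kernels need a CUDA device

    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    # Defaults: swizzled scale-factor layout, "cuda" backend.
    x_q, sf = mxfp8_quantize(x)
    return x_q, sf


result = quantize_demo()
if result is not None:
    x_q, sf = result
    # x_q is documented to keep the input shape [M, K]; the shape of sf
    # depends on the scale-factor layout.
    print(x_q.shape, x_q.dtype, sf.shape, sf.dtype)
```

Passing is_sf_swizzled_layout=False, or an explicit sf_swizzle_layout, changes only how sf is laid out; x_q keeps shape [M, K] in either case.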