flashinfer.quantization.mxfp4_quantize
- flashinfer.quantization.mxfp4_quantize(a: Tensor, backend: str = 'cuda', enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]
Quantize input tensor to MXFP4 format.
- Parameters:
  - a (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.
  - backend (str) – Backend to use for quantization:
    - "cuda": stable CUDA kernel (default).
    - "cute-dsl": CuTe-DSL kernel (SM100+, experimental).
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Only used when backend == "cute-dsl". Auto-detected from device capability when None.
- Returns:
(x_q, sf), where x_q has shape [M, K/2] with dtype uint8 (FLOAT4_E2M1X2) and sf is the UE8M0 scale-factor tensor (uint8) whose shape depends on the chosen layout and sf_vec_size (fixed at 32 here).
- Return type:
Tuple[torch.Tensor, torch.Tensor]
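To make the return format concrete, here is a minimal pure-Python sketch of block-wise MXFP4 quantization. It is not FlashInfer's kernel. The helper names (`quantize_block`, `dequantize_block`, `quantize_row`), the power-of-two scale rule, the tie-breaking, the nibble order (even index in the low nibble), and the flat per-row scale list are all assumptions made for illustration. The real `sf` tensor may use a different layout.

```python
import math

# Magnitudes representable by FP4 E2M1, indexed by their 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32         # sf_vec_size: elements sharing one scale factor
E2M1_EMAX = 2      # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)
UE8M0_BIAS = 127   # UE8M0 stores a biased power-of-two exponent


def quantize_block(block):
    """Quantize 32 floats into (16 packed bytes, one UE8M0 scale byte)."""
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        exp = 0  # assumption: an all-zero block gets scale 1.0
    else:
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
        exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    exp = max(-127, min(127, exp))
    scale = 2.0 ** exp
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller one
        # (assumption: a real kernel may round ties differently).
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((0x8 if v < 0 else 0) | idx)
    # Assumed packing: element 2i in the low nibble, 2i+1 in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return packed, exp + UE8M0_BIAS


def dequantize_block(packed, scale_byte):
    """Inverse of quantize_block under the same assumptions."""
    scale = 2.0 ** (scale_byte - UE8M0_BIAS)
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7] * scale
            out.append(-mag if code & 0x8 else mag)
    return out


def quantize_row(row):
    """Quantize one row of length K (a multiple of 32): K/2 bytes, K/32 scales."""
    assert len(row) % BLOCK == 0
    packed, scales = bytearray(), bytearray()
    for start in range(0, len(row), BLOCK):
        p, s = quantize_block(row[start:start + BLOCK])
        packed += p
        scales.append(s)
    return bytes(packed), bytes(scales)
```

Under these assumptions, a row of K values yields K/2 packed bytes, matching the [M, K/2] shape of x_q, and one scale byte per 32 elements.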
Warning
The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice. Use at your own risk for production workloads.
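Given this warning, one option is to make the experimental backend strictly opt-in and fall back to "cuda" otherwise. The sketch below assumes that "SM100+" means a CUDA compute capability of at least (10, 0). The helper `choose_backend` and its `allow_experimental` flag are hypothetical and not part of FlashInfer.

```python
def choose_backend(capability, allow_experimental=False):
    """Pick a value for the `backend` argument of mxfp4_quantize.

    capability: (major, minor) tuple, for example the result of
    torch.cuda.get_device_capability().
    Assumption: "SM100+" corresponds to capability >= (10, 0).
    """
    if allow_experimental and tuple(capability) >= (10, 0):
        return "cute-dsl"
    return "cuda"  # stable default


# Intended use (requires a CUDA device with torch and flashinfer installed):
#   backend = choose_backend(torch.cuda.get_device_capability(),
#                            allow_experimental=True)
#   x_q, sf = flashinfer.quantization.mxfp4_quantize(a, backend=backend)
```

Keeping the choice in one function means that, if the experimental backend changes or is removed, only this helper needs updating.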