flashinfer.quantization.mxfp4_quantize

flashinfer.quantization.mxfp4_quantize(a: Tensor, backend: str = 'cuda', enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]

Quantize input tensor to MXFP4 format.

Parameters:
  • a (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.

  • backend (str) –

    Backend to use for quantization:

    • "cuda": stable CUDA kernel (default).

    • "cute-dsl": CuTe-DSL kernel (SM100+, experimental).

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Only used when backend == "cute-dsl". Auto-detected from device capability when None.
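As a rough illustration of how the parameters above fit together, the sketch below picks a backend from the device's compute capability and then calls mxfp4_quantize. The helpers pick_backend and quantize_with_fallback are hypothetical names introduced only for this example and are not part of FlashInfer. It assumes that "SM100+" corresponds to a compute capability of 10.0 or higher, and it leaves enable_pdl at None so that it is auto-detected.

```python
def pick_backend(compute_capability, allow_experimental=False):
    """Hypothetical helper: choose a value for the `backend` argument.

    compute_capability is a (major, minor) tuple, such as the one returned
    by torch.cuda.get_device_capability().
    """
    major, minor = compute_capability
    sm = major * 10 + minor
    # Assumption: "SM100+" in the docs means compute capability >= 10.0.
    if allow_experimental and sm >= 100:
        return "cute-dsl"
    # Fall back to the stable default backend.
    return "cuda"


def quantize_with_fallback(a, allow_experimental=False):
    """Hypothetical wrapper around mxfp4_quantize (needs a CUDA device)."""
    # Imported lazily so the helper above stays usable without a GPU stack.
    import torch
    from flashinfer.quantization import mxfp4_quantize

    backend = pick_backend(
        torch.cuda.get_device_capability(a.device), allow_experimental
    )
    # enable_pdl=None lets the library auto-detect PDL support; per the
    # parameter description it only matters for the "cute-dsl" backend.
    return mxfp4_quantize(a, backend=backend, enable_pdl=None)
```

With a [M, K] fp16 or bf16 CUDA tensor a, a call such as x_q, sf = quantize_with_fallback(a) would use the stable "cuda" backend unless allow_experimental=True is passed on an SM100+ device.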

Returns:

A tuple (x_q, sf). x_q has shape [M, K/2] and dtype uint8 (FLOAT4_E2M1X2, i.e. two 4-bit E2M1 values packed per byte). sf is the UE8M0 scale-factor tensor (uint8); its shape depends on the scale-factor layout and on sf_vec_size, which is fixed at 32 for this function.

Return type:

Tuple[torch.Tensor, torch.Tensor]
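To make the return format more concrete, here is a pure-Python reference sketch of MXFP4 quantization for a single row. It is not FlashInfer's kernel: it follows the scale rule from the OCP Microscaling (MX) specification, breaks rounding ties toward the smaller magnitude, and assumes that the even-indexed element of each pair goes into the low nibble. The real kernels may round differently and may store sf in a padded or swizzled layout. The names quantize_row and encode_e2m1 are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1, indexed by their 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # sf_vec_size: elements that share one scale factor


def encode_e2m1(x):
    """Return the 4-bit E2M1 code (sign bit in bit 3) nearest to x."""
    sign = 8 if x < 0 else 0
    mag = min(abs(x), 6.0)  # saturate at the largest E2M1 magnitude
    # On a tie, min() keeps the lower code (smaller magnitude); hardware
    # typically uses round-to-nearest-even instead.
    code = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
    return sign | code


def quantize_row(row):
    """Quantize one row of K floats into K/2 packed bytes and K/32 scales."""
    assert len(row) % BLOCK == 0, "K must be a multiple of 32 in this sketch"
    packed, scales = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # MX rule: shared exponent = floor(log2(amax)) - emax, emax = 2 for E2M1.
        exp = math.floor(math.log2(amax)) - 2 if amax > 0 else -127
        exp = max(-127, min(127, exp))
        scales.append(exp + 127)  # UE8M0 stores the exponent with bias 127
        scale = 2.0 ** exp
        codes = [encode_e2m1(v / scale) for v in block]
        # Assumption: even-indexed element in the low nibble, odd in the high.
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return packed, scales


row = [1.0] * 32 + [-6.0, 3.0] + [0.0] * 30
packed, scales = quantize_row(row)
```

For this 64-element row the sketch produces 32 packed bytes and 2 scale bytes, matching the [M, K/2] shape of x_q and one scale factor per 32 input elements.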

Warning

The "cute-dsl" backend is experimental and not part of the stable API. It may change or be removed in future versions without notice. Use at your own risk for production workloads.