flashinfer.quantization.kernels.mxfp4_quantize.mxfp4_quantize_cute_dsl

flashinfer.quantization.kernels.mxfp4_quantize.mxfp4_quantize_cute_dsl(input: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]

Quantize input tensor to MXFP4 format using the CuTe-DSL kernel.

GPU implementation with dual-path optimization:

  • LINEAR layout: flat iteration over scale-factor (SF) blocks, with an adaptive choice of one or four threads per SF block (1T/SF or 4T/SF). 4T/SF is used on low-SM GPUs (<=80 SMs) for coalesced memory access; 1T/SF is used on high-SM GPUs, where enough SMs are active to generate sufficient outstanding memory requests.

  • SWIZZLED layout: row-based iteration with a padding fast path.
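The thread-count choice for the LINEAR layout can be pictured with the minimal sketch below. The helper name use_4t_for_linear and the constant LOW_SM_THRESHOLD are hypothetical and are not part of the FlashInfer API; only the 80-SM cutoff comes from the description above. In practice the SM count would likely be read from the device, for example via torch.cuda.get_device_properties(device).multi_processor_count.

```python
# Hypothetical illustration of the 1T/SF vs 4T/SF choice for the LINEAR layout.
# Only the 80-SM cutoff is taken from the documentation; the names are made up.
LOW_SM_THRESHOLD = 80


def use_4t_for_linear(sm_count: int) -> bool:
    """Return True when four threads should cooperate on each SF block."""
    # Low-SM GPUs: split each SF block across 4 threads for coalesced access.
    # High-SM GPUs: one thread per SF block already keeps memory busy.
    return sm_count <= LOW_SM_THRESHOLD


if __name__ == "__main__":
    for sms in (58, 80, 132):
        mode = "4T/SF" if use_4t_for_linear(sms) else "1T/SF"
        print(f"{sms} SMs: {mode}")
```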

The kernel is compiled once per (K, dtype, sf_layout, pdl, use_4t) tuple and handles varying M (batch size) at runtime without recompilation.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.

  • sf_layout (int) – Scale-factor layout (0=128x4, 1=8x4, 2=linear).

  • enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch (PDL). When None, it is auto-detected from the device's compute capability (SM >= 9.0).
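The auto-detection rule for enable_pdl can be sketched as below. The helper resolve_enable_pdl is hypothetical and not a FlashInfer function; it assumes the capability is supplied as a (major, minor) tuple, which is the form returned by torch.cuda.get_device_capability().

```python
from typing import Optional, Tuple


def resolve_enable_pdl(enable_pdl: Optional[bool],
                       capability: Tuple[int, int]) -> bool:
    """Hypothetical helper: pick the effective PDL setting."""
    if enable_pdl is not None:
        # An explicit True/False from the caller is used unchanged.
        return enable_pdl
    # None means auto-detect: enable PDL on compute capability 9.0 or newer.
    return capability >= (9, 0)


if __name__ == "__main__":
    print(resolve_enable_pdl(None, (9, 0)))   # auto-detected on SM 9.0
    print(resolve_enable_pdl(None, (8, 9)))   # auto-detected on SM 8.9
    print(resolve_enable_pdl(False, (10, 0))) # explicit override
```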

Returns:

(fp4_tensor, scale_tensor), where fp4_tensor is the quantized tensor of shape [M, K/2] with dtype uint8 (two FP4 values packed per byte) and scale_tensor holds the UE8M0 scale factors (uint8), one per 32 input elements, reshaped to [padded_rows, K/32].

Return type:

Tuple[torch.Tensor, torch.Tensor]
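To make the output format concrete, the following pure-Python sketch quantizes a single block of 32 values into 16 packed bytes plus one UE8M0 scale byte, which is where the K/2 and K/32 dimensions come from. It is a reference illustration under stated assumptions, not the CuTe-DSL kernel: the scale rule follows the OCP Microscaling (MX) convention of floor(log2(amax)) minus the largest E2M1 exponent, rounding picks the nearest E2M1 magnitude with ties going to the smaller one, and the first element of each pair is placed in the low nibble. The actual kernel may differ in any of these details. The name quantize_block is hypothetical.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element, indexed by 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # number of elements that share one scale factor


def quantize_block(values):
    """Quantize 32 floats; return (16 packed bytes, UE8M0 scale byte)."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    # Assumed scale rule: power-of-two exponent of amax minus E2M1's emax (2).
    exp = math.floor(math.log2(amax)) - 2 if amax > 0 else -127
    exp = max(-127, min(127, exp))
    scale_byte = exp + 127  # UE8M0: biased 8-bit exponent, no sign or mantissa
    scale = 2.0 ** exp

    codes = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)  # saturate at the largest E2M1 value
        # Nearest representable magnitude; ties resolve to the smaller index.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((8 if v < 0 else 0) | idx)  # bit 3 is the sign bit

    # Assumed packing: element 2i in the low nibble, element 2i+1 in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return packed, scale_byte


if __name__ == "__main__":
    packed, scale_byte = quantize_block([1.0] * BLOCK)
    print(packed.hex(), scale_byte)
```

For an [M, K] input, applying this per block along K yields K/2 bytes of packed data and K/32 scale bytes per row, matching the shapes listed under Returns (before any row padding applied by the swizzled layouts).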