flashinfer.quantization.kernels.mxfp4_quantize.mxfp4_quantize_cute_dsl
- flashinfer.quantization.kernels.mxfp4_quantize.mxfp4_quantize_cute_dsl(input: Tensor, sf_layout: int = 0, enable_pdl: bool | None = None) → Tuple[Tensor, Tensor]
Quantize input tensor to MXFP4 format using the CuTe-DSL kernel.
GPU implementation with dual-path optimization:
- LINEAR layout: flat iteration over scale-factor (SF) blocks, with an adaptive choice of one thread (1T/SF) or four threads (4T/SF) per SF block. 4T/SF is used on low-SM GPUs (<=80 SMs) for coalesced memory access; 1T/SF is used on high-SM GPUs, where the larger number of SMs already generates enough outstanding memory requests.
- SWIZZLED layout: row-based iteration with a padding fast path.
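The thread-mapping rule for the LINEAR path can be pictured with a few lines of Python. This is only a sketch of the rule as stated above: the helper name threads_per_sf_block and the constant LOW_SM_THRESHOLD are hypothetical and are not part of the FlashInfer API, and the real kernel may consider more than the SM count.

```python
# Hypothetical helper; the name and signature are illustrative, not FlashInfer API.
LOW_SM_THRESHOLD = 80  # the "<=80 SMs" threshold quoted in the description


def threads_per_sf_block(sm_count: int) -> int:
    """Return 4 (4T/SF) on low-SM GPUs and 1 (1T/SF) otherwise."""
    if sm_count <= 0:
        raise ValueError("sm_count must be positive")
    return 4 if sm_count <= LOW_SM_THRESHOLD else 1


if __name__ == "__main__":
    # SM counts chosen only to exercise both branches.
    for sms in (48, 80, 132):
        print(sms, "SMs ->", threads_per_sf_block(sms), "thread(s) per SF block")
```

On a real device the SM count could be read from torch.cuda.get_device_properties(device).multi_processor_count and passed to such a helper.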
The kernel is compiled once per (K, dtype, sf_layout, pdl, use_4t) tuple and handles varying M (batch size) at runtime without recompilation.
- Parameters:
  - input (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.
  - sf_layout (int) – Scale-factor layout (0 = 128x4, 1 = 8x4, 2 = linear).
  - enable_pdl (bool, optional) – Whether to enable Programmatic Dependent Launch. Auto-detected from device capability (SM >= 9.0) when None.
- Returns:
  (fp4_tensor, scale_tensor), where fp4_tensor is the quantized tensor of shape [M, K/2] with dtype uint8 and scale_tensor contains the UE8M0 scale factors (uint8) reshaped to [padded_rows, K/32].
- Return type:
  Tuple[torch.Tensor, torch.Tensor]
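The documented output shapes can be worked through with a small helper. The [M, K/2] and [padded_rows, K/32] shapes come from the description above; the padding rule is an assumption (rows rounded up to 128 for the 128x4 layout, to 8 for the 8x4 layout, and left unpadded for the linear layout), as is the requirement that K be a multiple of 32. The name expected_shapes is hypothetical and is not part of FlashInfer.

```python
from typing import Tuple

# ASSUMPTION: row padding multiple per sf_layout value (0=128x4, 1=8x4, 2=linear).
ROW_MULTIPLE = {0: 128, 1: 8, 2: 1}


def expected_shapes(M: int, K: int, sf_layout: int = 0) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (fp4_shape, scale_shape) for an [M, K] input."""
    if sf_layout not in ROW_MULTIPLE:
        raise ValueError("sf_layout must be 0 (128x4), 1 (8x4), or 2 (linear)")
    if K % 32 != 0:
        # ASSUMPTION: one scale factor covers 32 elements, so K must divide evenly.
        raise ValueError("K must be a multiple of 32")
    multiple = ROW_MULTIPLE[sf_layout]
    padded_rows = -(-M // multiple) * multiple  # ceiling to the next multiple
    fp4_shape = (M, K // 2)                     # two 4-bit values per uint8
    scale_shape = (padded_rows, K // 32)        # one UE8M0 byte per 32 elements
    return fp4_shape, scale_shape


if __name__ == "__main__":
    for layout in (0, 1, 2):
        print(layout, expected_shapes(100, 4096, layout))
```

With FlashInfer installed and a CUDA device available, these values could be compared against the shapes of the two tensors returned by mxfp4_quantize_cute_dsl(x, sf_layout=...) for an fp16 or bf16 input x of shape [M, K].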