flashinfer.fp4_quantization.nvfp4_quantize
- flashinfer.fp4_quantization.nvfp4_quantize(a, a_global_sf, sfLayout=SfLayout.layout_128x4, do_shuffle=False, sf_vec_size=16)
Quantize input tensor to NVFP4 format.
- Parameters:
a (torch.Tensor) – Input tensor of shape [M, K] with dtype fp16/bf16.
a_global_sf (torch.Tensor) – Global scale factor of shape [1] with dtype float32.
sfLayout (SfLayout, optional) – Scale factor layout. Defaults to SfLayout.layout_128x4.
do_shuffle (bool, optional) – Whether to shuffle the scale factors. Defaults to False. Shuffling is needed only for the scale factors of tensor B when using the TRTLLM backend.
sf_vec_size (int, optional) – Scale factor vector size. Defaults to 16.
- Returns:
- A tuple containing:
Quantized tensor of shape [M, K/2] with dtype FLOAT4_E2M1X2
Scale factors tensor with shape determined by layout and sf_vec_size
- Return type:
Tuple[torch.Tensor, torch.Tensor]
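The shapes described above can be sketched in plain Python. The helper below is hypothetical and not part of flashinfer: it only does the shape arithmetic implied by the documentation (two 4-bit values packed per byte, one scale factor per block of sf_vec_size elements). The padding rule for SfLayout.layout_128x4 (rows rounded up to a multiple of 128, scale-factor columns rounded up to a multiple of 4) and the requirement that K be a multiple of sf_vec_size are assumptions, not statements taken from this page.

```python
import math


def nvfp4_output_shapes(m, k, sf_vec_size=16, pad_128x4=True):
    """Hypothetical helper: expected output sizes for an [m, k] input.

    Returns (packed_shape, sf_rows, sf_cols).
    """
    # Assumption: each row must split evenly into blocks of sf_vec_size.
    if k % sf_vec_size != 0:
        raise ValueError("k must be a multiple of sf_vec_size")

    # Two 4-bit E2M1 values share one byte, hence [M, K/2].
    packed_shape = (m, k // 2)

    # One scale factor per block of sf_vec_size consecutive elements.
    sf_rows, sf_cols = m, k // sf_vec_size

    if pad_128x4:
        # Assumption: the 128x4 layout pads rows to a multiple of 128
        # and scale-factor columns to a multiple of 4.
        sf_rows = math.ceil(sf_rows / 128) * 128
        sf_cols = math.ceil(sf_cols / 4) * 4

    return packed_shape, sf_rows, sf_cols


# The actual call, per the signature above, would look like:
#   a_fp4, a_sf = nvfp4_quantize(a, a_global_sf)
# and needs a GPU build of flashinfer, so it is not executed here.
packed, sf_rows, sf_cols = nvfp4_output_shapes(256, 1024)
print(packed, sf_rows, sf_cols)
```

Comparing these numbers against the shapes of the returned tensors is a quick sanity check. If they disagree, the layout assumption above is the first thing to revisit.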