flashinfer.quantization.nvfp4_kv_quantize
- flashinfer.quantization.nvfp4_kv_quantize(input: Tensor, global_scale: Tensor) → Tuple[Tensor, Tensor]
GPU quantization to the NVFP4 KV-cache format with linear block-scale layout.
Requires SM100+ (Blackwell) for the cvt.rn.satfinite.e2m1x2.f32 PTX instruction.
- Parameters:
input (torch.Tensor) – Input tensor of shape [M, K] with dtype bf16 or fp16; K must be divisible by 16.
global_scale (torch.Tensor) – Global scale factor of shape [1] with dtype float32, on the same CUDA device as input.
- Returns:
(fp4_output, block_scales), where fp4_output is packed FP4 data of shape [M, K/2] with dtype uint8 (two 4-bit values per byte) and block_scales holds the per-block FP8 E4M3 scales, one per 16 input elements, of shape [M, K/16] with dtype uint8.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
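The shape contract above can be sketched as follows. This is a minimal illustration, not part of FlashInfer: nvfp4_kv_output_shapes and try_nvfp4_kv_quantize are hypothetical helper names, the global scale of 1.0 is only a placeholder value, and the capability check (major version 10 or higher) is an assumed reading of "SM100+". The kernel call itself is skipped when torch, flashinfer, or a suitable GPU is not available.

```python
from typing import Tuple

BLOCK_SIZE = 16  # one block scale per 16 input elements (from the K/16 scale shape)


def nvfp4_kv_output_shapes(m: int, k: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Hypothetical helper: expected (fp4_output, block_scales) shapes for an [M, K] input."""
    if k % BLOCK_SIZE != 0:
        raise ValueError(f"K must be divisible by {BLOCK_SIZE}, got K={k}")
    # Two 4-bit values are packed into each uint8, so the last dimension halves.
    return (m, k // 2), (m, k // BLOCK_SIZE)


def try_nvfp4_kv_quantize(m: int = 4, k: int = 64):
    """Hypothetical helper: call the documented function only if the environment allows it."""
    try:
        import torch
        from flashinfer.quantization import nvfp4_kv_quantize
    except ImportError:
        return None  # torch or flashinfer is not installed
    # Assumption: "SM100+" corresponds to a compute capability major version >= 10.
    if not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 10:
        return None
    x = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    # Placeholder global scale; a real value would come from calibration.
    global_scale = torch.tensor([1.0], device="cuda", dtype=torch.float32)
    fp4_output, block_scales = nvfp4_kv_quantize(x, global_scale)
    fp4_shape, scale_shape = nvfp4_kv_output_shapes(m, k)
    assert tuple(fp4_output.shape) == fp4_shape
    assert tuple(block_scales.shape) == scale_shape
    return fp4_output, block_scales
```

For example, nvfp4_kv_output_shapes(4, 64) returns ((4, 32), (4, 4)), while K = 24 raises a ValueError because it is not a multiple of 16.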