flashinfer.fp4_quantization.nvfp4_kv_quantize¶
- flashinfer.fp4_quantization.nvfp4_kv_quantize(input: Tensor, global_scale: Tensor) Tuple[Tensor, Tensor]¶
GPU quantization to NVFP4 KV cache format with linear block scale layout.
Requires SM100+ (Blackwell) for the cvt.rn.satfinite.e2m1x2.f32 PTX instruction.
- Parameters:
input (torch.Tensor) – Input tensor of shape [M, K] with dtype bf16 or fp16. K must be divisible by 16.
global_scale (torch.Tensor) – Global scale factor of shape [1] with dtype float32, on the same CUDA device as input.
- Returns:
fp4_output: Packed FP4 data of shape [M, K/2] with dtype uint8.
block_scales: Per-block FP8 E4M3 scales of shape [M, K/16] with dtype uint8.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
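To make the output layout concrete, here is a pure-Python sketch of how the packed FP4 data can be interpreted: each uint8 in fp4_output holds two E2M1 (NVFP4) codes, and each 16-element block of a row shares one E4M3 scale. The nibble ordering (low nibble = even-indexed element) is an assumption about a common packing convention, not something this page specifies; check the kernel source for the actual order.

```python
def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 (NVFP4) code to its float value.

    E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
    Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
    """
    sign = -1.0 if nibble & 0x8 else 1.0
    e = (nibble >> 1) & 0x3  # 2 exponent bits
    m = nibble & 0x1         # 1 mantissa bit
    if e == 0:
        mag = m * 0.5        # subnormal: 0.0 or 0.5
    else:
        mag = (2.0 ** (e - 1)) * (1.0 + m / 2.0)
    return sign * mag


def unpack_row(packed: bytes) -> list:
    """Unpack a row of K/2 packed bytes into K FP4 values.

    Assumes the low nibble holds the even-indexed element
    (an assumed convention; verify against the kernel).
    """
    out = []
    for b in packed:
        out.append(decode_e2m1(b & 0xF))  # even-indexed element
        out.append(decode_e2m1(b >> 4))   # odd-indexed element
    return out


# Shape bookkeeping for an [M, K] input, matching the doc above:
M, K = 4, 64
fp4_shape = (M, K // 2)      # two FP4 codes packed per uint8
scales_shape = (M, K // 16)  # one E4M3 scale per 16-element block
```

Note that the decoded values are unscaled: to recover approximate inputs, each block of 16 decoded values would be multiplied by its dequantized E4M3 block scale and the global scale factor.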