flashinfer.fp4_quantization.nvfp4_kv_quantize

flashinfer.fp4_quantization.nvfp4_kv_quantize(input: Tensor, global_scale: Tensor) → Tuple[Tensor, Tensor]

Quantizes a tensor on the GPU to the NVFP4 KV-cache format, producing packed FP4 values and per-16-element block scales in a linear block-scale layout.

Requires SM100 or later (Blackwell), which provides the cvt.rn.satfinite.e2m1x2.f32 PTX instruction used for the FP4 conversion.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype bf16 or fp16. K must be divisible by 16.

  • global_scale (torch.Tensor) – Global scale factor of shape [1] with dtype float32, on the same CUDA device as input.

Returns:

  • fp4_output: Packed FP4 data of shape [M, K/2] with dtype uint8.

  • block_scales: Per-block FP8 E4M3 scales of shape [M, K/16] with dtype uint8.

Return type:

Tuple[torch.Tensor, torch.Tensor]
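The shape and packing relationships above (16-element scale blocks, two FP4 values per output byte) can be sketched on the CPU with NumPy. This is an illustrative reference, not the kernel's implementation: the real op runs on the GPU, stores block scales as FP8 E4M3 in uint8, and its nibble order and the exact way global_scale is folded into the stored scales are assumptions here.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def nvfp4_quantize_ref(x: np.ndarray, global_scale: float):
    """CPU reference sketch of NVFP4 quantization.

    Returns (packed_fp4 [M, K/2] uint8, block_scales [M, K/16] float32).
    The real kernel encodes the block scales as FP8 E4M3 (uint8);
    they are kept as float32 here for clarity.
    """
    M, K = x.shape
    assert K % 16 == 0, "K must be divisible by 16"
    blocks = x.reshape(M, K // 16, 16).astype(np.float32)

    # Per-block scale chosen so the block amax maps onto E2M1's max (6.0).
    amax = np.abs(blocks).max(axis=-1)
    scale = np.where(amax == 0.0, 1.0, amax / 6.0)
    scaled = blocks / scale[..., None]

    # Round each magnitude to the nearest E2M1 code; bit 3 carries the sign.
    codes = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1).astype(np.uint8)
    codes |= (scaled < 0).astype(np.uint8) << 3

    # Pack two 4-bit codes per byte (even element in the low nibble --
    # an assumption; the kernel's nibble order may differ).
    flat = codes.reshape(M, K)
    packed = (flat[:, 0::2] | (flat[:, 1::2] << 4)).astype(np.uint8)

    # Folding global_scale into the stored scales is an assumed convention;
    # check the kernel for the exact factor direction.
    return packed, (scale * global_scale).astype(np.float32)
```

For the actual op, a bf16 input of shape [M, K] on a Blackwell GPU together with a float32 global scale of shape [1] yields the same [M, K/2] and [M, K/16] uint8 outputs described above.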