flashinfer.quantization.nvfp4_kv_quantize

flashinfer.quantization.nvfp4_kv_quantize(input: Tensor, global_scale: Tensor) → Tuple[Tensor, Tensor]

GPU quantization to the NVFP4 KV-cache format with linear block-scale layout.

Requires an SM100+ (Blackwell) GPU, because the kernel uses the cvt.rn.satfinite.e2m1x2.f32 PTX instruction.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype bf16 or fp16; K must be divisible by 16.

  • global_scale (torch.Tensor) – Global scale factor of shape [1] with dtype float32, on the same CUDA device as input.

Returns:

(fp4_output, block_scales), where fp4_output is packed FP4 data of shape [M, K/2] with dtype uint8 (two 4-bit values per byte) and block_scales holds the per-block FP8 E4M3 scales, one per 16 input elements, of shape [M, K/16] with dtype uint8.
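
The shape arithmetic above can be sketched in plain Python. The helper names nvfp4_kv_output_shapes and nvfp4_kv_bytes are hypothetical and not part of FlashInfer; they only restate the documented shapes and the K % 16 constraint so the storage cost can be estimated without a GPU.

```python
def nvfp4_kv_output_shapes(m: int, k: int):
    """Return (fp4_shape, scale_shape) for an [m, k] input, per the documented shapes.

    Hypothetical helper for illustration; not a FlashInfer function.
    """
    if k % 16 != 0:
        raise ValueError(f"K must be divisible by 16, got {k}")
    fp4_shape = (m, k // 2)     # two 4-bit values packed into each uint8
    scale_shape = (m, k // 16)  # one FP8 E4M3 scale byte per 16 input elements
    return fp4_shape, scale_shape


def nvfp4_kv_bytes(m: int, k: int) -> int:
    """Total bytes of packed data plus block scales (both outputs are uint8)."""
    fp4_shape, scale_shape = nvfp4_kv_output_shapes(m, k)
    return fp4_shape[0] * fp4_shape[1] + scale_shape[0] * scale_shape[1]


fp4_shape, scale_shape = nvfp4_kv_output_shapes(8, 128)
print(fp4_shape, scale_shape)
print(nvfp4_kv_bytes(8, 128), "bytes quantized vs", 8 * 128 * 2, "bytes in bf16/fp16")
```

Per row, the quantized form takes K/2 + K/16 bytes against 2K bytes for bf16 or fp16 input, a ratio of 32/9 (roughly 3.6x smaller), not counting the single global scale.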

Return type:

Tuple[torch.Tensor, torch.Tensor]
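
A minimal usage sketch based only on the signature and shapes documented above. It assumes PyTorch and FlashInfer are installed and an SM100+ GPU is available. The wrapper name quantize_example and the global scale value of 1.0 are placeholders chosen for illustration; this page does not say how the global scale should be derived.

```python
def quantize_example():
    # Imports are local so the sketch can be loaded on machines without
    # PyTorch, FlashInfer, or a Blackwell GPU; calling it still needs all three.
    import torch
    from flashinfer.quantization import nvfp4_kv_quantize

    m, k = 8, 128  # k is divisible by 16, as required
    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")

    # Placeholder global scale of shape [1] (assumption): a real value depends
    # on how the KV cache is calibrated, which is not specified here.
    global_scale = torch.tensor([1.0], dtype=torch.float32, device="cuda")

    fp4_output, block_scales = nvfp4_kv_quantize(x, global_scale)

    # Check the outputs against the documented shapes and dtypes.
    assert fp4_output.shape == (m, k // 2) and fp4_output.dtype == torch.uint8
    assert block_scales.shape == (m, k // 16) and block_scales.dtype == torch.uint8
    return fp4_output, block_scales
```

Both outputs are returned as uint8 tensors, so any consumer has to reinterpret the bytes itself: fp4_output as pairs of 4-bit E2M1 values and block_scales as FP8 E4M3 values.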