flashinfer.fp4_quantization.nvfp4_kv_quantize

flashinfer.fp4_quantization.nvfp4_kv_quantize(input: Tensor, global_scale: Tensor) → Tuple[Tensor, Tensor]

Quantizes a tensor on the GPU to the NVFP4 KV-cache format, producing packed FP4 values and per-16-element block scales in a linear block-scale layout.

Requires SM100 or later (Blackwell), which provides the cvt.rn.satfinite.e2m1x2.f32 PTX instruction used for the FP4 conversion.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [M, K] with dtype bf16 or fp16. K must be divisible by 16.

  • global_scale (torch.Tensor) – Global scale factor of shape [1] with dtype float32, on the same CUDA device as input.

Returns:

  • fp4_output: Packed FP4 data of shape [M, K/2] with dtype uint8.

  • block_scales: Per-block FP8 E4M3 scales of shape [M, K/16] with dtype uint8.

Return type:

Tuple[torch.Tensor, torch.Tensor]
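The shape and packing relationships above (16-element scale blocks, two FP4 values per output byte) can be sketched on the CPU with NumPy. This is an illustrative reference, not the kernel's implementation: the real op runs on the GPU, stores block scales as FP8 E4M3 in uint8, and its nibble order and the exact way global_scale is folded into the stored scales are assumptions here.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def nvfp4_quantize_ref(x: np.ndarray, global_scale: float):
    """CPU reference sketch of NVFP4 quantization.

    Returns (packed_fp4 [M, K/2] uint8, block_scales [M, K/16] float32).
    The real kernel encodes the block scales as FP8 E4M3 (uint8);
    they are kept as float32 here for clarity.
    """
    M, K = x.shape
    assert K % 16 == 0, "K must be divisible by 16"
    blocks = x.reshape(M, K // 16, 16).astype(np.float32)

    # Per-block scale chosen so the block amax maps onto E2M1's max (6.0).
    amax = np.abs(blocks).max(axis=-1)
    scale = np.where(amax == 0.0, 1.0, amax / 6.0)
    scaled = blocks / scale[..., None]

    # Round each magnitude to the nearest E2M1 code; bit 3 carries the sign.
    codes = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1).astype(np.uint8)
    codes |= (scaled < 0).astype(np.uint8) << 3

    # Pack two 4-bit codes per byte (even element in the low nibble --
    # an assumption; the kernel's nibble order may differ).
    flat = codes.reshape(M, K)
    packed = (flat[:, 0::2] | (flat[:, 1::2] << 4)).astype(np.uint8)

    # Folding global_scale into the stored scales is an assumed convention;
    # check the kernel for the exact factor direction.
    return packed, (scale * global_scale).astype(np.float32)
```

For the actual op, a bf16 input of shape [M, K] on a Blackwell GPU together with a float32 global scale of shape [1] yields the same [M, K/2] and [M, K/16] uint8 outputs described above.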