flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv

flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv(q: Tensor, k: Tensor, v: Tensor, per_block_mean: bool = True) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]

Preprocess and quantize dense Q/K/V tensors for SM120 NVFP4 attention.

The input layout is [batch, num_heads, seq_len, head_dim]. Inputs must be contiguous CUDA tensors with the same shape, dtype, and device. The sequence dimension is padded to a multiple of 128 before Q/K/V are quantized.
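The padding rule above can be sketched in plain Python. This is only an illustration of rounding the sequence dimension up to the next multiple of 128; the helper names `padded_seq_len` and `padded_shape` are hypothetical and are not part of flashinfer.

```python
BLOCK_TOKENS = 128  # block size stated in the description above


def padded_seq_len(seq_len: int, block: int = BLOCK_TOKENS) -> int:
    """Round seq_len up to the next multiple of block (hypothetical helper)."""
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    # Ceiling division written with integer arithmetic only.
    return -(-seq_len // block) * block


def padded_shape(shape: tuple) -> tuple:
    """Apply the padding to a [batch, num_heads, seq_len, head_dim] shape."""
    if len(shape) != 4:
        raise ValueError("expected [batch, num_heads, seq_len, head_dim]")
    batch, num_heads, seq_len, head_dim = shape
    return (batch, num_heads, padded_seq_len(seq_len), head_dim)


example = padded_shape((2, 8, 300, 64))
```

A sequence length that is already a multiple of 128 is left unchanged, so `padded_seq_len(256)` stays at 256.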

Parameters:
  • q (torch.Tensor) – Dense query tensor with dtype torch.float16 or torch.bfloat16.

  • k (torch.Tensor) – Dense key tensor with dtype torch.float16 or torch.bfloat16.

  • v (torch.Tensor) – Dense value tensor with dtype torch.float16 or torch.bfloat16.

  • per_block_mean (bool, optional) – Whether to center Q per 128-token block. When False, Q is centered once across the full sequence. Defaults to True.
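The difference between the two per_block_mean settings can be illustrated on a toy one-dimensional sequence. This sketch assumes that "centering" means subtracting a mean taken along the token axis; the source does not spell out the exact computation, and the helper `center_tokens` is hypothetical. A block size of 2 stands in for the real 128 so the numbers stay readable.

```python
from statistics import fmean


def center_tokens(values, block_size=128, per_block_mean=True):
    """Subtract a mean along the token axis (hypothetical illustration).

    per_block_mean=True: each block of block_size tokens gets its own mean.
    per_block_mean=False: one mean over the whole sequence is used.
    """
    if not values:
        return []
    if not per_block_mean:
        mean = fmean(values)
        return [v - mean for v in values]
    centered = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        mean = fmean(block)
        centered.extend(v - mean for v in block)
    return centered


tokens = [1.0, 3.0, 10.0, 14.0]
per_block = center_tokens(tokens, block_size=2, per_block_mean=True)
full_sequence = center_tokens(tokens, block_size=2, per_block_mean=False)
```

With per-block centering the two blocks are shifted independently, so values whose level drifts along the sequence end up in a narrower range than with a single sequence-wide mean.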

Returns:

A tuple of seven tensors: the quantized q_fp4 and k_fp4, the transposed quantized v_fp4_t, the scale tensors q_scale, k_scale, and v_scale_t, and the expanded FP32 QK correction.

Return type:

Tuple[torch.Tensor, …]
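Because the function returns seven positional tensors, giving them names at the call site can reduce mix-ups. The following is a minimal sketch, assuming the tuple order matches the order listed under Returns. `QuantizedQKV` and `unpack_quantized` are hypothetical helpers, and the field name `qk_correction` is an illustrative label, not a name taken from the library. Placeholder strings stand in for tensors so the sketch runs without a GPU.

```python
from typing import Any, NamedTuple


class QuantizedQKV(NamedTuple):
    """Named view over the 7-tuple (hypothetical helper, not a flashinfer type)."""
    q_fp4: Any
    k_fp4: Any
    v_fp4_t: Any
    q_scale: Any
    k_scale: Any
    v_scale_t: Any
    qk_correction: Any  # illustrative name for the expanded FP32 QK correction


def unpack_quantized(outputs) -> QuantizedQKV:
    """Wrap the returned tuple, checking that it has exactly seven entries."""
    outputs = tuple(outputs)
    if len(outputs) != 7:
        raise ValueError(f"expected 7 outputs, got {len(outputs)}")
    return QuantizedQKV(*outputs)


# Placeholders stand in for tensors. On supported hardware the argument would be
# the result of nvfp4_attention_sm120_quantize_qkv(q, k, v).
placeholder_outputs = tuple(f"tensor_{i}" for i in range(7))
named = unpack_quantized(placeholder_outputs)
```

Since a NamedTuple is still a tuple, the wrapped result can be passed on positionally or unpacked wherever the raw 7-tuple was expected.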