flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv

flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv(q: Tensor, k: Tensor, v: Tensor, per_block_mean: bool = True) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]

Preprocess and quantize dense Q/K/V tensors for SM120 NVFP4 attention.

The input layout is [batch, num_heads, seq_len, head_dim]. Inputs must be contiguous CUDA tensors with the same shape, dtype, and device. The sequence dimension is padded to a multiple of 128 before Q/K/V are quantized.
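The padding rule above can be illustrated with a small helper. This is only a sketch of the arithmetic: `padded_seq_len` is a hypothetical name, not part of flashinfer, and the library performs the padding internally.

```python
BLOCK = 128  # block size named in the description above


def padded_seq_len(seq_len: int, block: int = BLOCK) -> int:
    """Round seq_len up to the next multiple of `block` (hypothetical helper)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Integer ceiling division, then scale back up to a whole number of blocks.
    return ((seq_len + block - 1) // block) * block


for n in (1, 128, 129, 300):
    print(n, "->", padded_seq_len(n))
```

A sequence length that is already a multiple of 128 is left unchanged; any other length grows to the next multiple.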

Parameters:

q (torch.Tensor) – Dense query tensor with dtype torch.float16 or torch.bfloat16.
k (torch.Tensor) – Dense key tensor with the same shape, dtype, and device as q.
v (torch.Tensor) – Dense value tensor with the same shape, dtype, and device as q.
per_block_mean (bool, optional) – Whether to center Q per 128-token block. Defaults to True. When False, Q is centered once across the full sequence.
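The difference between the two per_block_mean modes can be sketched in plain Python. This is an illustration under assumptions, not the library's kernel: it assumes that "centering" means subtracting the per-channel mean taken over tokens, it uses a block of 2 tokens instead of 128 to keep the data small, and `center_rows` is a hypothetical helper.

```python
from typing import List


def center_rows(rows: List[List[float]], block: int) -> List[List[float]]:
    """Subtract the per-channel mean of each `block`-row chunk from its rows.

    Hypothetical helper: rows are tokens, columns are head_dim channels.
    """
    out: List[List[float]] = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        # Mean over the tokens of this chunk, computed separately per channel.
        mean = [sum(r[c] for r in chunk) / len(chunk) for c in range(dim)]
        out.extend([[r[c] - mean[c] for c in range(dim)] for r in chunk])
    return out


# Four tokens with two channels each.
q = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]

per_block = center_rows(q, block=2)         # stand-in for per_block_mean=True
full_seq = center_rows(q, block=len(q))     # stand-in for per_block_mean=False

print(per_block)
print(full_seq)
```

In this toy data, per-block centering removes the drift between blocks, so both blocks end up with the same small values, while full-sequence centering leaves a wider range to quantize.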

Returns:

The quantized tensors q_fp4, k_fp4, and the transposed v_fp4_t; their scale tensors q_scale, k_scale, and v_scale_t; and the compact FP32 QK correction. The correction has shape [batch, num_heads, seq_len / 128, seq_len] when per_block_mean=True and [batch, num_heads, 1, seq_len] when per_block_mean=False.

Return type:

Tuple[torch.Tensor, …]
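As a quick check on the documented correction shape, the following sketch computes it from the input dimensions. It assumes that seq_len in the shape formula refers to the padded length (a multiple of 128), which the description does not state explicitly; `correction_shape` is a hypothetical helper, not a flashinfer function.

```python
from typing import Tuple

BLOCK = 128  # block size named in the description above


def correction_shape(batch: int, num_heads: int, seq_len: int,
                     per_block_mean: bool = True) -> Tuple[int, int, int, int]:
    """Return the documented shape of the FP32 QK correction (hypothetical helper)."""
    # Assumption: seq_len is the padded length, so it divides evenly by 128.
    if seq_len % BLOCK != 0:
        raise ValueError("seq_len must be a multiple of 128 (pad it first)")
    # One row per 128-token block, or a single row for full-sequence centering.
    rows = seq_len // BLOCK if per_block_mean else 1
    return (batch, num_heads, rows, seq_len)


print(correction_shape(2, 8, 256))                        # per-block centering
print(correction_shape(2, 8, 256, per_block_mean=False))  # full-sequence centering
```

Comparing the result against the shape of the last returned tensor is one way to confirm which centering mode a call used.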