flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv
- flashinfer.nvfp4_attention_sm120.nvfp4_attention_sm120_quantize_qkv(q: Tensor, k: Tensor, v: Tensor, per_block_mean: bool = True) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
Preprocess and quantize dense Q/K/V tensors for SM120 NVFP4 attention.
The input layout is [batch, num_heads, seq_len, head_dim]. Inputs must be contiguous CUDA tensors with the same shape, dtype, and device. The sequence dimension is padded to a multiple of 128 before Q/K/V are quantized.
- Parameters:
q (torch.Tensor) – Dense Q/K/V tensors with dtype torch.float16 or torch.bfloat16.
k (torch.Tensor) – Dense Q/K/V tensors with dtype torch.float16 or torch.bfloat16.
v (torch.Tensor) – Dense Q/K/V tensors with dtype torch.float16 or torch.bfloat16.
per_block_mean (bool, optional) – Whether to center Q per 128-token block. When False, Q is centered once across the full sequence.
- Returns:
q_fp4, k_fp4, transposed v_fp4_t, scale tensors q_scale, k_scale, v_scale_t, and the expanded FP32 QK correction.
- Return type:
Tuple[torch.Tensor, …]
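
Example (a minimal sketch, not taken from the library's own documentation). The helper `padded_seq_len` is hypothetical and only illustrates the "multiple of 128" padding rule stated above. `run_example` shows one plausible call, assuming an SM120 GPU, a FlashInfer build that ships this module, that head_dim 128 is supported, and that the seven outputs come back in the order listed under Returns. The name `qk_correction` is a placeholder for the last output.

```python
BLOCK = 128  # sequence padding granularity stated in the description above


def padded_seq_len(seq_len: int, block: int = BLOCK) -> int:
    """Hypothetical helper: smallest multiple of `block` that is >= seq_len."""
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    return ((seq_len + block - 1) // block) * block


def run_example():
    """Sketch of a call; needs CUDA, an SM120 device, and FlashInfer installed."""
    # Imports are local so the padding helper can be used without a GPU stack.
    import torch
    from flashinfer.nvfp4_attention_sm120 import (
        nvfp4_attention_sm120_quantize_qkv,
    )

    # Layout is [batch, num_heads, seq_len, head_dim]; the sizes are assumptions.
    shape = (1, 8, 1000, 128)
    # torch.randn returns contiguous tensors; all three share shape, dtype, device.
    q = torch.randn(shape, dtype=torch.bfloat16, device="cuda")
    k = torch.randn(shape, dtype=torch.bfloat16, device="cuda")
    v = torch.randn(shape, dtype=torch.bfloat16, device="cuda")

    outputs = nvfp4_attention_sm120_quantize_qkv(q, k, v, per_block_mean=True)

    # Assumed order, following the Returns description; verify before relying on it.
    q_fp4, k_fp4, v_fp4_t, q_scale, k_scale, v_scale_t, qk_correction = outputs
    return outputs
```

With these assumed sizes, a seq_len of 1000 would be padded to `padded_seq_len(1000)`, that is 1024, before quantization. Passing `per_block_mean=False` would instead center Q once across the full sequence.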