flashinfer.xqa.xqa

flashinfer.xqa.xqa(q: Tensor, k_cache: Tensor, v_cache: Tensor, page_table: Tensor, seq_lens: Tensor, output: Tensor, workspace_buffer: Tensor, semaphores: Tensor, num_kv_heads: int, page_size: int, sinks: Tensor | None = None, q_scale: float = 1.0, kv_scale: Tensor | None = None, sliding_win_size: int = 0, sm_count: int | None = None) None

Apply attention with a paged KV cache using the XQA kernel.

Parameters:
  • q (torch.Tensor) – Query tensor with shape [batch_size, beam_width, num_q_heads, head_dim]. Data type should be torch.float16 or torch.bfloat16. Currently only beam_width 1 is supported.

  • k_cache (torch.Tensor) – Paged K cache tensor with shape [total_num_cache_heads, head_dim]. Data type should match the query tensor or be torch.float8_e4m3fn, in which case xqa runs the computation in FP8. Must have the same data type as v_cache.

  • v_cache (torch.Tensor) – Paged V cache tensor with shape [total_num_cache_heads, head_dim]. Data type should match the query tensor or be torch.float8_e4m3fn, in which case xqa runs the computation in FP8. Must have the same data type as k_cache.

  • page_table (torch.Tensor) – Page table tensor with shape [batch_size, nb_pages_per_seq]. Data type should be torch.int32. K and V share the same page table.

  • seq_lens (torch.Tensor) – Sequence lengths tensor with shape [batch_size, beam_width]. Data type should be torch.uint32.

  • output (torch.Tensor) – Output tensor with shape [batch_size, beam_width, num_q_heads, head_dim]. Data type should match query tensor. This tensor will be modified in-place.

  • workspace_buffer (torch.Tensor) – Workspace buffer for temporary computations. Data type should be torch.uint8.

  • semaphores (torch.Tensor) – Semaphore buffer for synchronization. Data type should be torch.uint32.

  • num_kv_heads (int) – Number of key-value heads in the attention mechanism.

  • page_size (int) – Size of each page in the paged KV cache. Must be one of [16, 32, 64, 128].

  • sinks (Optional[torch.Tensor], default=None) – Attention sink values with shape [num_kv_heads, head_group_ratio]. Data type should be torch.float32. If None, no attention sinks are used.

  • q_scale (float, default=1.0) – Scale factor for query tensor.

  • kv_scale (Optional[torch.Tensor], default=None) – Scale factor for KV cache with shape [1]. Data type should be torch.float32. If None, defaults to 1.0.

  • sliding_win_size (int, default=0) – Sliding window size for attention. If 0, no sliding window is used.

  • sm_count (Optional[int], default=None) – Number of streaming multiprocessors to use. If None, will be inferred from the device.
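
Putting the parameters together, the following is a minimal usage sketch. The flat cache layout (total_num_cache_heads = total_num_pages * page_size * num_kv_heads) and the workspace/semaphore buffer sizes are assumptions for illustration only; consult the FlashInfer documentation for the sizes and layout your build actually requires.

    import torch
    from flashinfer.xqa import xqa

    batch_size, beam_width = 2, 1        # only beam_width == 1 is supported
    num_q_heads, num_kv_heads = 32, 8    # head_group_ratio = 32 // 8 = 4
    head_dim, page_size = 128, 64
    nb_pages_per_seq = 4                 # max_seq_len = 4 * 64 = 256
    total_num_pages = batch_size * nb_pages_per_seq
    device = "cuda"

    q = torch.randn(batch_size, beam_width, num_q_heads, head_dim,
                    dtype=torch.bfloat16, device=device)
    output = torch.empty_like(q)  # written in-place by the kernel

    # Paged K/V caches; the flat layout below is an assumption.
    total_num_cache_heads = total_num_pages * page_size * num_kv_heads
    k_cache = torch.randn(total_num_cache_heads, head_dim,
                          dtype=torch.bfloat16, device=device)
    v_cache = torch.randn_like(k_cache)

    # One shared page table for K and V.
    page_table = torch.arange(batch_size * nb_pages_per_seq,
                              dtype=torch.int32, device=device)
    page_table = page_table.reshape(batch_size, nb_pages_per_seq)

    seq_lens = torch.full((batch_size, beam_width), 200,
                          dtype=torch.uint32, device=device)

    # Placeholder buffer sizes (not prescribed by this doc).
    workspace_buffer = torch.zeros(128 * 1024 * 1024,
                                   dtype=torch.uint8, device=device)
    semaphores = torch.zeros(8192, dtype=torch.uint32, device=device)

    xqa(q, k_cache, v_cache, page_table, seq_lens, output,
        workspace_buffer, semaphores,
        num_kv_heads=num_kv_heads, page_size=page_size)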

Note

The function automatically infers several parameters from tensor shapes:

  • batch_size from q.shape[0]
  • num_q_heads from q.shape[2]
  • head_dim from q.shape[-1]
  • input_dtype from q.dtype
  • kv_cache_dtype from k_cache.dtype
  • head_group_ratio from num_q_heads // num_kv_heads
  • max_seq_len from page_table.shape[-1] * page_size
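
For the hypothetical shapes used in the sketch above, these rules resolve as follows:

    # Inferred values for the sketch above (hypothetical shapes).
    assert q.shape[0] == 2                            # batch_size
    assert q.shape[2] == 32                           # num_q_heads
    assert q.shape[-1] == 128                         # head_dim
    assert q.dtype == torch.bfloat16                  # input_dtype
    assert k_cache.dtype == torch.bfloat16            # kv_cache_dtype
    assert num_q_heads // num_kv_heads == 4           # head_group_ratio
    assert page_table.shape[-1] * page_size == 256    # max_seq_len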