flashinfer.decode.trtllm_batch_decode_with_kv_cache

flashinfer.decode.trtllm_batch_decode_with_kv_cache(query: Tensor, kv_cache: Tensor | Tuple[Tensor, Tensor], workspace_buffer: Tensor, block_tables: Tensor, seq_lens: Tensor, max_seq_len: int, bmm1_scale: float, bmm2_scale: float, window_left: int = -1, out: Tensor | FP4Tensor | None = None, out_dtype: str | dtype | None = None, o_sf_scale: float | None = None, o_sf_vec_size: int | None = None, sinks: List[Tensor] | None = None, kv_layout: str = 'HND', enable_pdl: bool | None = None, backend: str = 'auto', q_len_per_req: int | None = 1) → Tensor | FP4Tensor
Parameters:
  • query (torch.Tensor) – Query tensor with shape [num_tokens, num_heads, head_dim], where num_tokens = batch_size * q_len_per_req.

  • kv_cache (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – If kv_cache is a single tensor, its shape must be [num_pages, 1 or 2, num_kv_heads, page_size, head_dim] for the HND layout, or [num_pages, 1 or 2, page_size, num_kv_heads, head_dim] for the NHD layout. If kv_cache is a tuple of two tensors, the first is the key cache and the second is the value cache, each with shape [num_pages, num_kv_heads, page_size, head_dim] for HND, or [num_pages, page_size, num_kv_heads, head_dim] for NHD.

  • workspace_buffer (torch.Tensor) – Workspace buffer; must be zero-initialized before its first use.

  • block_tables (torch.Tensor) – Page table of the KV cache with shape [batch_size, num_pages]; each row lists the page indices backing one request's KV cache.

  • seq_lens (torch.Tensor) – A uint32 1D tensor of shape [batch_size] holding the KV sequence length of each request.

  • max_seq_len (int) – Maximum KV sequence length across the batch.

  • bmm1_scale (float) – Fused scale for the bmm1 (Q·Kᵀ) input; see the scale sketch in the example below.

  • bmm2_scale (float) – Fused scale for the bmm2 (softmax(QKᵀ)·V) input.

  • window_left (int = -1) – The left (inclusive) window size for sliding-window attention. When set to -1, the window covers the full length of the sequence. Defaults to -1.

  • out (Optional[Union[torch.Tensor, FP4Tensor]] = None) – Output tensor. If not provided, one is allocated with dtype out_dtype; if out_dtype is also not provided, the query's dtype is used.

  • out_dtype (Optional[Union[torch.dtype, str]] = None) – Output dtype. If not provided, the dtype of out is used. For nvfp4 output, pass the string "nvfp4".

  • o_sf_scale (Optional[float] = None) – Scale for the nvfp4 output tensor's scale factors.

  • o_sf_vec_size (Optional[int] = None) – Vector size for the nvfp4 output tensor's scale factors.

  • sinks (Optional[List[torch.Tensor]] = None) – Attention-sink values: an additional per-head term added to the softmax denominator.

  • kv_layout (str = "HND") – The layout of the input k/v tensors, either NHD or HND. Defaults to HND.

  • enable_pdl (Optional[bool] = None) – Whether to enable Programmatic Dependent Launch (PDL); see https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization. When set to None, PDL is enabled automatically if the device supports it.

  • backend (str = "auto") – The implementation backend: one of auto, xqa, or trtllm-gen. Defaults to auto, which chooses the backend from the device architecture and kernel availability: trtllm-gen for sm_100 and sm_103 (Blackwell), and xqa for sm_90 (Hopper) and sm_120 (Blackwell).

  • q_len_per_req (Optional[int] = 1) – Number of query tokens per request, so that num_tokens = batch_size * q_len_per_req. Defaults to 1.

Returns:

out – output torch.Tensor or FP4Tensor.

Return type:

Union[torch.Tensor, FP4Tensor]
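
Example:

A minimal end-to-end sketch, assuming a CUDA device supported by one of the backends. The batch, head, and page sizes are illustrative, and folding the softmax scale 1/sqrt(head_dim) into bmm1_scale reflects common usage rather than an API guarantee.

```python
import torch
import flashinfer

batch_size, num_qo_heads, num_kv_heads, head_dim = 4, 32, 8, 128
page_size, pages_per_seq = 16, 8
num_pages = batch_size * pages_per_seq
device = "cuda"

# q_len_per_req defaults to 1, so num_tokens == batch_size.
query = torch.randn(batch_size, num_qo_heads, head_dim,
                    dtype=torch.bfloat16, device=device)

# Single-tensor KV cache in HND layout:
# [num_pages, 2, num_kv_heads, page_size, head_dim]
kv_cache = torch.randn(num_pages, 2, num_kv_heads, page_size, head_dim,
                       dtype=torch.bfloat16, device=device)

# The workspace must be zero-initialized before its first use.
workspace_buffer = torch.zeros(
    128 * 1024 * 1024, dtype=torch.uint8, device=device)

# Each row maps one request to its page indices.
block_tables = torch.arange(
    num_pages, dtype=torch.int32, device=device
).reshape(batch_size, pages_per_seq)

# KV length of each request; int32 shown here for older torch builds,
# the spec above asks for uint32.
seq_lens = torch.full((batch_size,), 100, dtype=torch.int32, device=device)
max_seq_len = int(seq_lens.max())

sm_scale = 1.0 / (head_dim ** 0.5)

out = flashinfer.decode.trtllm_batch_decode_with_kv_cache(
    query=query,
    kv_cache=kv_cache,
    workspace_buffer=workspace_buffer,
    block_tables=block_tables,
    seq_lens=seq_lens,
    max_seq_len=max_seq_len,
    bmm1_scale=sm_scale,  # softmax scale only; no quantization for bf16
    bmm2_scale=1.0,
    kv_layout="HND",
)
print(out.shape)  # torch.Size([4, 32, 128])
```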
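
Because both scales are fused into the kernel, per-tensor quantization scales for fp8 inputs are typically folded into them rather than applied separately. A hedged sketch of one common convention; q_scale, k_scale, v_scale, and o_scale are hypothetical per-tensor dequantization scales, not parameters of this API:

```python
import math

head_dim = 128
# Hypothetical per-tensor dequantization scales for fp8 Q/K/V and output.
q_scale, k_scale, v_scale, o_scale = 1.0, 1.0, 1.0, 1.0

bmm1_scale = q_scale * k_scale / math.sqrt(head_dim)  # applied to Q·Kᵀ
bmm2_scale = v_scale / o_scale                        # applied to softmax(QKᵀ)·V
```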