flashinfer.prefill.fmha_v2_prefill_deepseek
- flashinfer.prefill.fmha_v2_prefill_deepseek(query: Tensor, key: Tensor, value: Tensor, out: Tensor, num_heads: int, head_dim: int, seq_len: int, scale_softmax: float, scale_bmm1: float | None = None, scale_bmm2: float | None = None, return_lse: bool = False, lse: Tensor | None = None) → Tensor | Tuple[Tensor, Tensor]
- Parameters:
query (torch.Tensor) – query tensor with shape [batch_size, seq_len, num_heads, head_dim]
key (torch.Tensor) – key tensor with shape [batch_size, seq_len, num_heads, head_dim]
value (torch.Tensor) – value tensor with shape [batch_size, seq_len, num_heads, head_dim]
out (torch.Tensor) – output tensor with shape [batch_size, seq_len, num_heads, head_dim]
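num_heads (int) – number of attention heads
head_dim (int) – dimension of each head
seq_len (int) – sequence length
scale_softmax (float) – scale for softmax
scale_bmm1 (Optional[float]) – scale for bmm1, the first batched matrix multiply (query times key); defaults to None
scale_bmm2 (Optional[float]) – scale for bmm2, the second batched matrix multiply (softmax result times value); defaults to None
return_lse (bool) – whether to return the log-sum-exp (LSE) of the attention scores in addition to the output; defaults to False
lse (Optional[torch.Tensor]) – tensor for the log-sum-exp (LSE) of the attention scores; defaults to None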
- Returns:
out – if return_lse is False, the output tensor alone; if return_lse is True, a tuple (out, lse) holding the output tensor and the LSE tensor.
- Return type:
Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
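
To make the shapes and the return convention above concrete, the following is a minimal unfused reference sketch in NumPy, not the fused kernel itself. Several things here are assumptions of the sketch rather than documented behavior: the helper name `reference_attention` is hypothetical, a single `scale` factor stands in for `scale_softmax`, `scale_bmm1`, and `scale_bmm2` (how those three interact is not specified in this entry), no attention mask is applied, the LSE uses the natural logarithm, and the LSE layout `[batch_size, num_heads, seq_len]` is chosen only for illustration.

```python
import numpy as np


def reference_attention(query, key, value, scale, return_lse=False):
    """Unfused reference: softmax(scale * Q K^T) V for each batch and head.

    query, key, value: arrays of shape [batch_size, seq_len, num_heads, head_dim].
    scale: hypothetical single scaling factor (an assumption of this sketch).
    """
    # scores[b, h, i, j] = scale * dot(query[b, i, h, :], key[b, j, h, :])
    scores = scale * np.einsum("bihd,bjhd->bhij", query, key)

    # Numerically stable log-sum-exp over the key axis j.
    row_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - row_max)
    row_sum = exp_scores.sum(axis=-1, keepdims=True)
    # Assumed layout for this sketch: [batch_size, num_heads, seq_len].
    lse = (row_max + np.log(row_sum))[..., 0]

    # out[b, i, h, :] = sum_j probs[b, h, i, j] * value[b, j, h, :]
    probs = exp_scores / row_sum
    out = np.einsum("bhij,bjhd->bihd", probs, value)

    # Mirror the documented return convention: tensor, or (tensor, lse).
    if return_lse:
        return out, lse
    return out


# Small illustrative inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
batch_size, seq_len, num_heads, head_dim = 2, 5, 3, 8
shape = (batch_size, seq_len, num_heads, head_dim)
q = rng.standard_normal(shape)
k = rng.standard_normal(shape)
v = rng.standard_normal(shape)

out, lse = reference_attention(q, k, v, scale=head_dim ** -0.5, return_lse=True)
print(out.shape, lse.shape)
```

A reference like this can serve as a correctness baseline when comparing against a fused kernel on small inputs, provided the scaling, masking, and LSE conventions are first matched to what the kernel actually does.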