flashinfer.prefill.fmha_v2_prefill_deepseek

flashinfer.prefill.fmha_v2_prefill_deepseek(query: Tensor, key: Tensor, value: Tensor, out: Tensor, num_heads: int, head_dim: int, seq_len: int, scale_softmax: float, scale_bmm1: float | None = None, scale_bmm2: float | None = None, return_lse: bool = False, lse: Tensor | None = None) → Tensor | Tuple[Tensor, Tensor]
Parameters:
  • query (torch.Tensor) – query tensor with shape [batch_size, seq_len, num_heads, head_dim]

  • key (torch.Tensor) – key tensor with shape [batch_size, seq_len, num_heads, head_dim]

  • value (torch.Tensor) – value tensor with shape [batch_size, seq_len, num_heads, head_dim]

  • out (torch.Tensor) – output tensor with shape [batch_size, seq_len, num_heads, head_dim]

  • return_lse (bool) – whether to also return the log-sum-exp (LSE) of the attention logits

  • num_heads (int) – number of heads

  • head_dim (int) – head dimension

  • seq_len (int) – sequence length

  • scale_softmax (float) – scale for softmax

  • scale_bmm1 (Optional[float]) – scale for bmm1

  • scale_bmm2 (Optional[float]) – scale for bmm2

  • lse (Optional[torch.Tensor]) – log-sum-exp (LSE) of the attention logits
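
The shape requirements listed above can be checked before a call. The following is a minimal sketch, not part of flashinfer: `check_prefill_shapes` is a hypothetical helper, and it assumes that all four tensors share one batch size and that the `num_heads`, `head_dim`, and `seq_len` arguments must match the tensor dimensions.

```python
from typing import Sequence, Tuple


def check_prefill_shapes(
    query_shape: Sequence[int],
    key_shape: Sequence[int],
    value_shape: Sequence[int],
    out_shape: Sequence[int],
    num_heads: int,
    head_dim: int,
    seq_len: int,
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: validate the documented 4-D layout.

    Every tensor is expected to be [batch_size, seq_len, num_heads, head_dim].
    Returns the common (batch_size, seq_len, num_heads, head_dim) on success.
    """
    shapes = {
        "query": tuple(query_shape),
        "key": tuple(key_shape),
        "value": tuple(value_shape),
        "out": tuple(out_shape),
    }
    expected_tail = (seq_len, num_heads, head_dim)
    for name, shape in shapes.items():
        if len(shape) != 4:
            raise ValueError(f"{name} must be 4-D, got shape {shape}")
        if shape[1:] != expected_tail:
            raise ValueError(
                f"{name} has shape {shape}, expected [batch_size, "
                f"{seq_len}, {num_heads}, {head_dim}]"
            )
    # Assumption: one shared batch size across query, key, value, and out.
    batch_sizes = {shape[0] for shape in shapes.values()}
    if len(batch_sizes) != 1:
        raise ValueError(f"mismatched batch sizes: {sorted(batch_sizes)}")
    return (batch_sizes.pop(), seq_len, num_heads, head_dim)
```

Because `torch.Size` is a tuple subclass, the helper can be called as `check_prefill_shapes(query.shape, key.shape, value.shape, out.shape, num_heads, head_dim, seq_len)`.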

Returns:

out – If return_lse is True, a tuple of two tensors: the output tensor followed by the LSE tensor. If return_lse is False, the output tensor alone.

Return type:

Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
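
To illustrate what the two returned values represent, here is a small reference computation for a single head, written in plain Python so it runs without a GPU. It is a sketch under stated assumptions, not the kernel's implementation: `reference_attention_with_lse` is a hypothetical name, and it assumes a natural-log LSE, no masking, and a single `scale` applied to the logits. The kernel's own conventions (log base, masking, and how `scale_softmax`, `scale_bmm1`, and `scale_bmm2` combine) are not stated on this page and may differ.

```python
import math
from typing import List, Tuple


def reference_attention_with_lse(
    q: List[List[float]],
    k: List[List[float]],
    v: List[List[float]],
    scale: float,
) -> Tuple[List[List[float]], List[float]]:
    """Single-head softmax attention over [seq_len][head_dim] lists.

    Assumptions (not taken from the flashinfer docs): natural logarithm,
    no causal mask, one scalar scale on the q.k logits.
    """
    out: List[List[float]] = []
    lse: List[float] = []
    head_dim = len(v[0])
    for q_row in q:
        # Scaled dot-product logits of this query against every key.
        logits = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Subtract the row maximum for numerical stability.
        row_max = max(logits)
        denom = sum(math.exp(x - row_max) for x in logits)
        lse.append(row_max + math.log(denom))
        weights = [math.exp(x - row_max) / denom for x in logits]
        # Weighted sum of value rows gives the attention output row.
        out.append(
            [sum(w * v_row[d] for w, v_row in zip(weights, v)) for d in range(head_dim)]
        )
    return out, lse
```

With `return_lse=True`, the function documented above returns a pair in the same order, output first and LSE second. In this sketch, `out` has one row per query position and `lse` has one scalar per query position.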