flashinfer.cascade.merge_state_in_place
flashinfer.cascade.merge_state_in_place(v: torch.Tensor, s: torch.Tensor, v_other: torch.Tensor, s_other: torch.Tensor, mask: torch.Tensor | None = None) → None
Merge the self-attention state (v, s) with another state (v_other, s_other) in-place.

Parameters:
v (torch.Tensor) – The partial attention output to be updated in-place, shape: (seq_len, num_heads, head_dim).
s (torch.Tensor) – The partial logsumexp value to be updated in-place, expected to be a float32 tensor, shape: (seq_len, num_heads).
v_other (torch.Tensor) – The other attention output to be merged, shape: (seq_len, num_heads, head_dim).
s_other (torch.Tensor) – The other logsumexp value to be merged, expected to be a float32 tensor, shape: (seq_len, num_heads).
mask (Optional[torch.Tensor]) – Boolean mask indicating, per sequence, whether to merge its state; useful for CUDA graphs. If not specified (default), states are merged for all sequences, shape: (seq_len,). See the masked-merge sketch after the example below.
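Note
As a sketch of the expected semantics, assuming the standard log-sum-exp merge of partial attention states (the formulas below are not quoted from this page):

    s_{merged} = \log(e^{s} + e^{s_{other}})
    v_{merged} = e^{s - s_{merged}} \, v + e^{s_{other} - s_{merged}} \, v_{other}

v and s are then overwritten with v_{merged} and s_{merged} for every sequence selected by mask.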
Example
>>> import torch
>>> import flashinfer
>>> seq_len = 2048
>>> num_heads = 32
>>> head_dim = 128
>>> v = torch.randn(seq_len, num_heads, head_dim).half().to("cuda:0")
>>> s = torch.randn(seq_len, num_heads, dtype=torch.float32).to("cuda:0")
>>> v_other = torch.randn(seq_len, num_heads, head_dim).half().to("cuda:0")
>>> s_other = torch.randn(seq_len, num_heads, dtype=torch.float32).to("cuda:0")
>>> flashinfer.merge_state_in_place(v, s, v_other, s_other)
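A minimal sketch of masked merging, continuing from the tensors above (the mask values are illustrative, not from the original documentation): only sequences whose mask entry is True have their state merged, which keeps the launch shape static for CUDA graph capture.

>>> # Hypothetical mask: merge only the first half of the sequences.
>>> mask = torch.zeros(seq_len, dtype=torch.bool, device="cuda:0")
>>> mask[: seq_len // 2] = True
>>> flashinfer.merge_state_in_place(v, s, v_other, s_other, mask=mask)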