flashinfer.cascade.merge_state
flashinfer.cascade.merge_state(v_a: torch.Tensor, s_a: torch.Tensor, v_b: torch.Tensor, s_b: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Merge the attention output ``V`` and the logsumexp value ``S`` from two KV segments. See our tutorial for the mathematical details.

Parameters:
- v_a (torch.Tensor) – The attention output from KV segment ``A``, shape: ``[seq_len, num_heads, head_dim]``.
- s_a (torch.Tensor) – The logsumexp value from KV segment ``A``, expected to be a float32 tensor, shape: ``[seq_len, num_heads]``.
- v_b (torch.Tensor) – The attention output from KV segment ``B``, shape: ``[seq_len, num_heads, head_dim]``.
- s_b (torch.Tensor) – The logsumexp value from KV segment ``B``, expected to be a float32 tensor, shape: ``[seq_len, num_heads]``.
Returns:
- V (torch.Tensor) – The merged attention output (equivalent to attention over the merged KV segment ``[A : B]``), shape: ``[seq_len, num_heads, head_dim]``.
- S (torch.Tensor) – The logsumexp value from the merged KV segment ``[A : B]``, shape: ``[seq_len, num_heads]``.
Example
>>> import torch
>>> import flashinfer
>>> seq_len = 2048
>>> num_heads = 32
>>> head_dim = 128
>>> va = torch.randn(seq_len, num_heads, head_dim).half().to("cuda:0")
>>> sa = torch.randn(seq_len, num_heads, dtype=torch.float32).to("cuda:0")
>>> vb = torch.randn(seq_len, num_heads, head_dim).half().to("cuda:0")
>>> sb = torch.randn(seq_len, num_heads, dtype=torch.float32).to("cuda:0")
>>> v_merged, s_merged = flashinfer.merge_state(va, sa, vb, sb)
>>> v_merged.shape
torch.Size([2048, 32, 128])
>>> s_merged.shape
torch.Size([2048, 32])
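For intuition, the merge can be sketched in pure PyTorch using the standard log-sum-exp recombination of partial attention states (the same identity that underpins online softmax). This is a minimal reference sketch, not flashinfer's kernel; the helper name ``merge_state_reference`` is hypothetical.

```python
import torch

def merge_state_reference(v_a, s_a, v_b, s_b):
    """Hypothetical pure-PyTorch sketch of the merge (not part of flashinfer).

    s_a / s_b hold per-(position, head) logsumexp values; exponentiating the
    difference to the merged logsumexp yields each segment's softmax weight.
    """
    # Merged logsumexp: log(exp(s_a) + exp(s_b)), computed stably.
    s_merged = torch.logaddexp(s_a, s_b)            # [seq_len, num_heads]
    # Per-segment renormalization weights, broadcast over head_dim.
    w_a = torch.exp(s_a - s_merged).unsqueeze(-1)   # [seq_len, num_heads, 1]
    w_b = torch.exp(s_b - s_merged).unsqueeze(-1)
    # Weighted combination of the two partial attention outputs.
    v_merged = v_a.float() * w_a + v_b.float() * w_b
    return v_merged.to(v_a.dtype), s_merged
```

Because the recombination is associative, outputs from any number of KV segments can be folded together pairwise in this way.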