# Attention States and Recursive Attention

FlashInfer introduces the concept of **attention states**, which fully characterize
the attention between a query and a set of key/value pairs. We further define a
**merge** operator on **attention states**. This merge operator makes it possible to
compute the complete attention by recursively merging attention states.

Suppose we define \(s_i = \mathbf{q}\mathbf{k}_i^T\) as the pre-softmax attention score between the query \(\mathbf{q}\) and the key \(\mathbf{k}_i\). The self-attention score at index \(i\) can be generalized to an index set \(I\):

\[
s(I) = \log\left(\sum_{i\in I}\exp\left(s_i\right)\right)
\]

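We can also generalize the value at index \(i\) to an index set \(I\):

\[
\mathbf{v}(I) = \sum_{i\in I}\mathrm{softmax}(s_i)\,\mathbf{v}_i = \frac{\sum_{i\in I}\exp\left(s_i\right)\mathbf{v}_i}{\exp(s(I))}
\]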

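The \(\mathrm{softmax}\) function is restricted to the index set \(I\). Note that \(\mathbf{v}(\{1,2,\cdots, n\})\) is the self-attention output of the entire sequence.
The *attention state* of the index set \(I\) can be defined as the tuple \((s(I), \mathbf{v}(I))\). We can then define a binary **merge** operator \(\oplus\) on two attention states over disjoint index sets \(I\) and \(J\) as follows (in practice we subtract the maximum value from \(s\) to guarantee numerical stability; we omit this here for simplicity):

\[
\begin{bmatrix}\mathbf{v}(I\cup J)\\ s(I\cup J)\end{bmatrix}
= \begin{bmatrix}\mathbf{v}(I)\\ s(I)\end{bmatrix} \oplus \begin{bmatrix}\mathbf{v}(J)\\ s(J)\end{bmatrix}
= \begin{bmatrix}\dfrac{\mathbf{v}(I)\exp(s(I)) + \mathbf{v}(J)\exp(s(J))}{\exp(s(I)) + \exp(s(J))}\\[2ex] \log\left(\exp(s(I)) + \exp(s(J))\right)\end{bmatrix}
\]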

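The **merge** operator can be generalized to any number of attention state inputs:

\[
\begin{bmatrix}\mathbf{v}\left(\bigcup_{i=1}^{n} I_i\right)\\ s\left(\bigcup_{i=1}^{n} I_i\right)\end{bmatrix}
= \bigoplus_{i=1}^{n}\begin{bmatrix}\mathbf{v}(I_i)\\ s(I_i)\end{bmatrix}
= \begin{bmatrix}\sum_{i=1}^{n}\mathrm{softmax}(s(I_i))\,\mathbf{v}(I_i)\\ \log\left(\sum_{i=1}^{n}\exp(s(I_i))\right)\end{bmatrix}
\]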

The above n-ary merge operator is consistent with the binary merge operator, and we can prove that the operator is *commutative* and *associative*. There are different ways to obtain the attention state of the entire sequence by merging the attention states of index subsets, and the final outcome is mathematically equivalent.

Note

The generalized score \(s\) is also known as log-sum-exp (`lse` for short).
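The definitions above can be checked numerically. The following is a minimal sketch in plain Python, not FlashInfer's implementation: the helper names `attention_state` and `merge` and the toy data are hypothetical, states are stored as `(v, s)` pairs, and the maximum is subtracted before exponentiating, as the text says is done in practice.

```python
import math

def attention_state(q, keys, values):
    """Return the attention state (v, s) of query q over the given keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]  # s_i = q . k_i
    m = max(scores)                                # max-subtraction for stability
    weights = [math.exp(x - m) for x in scores]
    total = sum(weights)
    s = m + math.log(total)                        # s(I): log-sum-exp of the scores
    v = [sum(w * val[j] for w, val in zip(weights, values)) / total
         for j in range(len(values[0]))]           # v(I): softmax-weighted values
    return v, s

def merge(state_a, state_b):
    """Binary merge of two attention states over disjoint index sets."""
    (v_a, s_a), (v_b, s_b) = state_a, state_b
    m = max(s_a, s_b)
    w_a, w_b = math.exp(s_a - m), math.exp(s_b - m)
    v = [(x * w_a + y * w_b) / (w_a + w_b) for x, y in zip(v_a, v_b)]
    return v, m + math.log(w_a + w_b)

# Hypothetical toy data: one 2-d query, four key/value pairs.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = attention_state(q, keys, values)                    # I = {1, 2, 3, 4}
merged = merge(attention_state(q, keys[:1], values[:1]),   # {1}
               attention_state(q, keys[1:], values[1:]))   # {2, 3, 4}

# Merging the two partial states reproduces attention over the whole sequence.
assert all(math.isclose(a, b) for a, b in zip(full[0], merged[0]))
assert math.isclose(full[1], merged[1])
```

Because only `v` and `s` are kept per subset, the merge never needs to revisit the original keys and values.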

## Applications

Note that the \(\oplus\) operator is **commutative** and **associative**, which means we can
safely offload the self-attention computation on subsets of the KV to different devices
and **merge** the results **in any order**.
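As an illustration of order independence, the sketch below merges a handful of per-index states `(v_i, s_i)` in three ways: with a direct n-ary formula, sequentially in a shuffled order, and as a two-level tree. It assumes plain Python lists and hypothetical helper names (`merge`, `merge_all`); it is not FlashInfer's API.

```python
import math
import random
from functools import reduce

def merge(a, b):
    """Binary merge of attention states (v, s); v is a list, s a float."""
    (v_a, s_a), (v_b, s_b) = a, b
    m = max(s_a, s_b)                              # max-subtraction for stability
    w_a, w_b = math.exp(s_a - m), math.exp(s_b - m)
    v = [(x * w_a + y * w_b) / (w_a + w_b) for x, y in zip(v_a, v_b)]
    return v, m + math.log(w_a + w_b)

def merge_all(states):
    """n-ary merge: softmax-weighted sum of the v's, log-sum-exp of the s's."""
    m = max(s for _, s in states)
    weights = [math.exp(s - m) for _, s in states]
    total = sum(weights)
    dim = len(states[0][0])
    v = [sum(w * v_i[j] for w, (v_i, _) in zip(weights, states)) / total
         for j in range(dim)]
    return v, m + math.log(total)

# Hypothetical per-index states: a single key/value pair is its own state (v_i, s_i).
states = [([1.0, 0.0], 0.3), ([0.0, 1.0], -1.2), ([2.0, 2.0], 2.5),
          ([-1.0, 3.0], 0.0), ([4.0, -2.0], 1.1)]

reference = merge_all(states)

shuffled = states[:]
random.Random(0).shuffle(shuffled)                 # arbitrary arrival order
sequential = reduce(merge, shuffled)

# Tree-shaped reduction: two partial results combined in a final step.
tree = merge(reduce(merge, states[:2]), reduce(merge, states[2:]))

def close(a, b):
    return math.isclose(a[1], b[1]) and all(
        math.isclose(x, y) for x, y in zip(a[0], b[0]))

assert close(reference, sequential) and close(reference, tree)
```

The results agree up to floating-point rounding, which is why partial states can be combined in whatever order they become available.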

There are several interesting applications of this recursive form of self-attention in FlashInfer so far:

- Shared-Prefix Batch Decoding
Many LLM applications involve batch decoding with a shared long prompt. FlashInfer decomposes attention over the entire KV-Cache into shared-prefix attention and unique-suffix attention. This decomposition enables offloading these components to different kernel implementations, resulting in a 30x acceleration of the operator in scenarios with long context and large batch size. Check our blog post for more details about this application, and Cascade Attention for how to use this feature in FlashInfer.

- KV Sequence Parallelism
For long-context LLM inference/serving, the batch size and number of heads per GPU are limited by GPU memory, and the default parallelism strategy cannot use all SMs in the GPU, which results in suboptimal performance. Inspired by the Split-K trick in GEMM optimizations, FlashInfer partitions the KV sequence dimension, dispatches the attention computations to different thread blocks, and merges them in a second pass. The same idea was also proposed in Flash-Decoding; check their blog post for visualizations and more details.
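Both applications reduce to computing attention states on parts of the KV-Cache and merging them. The sketch below shows this on toy data in plain Python; the helper names, the `requests` structure, and the chunk size are hypothetical, and it says nothing about how FlashInfer's kernels are actually organized.

```python
import math
from functools import reduce

def attention_state(q, kv):
    """Attention state (v, s) of query q over a list of (key, value) pairs."""
    scores = [sum(a * b for a, b in zip(q, k)) for k, _ in kv]
    m = max(scores)                                # max-subtraction for stability
    w = [math.exp(x - m) for x in scores]
    total = sum(w)
    dim = len(kv[0][1])
    v = [sum(wi * val[j] for wi, (_, val) in zip(w, kv)) / total
         for j in range(dim)]
    return v, m + math.log(total)

def merge(a, b):
    """Binary merge of two attention states over disjoint parts of the KV."""
    (v_a, s_a), (v_b, s_b) = a, b
    m = max(s_a, s_b)
    w_a, w_b = math.exp(s_a - m), math.exp(s_b - m)
    v = [(x * w_a + y * w_b) / (w_a + w_b) for x, y in zip(v_a, v_b)]
    return v, m + math.log(w_a + w_b)

# Hypothetical toy data: one shared prefix, two requests with their own suffix.
prefix = [([1.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [2.0, 0.0]), ([1.0, 1.0], [0.0, 3.0])]
requests = [
    {"q": [0.2, -0.4], "suffix": [([0.5, 0.5], [4.0, 4.0])]},
    {"q": [-1.0, 0.3], "suffix": [([2.0, -1.0], [5.0, 1.0]), ([0.0, 0.5], [6.0, 2.0])]},
]

# Shared-prefix decoding: prefix and suffix attention are computed separately, then merged.
cascade = [merge(attention_state(r["q"], prefix), attention_state(r["q"], r["suffix"]))
           for r in requests]
baseline = [attention_state(r["q"], prefix + r["suffix"]) for r in requests]

# KV sequence parallelism: split one request's KV into chunks, merge in a second pass.
kv = prefix + requests[1]["suffix"]
chunks = [kv[i:i + 2] for i in range(0, len(kv), 2)]
split_kv = reduce(merge, [attention_state(requests[1]["q"], c) for c in chunks])

def close(a, b):
    return math.isclose(a[1], b[1]) and all(
        math.isclose(x, y) for x, y in zip(a[0], b[0]))

assert all(close(c, b) for c, b in zip(cascade, baseline))
assert close(split_kv, baseline[1])
```

In the shared-prefix case the prefix keys and values are the same for every request, which is what allows a separate kernel to process them once per batch instead of once per request.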