flashinfer.rope.apply_rope_with_cos_sin_cache
- flashinfer.rope.apply_rope_with_cos_sin_cache(positions: torch.Tensor, query: torch.Tensor, key: torch.Tensor, head_size: int, cos_sin_cache: torch.Tensor, is_neox: bool = True) → Tuple[torch.Tensor, torch.Tensor]
Apply rotary embedding to keys and queries with precomputed cos/sin values. This is designed to be compatible with the SGL/vLLM implementation.
- Parameters:
  - positions (torch.Tensor) – Position indices, shape: (nnz).
  - query (torch.Tensor) – Query tensor, shape: (nnz, num_q_heads * head_size).
  - key (torch.Tensor) – Key tensor, shape: (nnz, num_k_heads * head_size).
  - head_size (int) – The size of each attention head.
  - cos_sin_cache (torch.Tensor) – Cosine and sine cache tensor, shape: (max_seq_len, rotary_dim). The cosine values occupy the first half of the last dimension and the sine values the second half.
  - is_neox (bool) – Whether to use Neox-style RoPE, default: True. If True, the last dimension of the query/key tensor is not interleaved, i.e., we rotate the first half of the dimensions ([..., :head_dim//2]) against the second half ([..., head_dim//2:]). If False, the last dimension of the query/key tensor is interleaved, i.e., we rotate the even dimensions ([..., ::2]) against the odd dimensions ([..., 1::2]). See the sketch after this list.
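To make the is_neox distinction concrete, here is a minimal plain-PyTorch sketch of the two rotation layouts. This is not the flashinfer kernel, and rotate_neox / rotate_interleaved are hypothetical helper names introduced here for illustration only.

```python
import torch

def rotate_neox(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Neox style: rotate the first half [..., :d//2] against the second half [..., d//2:].
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def rotate_interleaved(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # GPT-J / interleaved style: rotate the even dims [..., ::2] against the odd dims [..., 1::2].
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out

x = torch.randn(2, 8)             # (tokens, head_dim)
ang = torch.rand(2, 4)            # one rotation angle per rotated pair
print(rotate_neox(x, ang.cos(), ang.sin()).shape)         # torch.Size([2, 8])
print(rotate_interleaved(x, ang.cos(), ang.sin()).shape)  # torch.Size([2, 8])
```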
- Returns:
  - query_out (torch.Tensor) – The rotated query tensor, shape: (nnz, num_q_heads * head_size).
  - key_out (torch.Tensor) – The rotated key tensor, shape: (nnz, num_k_heads * head_size).
Note
The rotary dimension is determined by the cos/sin cache: the last dimension of cos_sin_cache equals rotary_dim.
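A minimal usage sketch, assuming flashinfer is installed and a CUDA device is available. All sizes, the base frequency 1e4, and the float32 cache dtype below are illustrative assumptions, not requirements stated here; the cache is built with cosine in the first half of the last dimension and sine in the second half, per the description above.

```python
import torch
from flashinfer.rope import apply_rope_with_cos_sin_cache

num_q_heads, num_k_heads, head_size = 32, 8, 128
nnz, max_seq_len, rotary_dim = 16, 4096, head_size

# Build the cache: cos in the first half of the last dim, sin in the second half.
base = 1e4
inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
t = torch.arange(max_seq_len, dtype=torch.float32)
freqs = torch.outer(t, inv_freq)  # (max_seq_len, rotary_dim // 2)
cos_sin_cache = torch.cat([freqs.cos(), freqs.sin()], dim=-1).to("cuda:0")

positions = torch.randint(0, max_seq_len, (nnz,), device="cuda:0")
query = torch.randn(nnz, num_q_heads * head_size, dtype=torch.float16, device="cuda:0")
key = torch.randn(nnz, num_k_heads * head_size, dtype=torch.float16, device="cuda:0")

query_out, key_out = apply_rope_with_cos_sin_cache(
    positions, query, key, head_size, cos_sin_cache, is_neox=True
)
print(query_out.shape, key_out.shape)  # (nnz, num_q_heads * head_size), (nnz, num_k_heads * head_size)
```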