flashinfer.cute_dsl

CuTe-DSL implementations of selected FlashInfer kernels. These symbols are available only when the nvidia-cutlass-dsl package is installed and the host has a supported NVIDIA GPU; the module guards its imports with is_cute_dsl_available().

Note

A handful of GEMM symbols (grouped_gemm_nt_masked, Sm100BlockScaledPersistentDenseGemmKernel, create_scale_factor_tensor) used to live in flashinfer.cute_dsl and are still re-exported for backwards compatibility, but their canonical home is flashinfer.gemm. New code should import from flashinfer.gemm.
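A minimal sketch of the import style recommended in the note above. It assumes only that the listed symbol is exposed by flashinfer.gemm; the None fallback and the HAS_GROUPED_GEMM flag are illustrative conventions, not something FlashInfer provides.

```python
# Prefer the canonical location named in the note. When FlashInfer (or its
# optional CuTe DSL dependencies) is not installed, fall back to None.
# The None fallback is a hypothetical convention used for illustration.
try:
    from flashinfer.gemm import grouped_gemm_nt_masked
except ImportError:
    grouped_gemm_nt_masked = None

# Callers can branch on this flag instead of re-attempting the import.
HAS_GROUPED_GEMM = grouped_gemm_nt_masked is not None
```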

Availability

is_cute_dsl_available()

Return True when the optional CuTe DSL stack is importable.
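One way to use this check is to guard optional imports behind it. The sketch below assumes nothing beyond the names documented on this page; the helper cute_dsl_ready and the None marker are hypothetical and exist only for illustration.

```python
def cute_dsl_ready() -> bool:
    """Return True only if FlashInfer imports and reports CuTe DSL support."""
    try:
        from flashinfer.cute_dsl import is_cute_dsl_available
    except ImportError:
        # FlashInfer itself, or its cute_dsl subpackage, is not installed.
        return False
    return bool(is_cute_dsl_available())


if cute_dsl_ready():
    # Safe to import the guarded symbols.
    from flashinfer.cute_dsl import rmsnorm_fp4quant
else:
    # Hypothetical marker so callers can pick another code path.
    rmsnorm_fp4quant = None
```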

RMSNorm + FP4 Quantization

`rmsnorm_fp4quant`(input, weight[, y_fp4, ...])	Fused RMS normalization with FP4 quantization using CuTe-DSL.
`add_rmsnorm_fp4quant`(input, residual, weight)	Fused Add + RMS normalization + FP4 quantization using CuTe-DSL.

class flashinfer.cute_dsl.RMSNormFP4QuantKernel(dtype: Numeric, H: int, block_size: int, output_swizzled: bool, is_fp16: bool, sm_version: int | None = None, scale_format: str | None = None)

Fused RMSNorm + FP4 Quantization Kernel.

Key optimizations:

1. Half2/BFloat2 SIMD for max-abs computation
2. Branchless scale clamping via fmin_f32
3. Cluster synchronization for large H dimensions
4. Direct 128-bit vectorized global loads

__init__(dtype: Numeric, H: int, block_size: int, output_swizzled: bool, is_fp16: bool, sm_version: int | None = None, scale_format: str | None = None)

kernel(mX: Tensor, mW: Tensor, mY: Tensor, mS: Tensor, mGlobalScale: Tensor, M: Int32, eps: Float32, enable_pdl: Constexpr[bool], tv_layout: Layout, tiler_mn: int | Integer | Tuple[Shape, ...])

Device kernel with cluster synchronization for large H.

mGlobalScale contains the global scale value. The kernel reads it and computes 1/global_scale, which is multiplied with rstd to apply: y = x * rstd * w / global_scale = rmsnorm(x, w) / global_scale
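The formula above can be checked against a plain-Python reference for a single row. This is a sketch, not the kernel: it assumes the usual RMSNorm definition rstd = 1 / sqrt(mean(x^2) + eps), and the function name rmsnorm_scaled is hypothetical.

```python
import math
from typing import List


def rmsnorm_scaled(x: List[float], w: List[float], eps: float,
                   global_scale: float) -> List[float]:
    """Reference for one row: y = x * rstd * w / global_scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    rstd = 1.0 / math.sqrt(mean_sq + eps)  # assumed standard RMSNorm rstd
    # Fold 1/global_scale into rstd once, mirroring the description above.
    row_scale = rstd * (1.0 / global_scale)
    return [xi * row_scale * wi for xi, wi in zip(x, w)]


y = rmsnorm_scaled([3.0, 4.0], [1.0, 1.0], eps=0.0, global_scale=2.0)
```

Folding the reciprocal into rstd means each element needs one multiply by a per-row constant rather than a separate division.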

class flashinfer.cute_dsl.AddRMSNormFP4QuantKernel(dtype: Numeric, H: int, block_size: int, output_swizzled: bool, is_fp16: bool, sm_version: int | None = None, scale_format: str | None = None, output_both_sf_layouts: bool = False)

Fused Add + RMSNorm + FP4 Quantization Kernel.

Computes:

residual = input + residual (in-place update)
y = RMSNorm(residual) * weight
quantize y to FP4

The residual tensor is modified in-place. Supports both NVFP4 (block_size=16) and MXFP4 (block_size=32) formats.
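To illustrate what block-wise FP4 quantization means, the sketch below quantizes a row in blocks of block_size onto the E2M1 value grid. It is a simplified model under stated assumptions: the per-block scale is kept as a plain float (real NVFP4 and MXFP4 typically store it in an 8-bit format), rounding ties go to the smaller magnitude, and all function names are hypothetical.

```python
from typing import List, Tuple

# Magnitudes representable by the 4-bit E2M1 format.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def nearest_e2m1(magnitude: float) -> float:
    """Pick the closest grid magnitude (ties resolve to the smaller one)."""
    return min(E2M1_GRID, key=lambda g: abs(magnitude - g))


def quantize_block(block: List[float]) -> Tuple[float, List[float]]:
    """Return (scale, grid values) so that value ~= scale * grid value."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude onto the top of the grid (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0.0 else 1.0
    quantized = []
    for v in block:
        mag = nearest_e2m1(abs(v) / scale)
        quantized.append(-mag if v < 0.0 else mag)
    return scale, quantized


def quantize_row(row: List[float],
                 block_size: int) -> List[Tuple[float, List[float]]]:
    """Quantize a row block by block (16 for NVFP4, 32 for MXFP4)."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    return [quantize_block(row[i:i + block_size])
            for i in range(0, len(row), block_size)]


scale, q = quantize_block([6.0, -3.0, 0.4, 0.0])
```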

__init__(dtype: Numeric, H: int, block_size: int, output_swizzled: bool, is_fp16: bool, sm_version: int | None = None, scale_format: str | None = None, output_both_sf_layouts: bool = False)

kernel(mX: Tensor, mR: Tensor, mW: Tensor, mY: Tensor, mS: Tensor, mS_unswizzled: Tensor, mGlobalScale: Tensor, M: Int32, eps: Float32, enable_pdl: Constexpr[bool], tv_layout: Layout, tiler_mn: int | Integer | Tuple[Shape, ...])

Device kernel with cluster sync and Half2 SIMD.

Performs:

1. h = input + residual (writes h back to mR in-place)
2. y = h * rstd * w / global_scale = rmsnorm(h, w) / global_scale
3. quantizes y to FP4

mGlobalScale contains the global scale value. The kernel reads it and computes 1/global_scale, which is multiplied with rstd to apply: y = h * rstd * w / global_scale = rmsnorm(h, w) / global_scale
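The add and normalize steps, including the in-place residual update, can be modelled for one row in plain Python. This is a reference sketch only: it assumes the usual RMSNorm definition of rstd, stops before FP4 quantization, and uses a hypothetical function name.

```python
import math
from typing import List


def add_rmsnorm_scaled(x: List[float], residual: List[float], w: List[float],
                       eps: float, global_scale: float) -> List[float]:
    """Return y = rmsnorm(x + residual, w) / global_scale for one row."""
    # Step 1: h = input + residual, written back into `residual` in place.
    for i, xi in enumerate(x):
        residual[i] += xi
    # Step 2: normalize h, folding 1/global_scale into the row scale.
    mean_sq = sum(h * h for h in residual) / len(residual)
    row_scale = (1.0 / math.sqrt(mean_sq + eps)) / global_scale
    return [h * row_scale * wi for h, wi in zip(residual, w)]


res = [1.0, 2.0]
y = add_rmsnorm_scaled([2.0, 2.0], res, [1.0, 1.0], eps=0.0, global_scale=1.0)
# `res` now holds the updated residual h = [3.0, 4.0].
```

Because the residual is overwritten, callers that still need the original residual values have to copy them before the call.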

Attention Wrappers

CuTe-DSL implementations of the batch attention wrappers.

class flashinfer.cute_dsl.attention.wrappers.batch_mla.BatchMLADecodeCuteDSLWrapper(workspace_buffer: Tensor)

PyTorch-facing wrapper for the modular MLA decode kernel.

Usage:

wrapper = BatchMLADecodeCuteDSLWrapper(workspace_buffer)
wrapper.plan(
    kv_lora_rank=512, qk_rope_head_dim=64, num_heads=128,
    page_size=64, q_dtype=torch.bfloat16,
)
out = wrapper.run(query, kv_cache, block_tables, seq_lens, max_seq_len,
                  softmax_scale=0.125)

__init__(workspace_buffer: Tensor) → None

Bind the wrapper to a user-provided workspace buffer.

Parameters:

workspace_buffer (torch.Tensor) – Pre-allocated workspace buffer on the target CUDA device. Must have dtype torch.int8 or torch.uint8; the size determines the maximum batch this wrapper can handle without re-allocation.

plan(kv_lora_rank: int = 512, qk_rope_head_dim: int = 64, num_heads: int = 128, page_size: int = 1, q_dtype: dtype = torch.bfloat16, out_dtype: dtype | None = None, is_var_seq: bool = True, enable_pdl: bool | None = None, variant: AttentionVariant | None = None) → None

Compile (or retrieve cached) MLA decode kernel for the given config.

Parameters:

kv_lora_rank (int) – Latent dimension (e.g. 512).
qk_rope_head_dim (int) – RoPE dimension (e.g. 64).
num_heads (int) – Number of attention heads (typically 128 for DeepSeek-V3).
page_size (int) – KV cache page size.
q_dtype (torch.dtype) – Query/KV data type (float16 or bfloat16).
out_dtype (Optional[torch.dtype]) – Output data type. Defaults to same as q_dtype.
is_var_seq (bool) – Whether sequence lengths vary across the batch.
enable_pdl (Optional[bool]) – Whether to enable Programmatic Dependent Launch. Auto-detects if None.
variant (Optional[AttentionVariant]) – Attention variant (ALiBi, SoftCapping, AttentionWithSink, etc.). None uses standard softmax attention.

run(q: Tensor, kv_cache: Tensor, block_tables: Tensor, seq_lens: Tensor, max_seq_len: int, softmax_scale: float, output_scale: float = 1.0, out: Tensor | None = None) → Tensor

Run the MLA decode kernel.

Parameters:

q (torch.Tensor) – [B, q_len, H, D_qk] where D_qk = kv_lora_rank + qk_rope_head_dim.
kv_cache (torch.Tensor) – [num_pages, page_size, D_total] (3D) or [num_pages, 1, page_size, D_total] (4D).
block_tables (torch.Tensor) – [B, max_pages] page table indices.
seq_lens (torch.Tensor) – [B] per-request KV sequence lengths.
max_seq_len (int) – Maximum sequence length across the batch.
softmax_scale (float) – Scale factor for QK^T before softmax.
output_scale (float) – Scale factor applied to the output.
out (Optional[torch.Tensor]) – Pre-allocated output [B, q_len, H, kv_lora_rank].

Returns:

Output tensor [B, q_len, H, kv_lora_rank].

Return type:

torch.Tensor
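The shape relationships in the parameter list above can be worked through with a small helper. This is a sketch with a hypothetical function name; it assumes block_tables needs at least ceil(max_seq_len / page_size) columns, and it leaves out kv_cache because D_total is not defined on this page.

```python
from typing import Dict, Tuple


def mla_decode_shapes(batch: int, q_len: int, num_heads: int,
                      kv_lora_rank: int, qk_rope_head_dim: int,
                      page_size: int,
                      max_seq_len: int) -> Dict[str, Tuple[int, ...]]:
    """Derive the documented tensor shapes for one decode call."""
    d_qk = kv_lora_rank + qk_rope_head_dim  # D_qk as defined for `q`
    # Assumption: enough page-table columns to cover the longest sequence.
    max_pages = -(-max_seq_len // page_size)  # ceiling division
    return {
        "q": (batch, q_len, num_heads, d_qk),
        "block_tables": (batch, max_pages),
        "seq_lens": (batch,),
        "out": (batch, q_len, num_heads, kv_lora_rank),
    }


shapes = mla_decode_shapes(batch=4, q_len=1, num_heads=128,
                           kv_lora_rank=512, qk_rope_head_dim=64,
                           page_size=64, max_seq_len=1000)
```

With the defaults shown in the usage example, the query head dimension is 512 + 64 = 576 while the output keeps only the 512-wide latent dimension.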

class flashinfer.cute_dsl.attention.wrappers.batch_prefill.BatchPrefillCuteDSLWrapper(float_workspace_buffer: Tensor, use_cuda_graph: bool = False)

PyTorch-facing wrapper for the CuTe-DSL ragged-KV batch prefill kernel.

This wrapper exposes a plan + run API compatible with flashinfer.prefill.BatchPrefillWithRaggedKVCacheWrapper, but compiles a CuTe-DSL kernel under the hood instead of the C++ FA2/FA3 path.

Example

wrapper = BatchPrefillCuteDSLWrapper(workspace_buffer)
wrapper.plan(qo_indptr, kv_indptr,
             num_qo_heads=32, num_kv_heads=8, head_dim_qk=128)
out = wrapper.run(q, k, v)

__init__(float_workspace_buffer: Tensor, use_cuda_graph: bool = False) → None

Initialise the wrapper and bind it to a workspace buffer.

Parameters:

float_workspace_buffer (torch.Tensor) – Pre-allocated workspace buffer on the target CUDA device. Named for API parity with BatchPrefillWithRaggedKVCacheWrapper; callers typically pass a torch.uint8 tensor. The CuTe-DSL kernel itself does not consume this buffer, but it is retained so the wrapper can mirror the parent API.
use_cuda_graph (bool) – Whether the wrapper will be used inside a CUDA graph capture. Defaults to False.

plan(qo_indptr, kv_indptr, num_qo_heads, num_kv_heads, head_dim_qk, head_dim_vo=None, causal=True, sm_scale=1.0, q_data_type=torch.float16, kv_data_type=torch.float16, window_left: int = -1, variant: AttentionVariant | None = None) → None

Compile the FMHA prefill kernel for the given configuration.

Parameters:

qo_indptr (torch.Tensor) – Cumulative query sequence lengths, shape [batch_size + 1].
kv_indptr (torch.Tensor) – Cumulative KV sequence lengths, shape [batch_size + 1].
num_qo_heads (int) – Number of query/output heads.
num_kv_heads (int) – Number of key/value heads (must divide num_qo_heads).
head_dim_qk (int) – Head dimension for queries and keys.
head_dim_vo (Optional[int]) – Head dimension for values and output. Must equal head_dim_qk if set.
causal (bool) – Whether to apply causal masking.
sm_scale (float) – Softmax scale factor (typically 1/sqrt(head_dim)).
q_data_type (torch.dtype) – Data type for queries (float16, bfloat16, or float8_e4m3fn).
kv_data_type (torch.dtype) – Data type for keys/values.
window_left (int) – Sliding window size. -1 disables sliding window.
variant (Optional[AttentionVariant]) – Attention variant (ALiBi, RPE, Sigmoid, etc.). None uses standard softmax.
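The two indptr arguments are prefix sums over per-request lengths. The sketch below builds them as plain Python lists and computes the typical softmax scale; the helper name build_indptr is hypothetical, and in real use the lists would be converted to tensors on the target device before being passed to plan().

```python
import math
from itertools import accumulate
from typing import List


def build_indptr(seq_lens: List[int]) -> List[int]:
    """Prefix sums with a leading 0; length is batch_size + 1."""
    return [0, *accumulate(seq_lens)]


qo_lens = [3, 5, 2]   # query tokens per request
kv_lens = [7, 5, 12]  # key/value tokens per request

qo_indptr = build_indptr(qo_lens)  # [0, 3, 8, 10]
kv_indptr = build_indptr(kv_lens)  # [0, 7, 12, 24]

# Request i owns query rows qo_indptr[i]:qo_indptr[i + 1] of the packed q.
head_dim_qk = 128
sm_scale = 1.0 / math.sqrt(head_dim_qk)  # the "typical" value noted above
```

Note that the documented default for sm_scale is 1.0, so the scaled value has to be passed explicitly when it is wanted.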

run(q: Tensor, k: Tensor, v: Tensor, out: Tensor | None = None) → Tensor

Run the prefill attention computation.

Parameters:

q (torch.Tensor) – The query tensor with shape [total_q_len, num_qo_heads, head_dim_qk].
k (torch.Tensor) – The key tensor with shape [total_kv_len, num_kv_heads, head_dim_qk].
v (torch.Tensor) – The value tensor with shape [total_kv_len, num_kv_heads, head_dim_vo].
out (Optional[torch.Tensor], optional) – The output tensor. If None, a new tensor will be created.

Returns:

The output tensor with shape [total_q_len, num_qo_heads, head_dim_vo].

Return type:

torch.Tensor

class flashinfer.cute_dsl.attention.wrappers.batch_decode.BatchDecodeCuteDSLWrapper(float_workspace_buffer: Tensor, use_cuda_graph: bool = False)

PyTorch-facing wrapper for the ragged-KV CuTe DSL GQA decode kernel.

Assumes a contiguous (non-paged) KV cache in which every request in the batch has the same KV sequence length. For paged KV caches with varying sequence lengths, use BatchDecodePagedCuteDSLWrapper instead.

__init__(float_workspace_buffer: Tensor, use_cuda_graph: bool = False) → None

Construct a ragged-KV CuTe DSL decode wrapper.

Parameters:

float_workspace_buffer (torch.Tensor) – Pre-allocated float32 workspace buffer used by the underlying CuTe DSL kernel for split-K partial reductions. The wrapper does not resize this buffer; the caller is responsible for sizing it for the largest expected batch (see plan()). The buffer’s device determines the device of subsequent kernel launches.
use_cuda_graph (bool) – If True, prepare the wrapper for capture in a CUDA graph so that subsequent run() calls are graph-safe (no host sync, stable workspace pointers). Defaults to False.

plan(batch_size: int, max_kv_len: int, num_qo_heads: int, num_kv_heads: int, head_dim: int, *deprecated_positional_args: Any, **kwargs: Any) → None

Compile the ragged-KV decode kernel for the planned configuration.

Parameters:

batch_size (int) – Representative batch size used to auto-tune kv_splits. Runtime batches may differ.
max_kv_len (int) – Representative KV sequence length used for kv_splits tuning.
num_qo_heads (int) – Number of query/output heads. Must be a multiple of num_kv_heads.
num_kv_heads (int) – Number of key/value heads.
head_dim (int) – Head dimension. Must be a multiple of 64.
q_data_type (torch.dtype) – Query dtype. Must match kv_data_type.
kv_data_type (torch.dtype) – Key/value dtype. Must match q_data_type.
o_data_type (torch.dtype) – Output dtype. Defaults to q_data_type (or float16 for fp8 inputs).
sm_scale (Optional[float]) – Softmax scale; defaults to 1 / sqrt(head_dim).
kv_splits (Optional[int]) – Threadblocks per sequence (flash decoding). None auto-tunes from the planned shape and the device SM count.
reduction (str) – "kernel", "atomic", "none", or "auto" (default). "none" skips flash-decoding entirely (no reduction kernel, no cluster atomics) and requires kv_splits == 1. Atomic reduction is faster than kernel reduction but requires kv_splits in {1, 2, 4, 8, 16} and an output dtype in {float32, float16, bfloat16}. "auto" picks "none" when kv_splits == 1, "atomic" for compatible dtypes and small kv_splits, else "kernel".
q_len_per_req (int) – Predicted tokens per request (1 for plain decode, >1 for speculative decode).
window_left (int) – Sliding-window left bound. None disables the left bound.
window_right (int) – Sliding-window right bound. None disables the right bound.
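The reduction rules described above can be summarised as a small decision function. This is a sketch of the documented behaviour, not the wrapper's code: the function name is hypothetical, dtypes are represented as strings for simplicity, and "small kv_splits" is assumed to mean membership in the atomic-compatible set, which may not match the real threshold.

```python
ATOMIC_KV_SPLITS = frozenset({1, 2, 4, 8, 16})
ATOMIC_OUT_DTYPES = frozenset({"float32", "float16", "bfloat16"})
VALID_MODES = frozenset({"kernel", "atomic", "none", "auto"})


def resolve_reduction(reduction: str, kv_splits: int, out_dtype: str) -> str:
    """Resolve "auto" and validate explicit choices per the documented rules."""
    if reduction not in VALID_MODES:
        raise ValueError(f"unknown reduction mode: {reduction!r}")
    atomic_ok = (kv_splits in ATOMIC_KV_SPLITS
                 and out_dtype in ATOMIC_OUT_DTYPES)
    if reduction == "auto":
        if kv_splits == 1:
            return "none"  # nothing to reduce
        # Assumption: "small kv_splits" means the atomic-compatible set.
        return "atomic" if atomic_ok else "kernel"
    if reduction == "none" and kv_splits != 1:
        raise ValueError('"none" requires kv_splits == 1')
    if reduction == "atomic" and not atomic_ok:
        raise ValueError("atomic reduction needs a compatible "
                         "kv_splits and output dtype")
    return reduction
```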

Note

Optional arguments after head_dim are accepted positionally for backward compatibility, but that calling convention is deprecated and scheduled for removal in a future release. Pass them by keyword instead. window_left and window_right are keyword-only. The legacy is_causal argument is deprecated; use window_left and window_right instead.

run(q: Tensor, k: Tensor, v: Tensor, *deprecated_positional_args: Any, **kwargs: Any) → Tensor

Run ragged-KV GQA decode.

Parameters:

q (torch.Tensor) – Shape [batch_size, q_len_per_req, num_qo_heads, head_dim]. q_len_per_req is read from q.shape[1] at run time; it does not have to match the value passed to plan() (which is only a compile-time tile-size hint).
k (torch.Tensor) – Shape [batch_size, seq_len, num_kv_heads, head_dim]. Must have the same seq_len as v.
v (torch.Tensor) – Shape [batch_size, seq_len, num_kv_heads, head_dim]. Must have the same seq_len as k.
out (Optional[torch.Tensor]) – Pre-allocated output buffer. For atomic reduction it must be zero-initialized before being passed in.
sm_scale (Optional[float]) – Per-call override of the softmax scale set at plan() time.
o_scale (Optional[float]) – Output scale applied to the final O before it is written. The CuTe-DSL kernel folds this into the reduction epilogue at no extra cost (no separate post-kernel multiply). Defaults to 1.0.
sinks (Optional[torch.Tensor]) – Contiguous float32 per-head attention sink logits on the query device, shape (num_qo_heads,). When provided, the sink logit is included in the softmax denominator and receives no output value contribution.
lse (Optional[torch.Tensor]) – Pre-allocated float32 buffer of shape (batch_size, q_len_per_req, num_qo_heads) to receive the log-sum-exp (log2 base, matching flashinfer convention). When None (default) the kernel skips the LSE write entirely; otherwise a log2-base LSE variant is lazily compiled on first use (cache hit afterwards).
enable_pdl (bool) – Whether to launch with Programmatic Dependent Launch (PDL). Default True. Set to False to disable PDL when the target device does not support it. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization
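The effect of the sinks argument can be illustrated with a single-head, scalar-value reference: the sink logit enlarges the softmax denominator but contributes no value. This is a sketch of the described semantics only, with a hypothetical function name, and it says nothing about how the kernel implements it.

```python
import math
from typing import List


def attend_with_sink(scores: List[float], values: List[float],
                     sink: float) -> float:
    """Softmax attention for one query with one extra sink logit."""
    m = max(max(scores), sink)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    # The sink joins the denominator only.
    denom = sum(exp_scores) + math.exp(sink - m)
    # It has no value vector, so the numerator is unchanged.
    return sum(e * v for e, v in zip(exp_scores, values)) / denom


with_sink = attend_with_sink([0.0, 0.0], [1.0, 3.0], sink=0.0)
no_sink = attend_with_sink([0.0, 0.0], [1.0, 3.0], sink=float("-inf"))
```

With the sink present, the attention weights over real tokens sum to less than 1, so the output shrinks toward zero relative to the sink-free result.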

Note

Optional arguments after v are accepted positionally for backward compatibility, but that calling convention is deprecated and scheduled for removal in a future release. Pass them by keyword instead. sinks is keyword-only.

class flashinfer.cute_dsl.attention.wrappers.batch_decode.BatchDecodePagedCuteDSLWrapper(float_workspace_buffer: Tensor, use_cuda_graph: bool = False)

PyTorch-facing wrapper for the paged CuTe DSL GQA decode kernel.

__init__(float_workspace_buffer: Tensor, use_cuda_graph: bool = False) → None

Construct a paged-KV CuTe DSL decode wrapper.

Parameters:

float_workspace_buffer (torch.Tensor) – Pre-allocated float32 workspace buffer used by the underlying CuTe DSL kernel for split-K partial reductions. The wrapper does not resize this buffer; the caller is responsible for sizing it for the largest expected batch and page table (see plan()). The buffer’s device determines the device of subsequent kernel launches.
use_cuda_graph (bool) – If True, prepare the wrapper for capture in a CUDA graph so that subsequent run() calls are graph-safe (no host sync, stable workspace pointers). Defaults to False.

plan(indptr: Tensor, indices: Tensor, seq_lens: Tensor, num_qo_heads: int, num_kv_heads: int, head_dim: int, page_size: int, *deprecated_positional_args: Any, **kwargs: Any) → None

Plan paged GQA decode for the given problem.

Parameters:

indptr (torch.Tensor (int32, [batch_size + 1])) – Prefix-sum offsets into indices.
indices (torch.Tensor (int32, [num_pages_total])) – Flat per-sequence virtual page indices.
seq_lens (torch.Tensor (int32, [batch_size])) – Per-sequence KV length in tokens. The kernel reads this directly; callers that have last_page_len instead should use flashinfer.page.get_seq_lens() to convert.
num_qo_heads (int) – Number of query/output heads.
num_kv_heads (int) – Number of key/value heads.
head_dim (int) – Head dimension. Must be a positive multiple of 64.
page_size (int) – KV cache page size. Must be in {8, 16, 32, 64}.
q_data_type (torch.dtype) – Query dtype.
kv_data_type (torch.dtype) – Key/value dtype. Must equal q_data_type.
o_data_type (torch.dtype) – Output dtype.
sm_scale (Optional[float]) – Softmax scale; defaults to 1 / sqrt(head_dim).
kv_splits (Optional[int]) – Threadblocks per sequence (flash decoding). None auto-tunes from the planned shapes and SM count.
reduction (str) – "kernel" (deterministic with workspace), "atomic" (cluster reduction, faster but lower precision), "none" (no flash-decoding split-K; requires kv_splits == 1), or "auto" (picks "none" when kv_splits == 1, else atomic for compatible dtypes, else kernel).
q_len_per_req (int) – Predicted tokens per request (1 for plain decode).
window_left (int) – Sliding-window left bound. None disables the left bound.
window_right (int) – Sliding-window right bound. None disables the right bound. Defaults to 0.
max_kv_len (Optional[int]) – Maximum KV sequence length across the batch. Used to auto-tune kv_splits; pass it explicitly to avoid a GPU->CPU sync.
non_blocking (bool) – Async device copies for the plan-time integer buffers.
precompile_skip_softmax_kernel (bool) – If True, also compile the BLASST skip-softmax variant of the kernel at plan() time, so the first run() call that passes skip_softmax_threshold_scale_factor is fast.
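The relationship between indptr, indices, and seq_lens can be sketched in plain Python. The helpers below are hypothetical: they flatten per-sequence page lists into the documented layout and show the arithmetic for deriving token counts from last_page_len. They are not a reimplementation of flashinfer.page.get_seq_lens(), whose edge-case handling may differ.

```python
from typing import List, Tuple


def flatten_page_tables(
        pages_per_seq: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (indptr, indices) for per-sequence page lists."""
    indptr, indices = [0], []
    for pages in pages_per_seq:
        indices.extend(pages)
        indptr.append(len(indices))  # prefix-sum offset into `indices`
    return indptr, indices


def seq_lens_from_last_page_len(indptr: List[int], last_page_len: List[int],
                                page_size: int) -> List[int]:
    """Full pages contribute page_size tokens; the last page is partial."""
    seq_lens = []
    for i, last in enumerate(last_page_len):
        num_pages = indptr[i + 1] - indptr[i]
        seq_lens.append(0 if num_pages == 0
                        else (num_pages - 1) * page_size + last)
    return seq_lens


indptr, indices = flatten_page_tables([[7, 2], [5, 9, 1]])
seq_lens = seq_lens_from_last_page_len(indptr, [3, 16], page_size=16)
max_kv_len = max(seq_lens)  # computed on the host, no device sync needed
```

Computing max_kv_len on the host, as in the last line, is one way to supply it explicitly and avoid the GPU->CPU sync mentioned above.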

Note

Optional arguments after page_size are accepted positionally for backward compatibility, but that calling convention is deprecated and scheduled for removal in a future release. Pass them by keyword instead. window_left and window_right are keyword-only. The legacy is_causal argument is deprecated; use window_left and window_right instead.

run(q: Tensor, k_cache: Tensor, v_cache: Tensor, *deprecated_positional_args: Any, **kwargs: Any) → Tensor

Run paged GQA decode.

Parameters:

q (torch.Tensor) – [batch_size * q_len_per_req, num_qo_heads, head_dim] or [batch_size, q_len_per_req, num_qo_heads, head_dim]. q_len_per_req is read from q.shape at run time; it does not have to match the value passed to plan() (which is only a compile-time tile-size hint).
k_cache (torch.Tensor) – Logical shape [num_pages, page_size, num_kv_heads, head_dim]. Both NHD-contiguous layouts and HND layouts (presented as a transposed view) are accepted; the kernel handles arbitrary strides as long as head_dim is innermost.
v_cache (torch.Tensor) – Logical shape [num_pages, page_size, num_kv_heads, head_dim]. Both NHD-contiguous layouts and HND layouts (presented as a transposed view) are accepted; the kernel handles arbitrary strides as long as head_dim is innermost.
out (Optional[torch.Tensor]) – Pre-allocated output buffer. For atomic reduction it must be zero-initialized before being passed in.
sm_scale (Optional[float]) – Per-call override of the softmax scale set at plan() time.
o_scale (Optional[float]) – Output scale applied to the final O before it is written. The CuTe-DSL kernel folds this into the reduction epilogue at no extra cost (no separate post-kernel multiply). Defaults to 1.0.
sinks (Optional[torch.Tensor]) – Contiguous float32 per-head attention sink logits on the query device, shape (num_qo_heads,). When provided, the sink logit is included in the softmax denominator and receives no output value contribution.
skip_softmax_threshold_scale_factor (Optional[float]) – BLASST skip-softmax scale factor. The kernel divides this by each request’s KV sequence length to obtain the per-request effective threshold. Must be > 0 when set. None (default) dispatches to the standard kernel; a value triggers lazy compile of the BLASST variant on first use, or hits the precompiled cache if plan(precompile_skip_softmax_kernel=True) was used.
lse (Optional[torch.Tensor]) – Pre-allocated float32 buffer of shape (batch_size, q_len_per_req, num_qo_heads) (or the flat equivalent (batch_size * q_len_per_req, num_qo_heads)) to receive the log-sum-exp (log2 base, matching flashinfer convention). When None (default) the kernel skips the LSE write; otherwise an LSE variant is lazily compiled on first use.
enable_pdl (bool) – Whether to launch with Programmatic Dependent Launch (PDL). Default True. Set to False to disable PDL when the target device does not support it. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization
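To show how a log2-base LSE relates to split-KV (flash decoding) reduction, the sketch below computes attention over two KV splits and merges the partial results. It is a generic reference, not the kernel's reduction code: it assumes the log2-base LSE is log2 of the sum of natural exponentials of the scores, uses scalar values for brevity, and all names are hypothetical.

```python
import math
from typing import List, Tuple


def partial_attention(scores: List[float],
                      values: List[float]) -> Tuple[float, float]:
    """Attention over one KV split; returns (output, log2-base LSE)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    out = sum(e * v for e, v in zip(exps, values)) / total
    lse2 = (m + math.log(total)) / math.log(2.0)  # log2(sum(exp(s)))
    return out, lse2


def merge_splits(partials: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine per-split (output, lse2) pairs into the full result."""
    m = max(lse for _, lse in partials)
    # Each split is weighted by its share of the total softmax mass.
    weights = [2.0 ** (lse - m) for _, lse in partials]
    total = sum(weights)
    out = sum(w * o for w, (o, _) in zip(weights, partials)) / total
    return out, m + math.log2(total)


scores = [0.5, -1.0, 2.0, 0.3]
values = [1.0, 2.0, 3.0, 4.0]
full_out, full_lse = partial_attention(scores, values)
merged_out, merged_lse = merge_splits([
    partial_attention(scores[:2], values[:2]),
    partial_attention(scores[2:], values[2:]),
])
```

The merged output and LSE agree with the single-pass result up to floating-point rounding, which is why the per-split LSE is enough to recombine the partial outputs.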

Note

Optional arguments after v_cache are accepted positionally for backward compatibility, but that calling convention is deprecated and scheduled for removal in a future release. Pass them by keyword instead. sinks is keyword-only.

Block Sparse Attention

CuTe-DSL block-sparse attention forward kernels.

`bsa_attn_fwd`(q, k, v, q2k_block_index, ...)	Forward pass for BSA block-sparse attention (SM100 only).
`bsa_attn_blk64_fwd`(q, k, v, q2k_block_index, ...)	Forward pass for BSA block-sparse attention using the blk64 CUDA C++ kernel (SM100 only).