flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4

flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4(x: Tensor, x_sf: Tensor, token_selected_experts: Tensor, token_final_scales: Tensor, w1_weight: Tensor, w1_weight_sf: Tensor, w1_alpha: Tensor, fc2_input_scale: Tensor, w2_weight: Tensor, w2_weight_sf: Tensor, w2_alpha: Tensor, num_experts: int, top_k: int, num_local_experts: int | None = None, local_expert_offset: int = 0, output_dtype: dtype = torch.bfloat16, use_fused_finalize: bool = True, moe_output: Tensor | None = None, aux_stream: Stream | None = None, enable_pdl: bool = True, activation_type: int = 3, swiglu_alpha: float = 1.0, swiglu_beta: float = 0.0, swiglu_limit: float = 3.4028234663852886e+38, *, per_token_scale: Tensor | None = None) → Tensor

Run a fused MoE forward pass using the CuTe-DSL NVFP4 kernels.

Supported architectures: SM100, SM103. This is the simple functional API; for CUDA-graph support use CuteDslMoEWrapper instead.
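Because only SM100 and SM103 are listed, a caller may want to guard the call before dispatching. The following is a minimal sketch, assuming SM100 and SM103 correspond to CUDA compute capabilities (10, 0) and (10, 3); the helper name is_supported_arch is hypothetical and not part of flashinfer.

```python
# Hypothetical helper, not part of flashinfer.
# Assumption: SM100 -> compute capability (10, 0), SM103 -> (10, 3).
SUPPORTED_CAPABILITIES = {(10, 0), (10, 3)}


def is_supported_arch(major: int, minor: int) -> bool:
    """Return True if (major, minor) is one of the architectures listed above."""
    return (major, minor) in SUPPORTED_CAPABILITIES


# With PyTorch installed, the tuple could come from
# torch.cuda.get_device_capability(), which returns (major, minor).
print(is_supported_arch(10, 0))  # True
print(is_supported_arch(9, 0))   # False
```

Keeping the check as a pure function of the capability tuple makes it usable on machines without a GPU, for example in configuration validation.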

Auto-tuning is controlled by the autotune() context manager:

with autotune(True):
    output = cute_dsl_fused_moe_nvfp4(...)

Parameters:

x (torch.Tensor) – NVFP4-quantized input of shape [num_tokens, hidden_size // 2].
x_sf (torch.Tensor) – Scale factors for x.
token_selected_experts (torch.Tensor) – Expert assignments of shape [num_tokens, top_k].
token_final_scales (torch.Tensor) – Routing weights of shape [num_tokens, top_k].
w1_weight (torch.Tensor) – GEMM1 weights (gate + up fused).
w1_weight_sf (torch.Tensor) – Scale factors for w1_weight.
w1_alpha (torch.Tensor) – Per-expert global scale for GEMM1.
fc2_input_scale (torch.Tensor) – Global scale for GEMM2 input quantization.
w2_weight (torch.Tensor) – GEMM2 weights (down projection).
w2_weight_sf (torch.Tensor) – Scale factors for w2_weight.
w2_alpha (torch.Tensor) – Per-expert global scale for GEMM2.
num_experts (int) – Total number of experts.
top_k (int) – Number of experts each token is routed to.
num_local_experts (Optional[int]) – Number of experts held locally under expert parallelism. Defaults to num_experts.
local_expert_offset (int) – Offset of local experts in the global expert space. Defaults to 0.
output_dtype (torch.dtype) – Output dtype. Defaults to torch.bfloat16.
use_fused_finalize (bool) – Whether to use the fused finalize path. Defaults to True.
moe_output (Optional[torch.Tensor]) – Pre-allocated output buffer. Allocated internally if None.
aux_stream (Optional[torch.cuda.Stream]) – Auxiliary CUDA stream used to overlap setup work with the main computation. Defaults to None.
enable_pdl (bool) – Enable Programmatic Dependent Launch. Defaults to True.
activation_type (int) – FC1 activation type. Use ActivationType.Swiglu for gated SwiGLU and ActivationType.Relu2 for non-gated ReLU^2. The swiglu_oai variant is expressed as ActivationType.Swiglu with non-default swiglu_alpha, swiglu_beta, and swiglu_limit. Defaults to 3.
swiglu_alpha (float) – SwiGLU alpha parameter. Defaults to 1.0.
swiglu_beta (float) – SwiGLU beta parameter. Defaults to 0.0.
swiglu_limit (float) – SwiGLU limit parameter. Defaults to 3.4028234663852886e+38, the largest finite float32 value.
per_token_scale (Optional[torch.Tensor]) – Keyword-only. Per-token input row scale for GEMM1. Passing this enables the per-token activation path. Defaults to None.
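The interaction of num_experts, num_local_experts, and local_expert_offset can be illustrated with a small sketch. It assumes the local experts occupy a contiguous range of global expert IDs starting at local_expert_offset, which this reference suggests but does not state outright; local_expert_range is a hypothetical helper, not a flashinfer function.

```python
from typing import Optional


def local_expert_range(
    num_experts: int,
    num_local_experts: Optional[int] = None,
    local_expert_offset: int = 0,
) -> range:
    """Global expert IDs handled by this rank (hypothetical helper).

    Assumption: local experts occupy the contiguous ID range
    [local_expert_offset, local_expert_offset + num_local_experts).
    """
    if num_local_experts is None:
        num_local_experts = num_experts  # documented default
    end = local_expert_offset + num_local_experts
    if local_expert_offset < 0 or num_local_experts <= 0 or end > num_experts:
        raise ValueError("local expert range falls outside [0, num_experts)")
    return range(local_expert_offset, end)


# 8 experts split evenly over 2 ranks: rank 1 owns experts 4..7.
ep_size, ep_rank, num_experts = 2, 1, 8
num_local = num_experts // ep_size
owned = local_expert_range(num_experts, num_local, ep_rank * num_local)
print(list(owned))  # [4, 5, 6, 7]
```

With the defaults (num_local_experts=None, local_expert_offset=0) the range covers every expert, which matches the single-rank case.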

Returns:

Output tensor of shape [num_tokens, hidden_size].
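The shapes stated in this reference can be collected in one place for pre-call validation. This is a sketch only: expected_shapes is a hypothetical helper, and the explanation that the last dimension of x is halved because two 4-bit values share one byte is an assumption drawn from the hidden_size // 2 shape rather than something this page states. Shapes of the weight and scale-factor tensors are not documented here and are therefore left out.

```python
from typing import Dict, Tuple


def expected_shapes(
    num_tokens: int, hidden_size: int, top_k: int
) -> Dict[str, Tuple[int, int]]:
    """Collect the shapes documented above (hypothetical helper)."""
    if hidden_size % 2 != 0:
        # Assumption: two 4-bit values are packed per byte, so the
        # unpacked hidden size has to be even.
        raise ValueError("hidden_size must be even")
    return {
        "x": (num_tokens, hidden_size // 2),
        "token_selected_experts": (num_tokens, top_k),
        "token_final_scales": (num_tokens, top_k),
        "output": (num_tokens, hidden_size),
    }


shapes = expected_shapes(num_tokens=4, hidden_size=7168, top_k=8)
print(shapes["x"])       # (4, 3584)
print(shapes["output"])  # (4, 7168)
```

Comparing these tuples against tuple(tensor.shape) before the call gives a clearer error than a failure inside the kernel launch.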

Return type:

torch.Tensor