flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4

flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4(x: Tensor, x_sf: Tensor, token_selected_experts: Tensor, token_final_scales: Tensor, w1_weight: Tensor, w1_weight_sf: Tensor, w1_alpha: Tensor, fc2_input_scale: Tensor, w2_weight: Tensor, w2_weight_sf: Tensor, w2_alpha: Tensor, num_experts: int, top_k: int, num_local_experts: int | None = None, local_expert_offset: int = 0, output_dtype: dtype = torch.bfloat16, use_fused_finalize: bool = True, moe_output: Tensor | None = None, aux_stream: Stream | None = None, enable_pdl: bool = True) → Tensor

Run a fused MoE forward pass using the CuTe-DSL NVFP4 kernels.

Supported architectures: SM100, SM103. This is the simple functional API; for CUDA-graph support use CuteDslMoEWrapper instead.

Auto-tuning is controlled by the autotune() context manager:

with autotune(True):
    output = cute_dsl_fused_moe_nvfp4(...)

Parameters:
  • x (torch.Tensor) – NVFP4-quantized input of shape [num_tokens, hidden_size // 2].

  • x_sf (torch.Tensor) – Scale factors for x.

  • token_selected_experts (torch.Tensor) – Expert assignments of shape [num_tokens, top_k].

  • token_final_scales (torch.Tensor) – Routing weights of shape [num_tokens, top_k].

  • w1_weight (torch.Tensor) – GEMM1 weights (gate + up fused).

  • w1_weight_sf (torch.Tensor) – Scale factors for w1_weight.

  • w1_alpha (torch.Tensor) – Per-expert global scale for GEMM1.

  • fc2_input_scale (torch.Tensor) – Global scale for GEMM2 input quantization.

  • w2_weight (torch.Tensor) – GEMM2 weights (down projection).

  • w2_weight_sf (torch.Tensor) – Scale factors for w2_weight.

  • w2_alpha (torch.Tensor) – Per-expert global scale for GEMM2.

  • num_experts (int) – Total number of experts.

  • top_k (int) – Number of experts each token is routed to.

  • num_local_experts (Optional[int]) – Number of experts held locally under expert parallelism. Defaults to num_experts.

  • local_expert_offset (int) – Offset of local experts in the global expert space. Defaults to 0.

  • output_dtype (torch.dtype) – Output dtype. Defaults to torch.bfloat16.

  • use_fused_finalize (bool) – Whether to use the fused finalize path. Defaults to True.

  • moe_output (Optional[torch.Tensor]) – Pre-allocated output buffer. Allocated internally if None.

  • aux_stream (Optional[torch.cuda.Stream]) – Auxiliary CUDA stream used to overlap setup work with the main computation.

  • enable_pdl (bool) – Enable Programmatic Dependent Launch. Defaults to True.
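
The two expert-parallel parameters, num_local_experts and local_expert_offset, can be derived per rank roughly as in the sketch below. It assumes experts are split into equal, contiguous blocks across ranks, which this reference does not state. The names ep_size, ep_rank, local_expert_range, and is_local are hypothetical helpers for illustration and are not part of FlashInfer.

```python
def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> tuple:
    """Return (num_local_experts, local_expert_offset) for one rank.

    Assumption: experts are divided evenly into contiguous blocks,
    one block per expert-parallel rank.
    """
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError("this sketch needs num_experts divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    num_local_experts = num_experts // ep_size
    local_expert_offset = ep_rank * num_local_experts
    return num_local_experts, local_expert_offset


def is_local(expert_id: int, num_local_experts: int, local_expert_offset: int) -> bool:
    """True if a global expert id falls inside this rank's block."""
    return local_expert_offset <= expert_id < local_expert_offset + num_local_experts


# Rank 2 of 4 with 8 global experts owns global experts 4 and 5.
num_local, offset = local_expert_range(num_experts=8, ep_size=4, ep_rank=2)
```

With a single rank (ep_size=1) the helper yields num_local_experts equal to num_experts and an offset of 0, which matches the documented defaults.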

Returns:

Output tensor of shape [num_tokens, hidden_size].

Return type:

torch.Tensor
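
To make the documented shapes concrete, the sketch below shows a conceptual, unquantized reference for how token_selected_experts and token_final_scales combine expert outputs into one row per token. It uses plain Python lists rather than tensors and is not the kernel's implementation: it assumes the usual MoE convention of a routing-weighted sum over the top_k selected experts, and it assumes the `hidden_size // 2` input width comes from packing two 4-bit values per stored element. The helper names moe_combine_reference and packed_input_shape are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def packed_input_shape(num_tokens: int, hidden_size: int) -> tuple:
    """Shape of x as documented: [num_tokens, hidden_size // 2]."""
    return (num_tokens, hidden_size // 2)


def moe_combine_reference(
    tokens: Sequence[Vector],
    token_selected_experts: Sequence[Sequence[int]],
    token_final_scales: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Vector], Vector]],
) -> List[Vector]:
    """Weighted sum of the selected experts' outputs, one row per token.

    Assumption: each expert maps a hidden_size vector to a hidden_size
    vector, so the result has shape [num_tokens, hidden_size].
    """
    output = []
    for x, expert_ids, scales in zip(tokens, token_selected_experts, token_final_scales):
        acc = [0.0] * len(x)
        for expert_id, scale in zip(expert_ids, scales):
            y = experts[expert_id](x)
            acc = [a + scale * v for a, v in zip(acc, y)]
        output.append(acc)
    return output


# Toy setup: expert 0 is the identity, expert 1 multiplies by 10.
toy_experts = [
    lambda v: [1.0 * e for e in v],
    lambda v: [10.0 * e for e in v],
]
out = moe_combine_reference(
    tokens=[[1.0, 2.0]],
    token_selected_experts=[[0, 1]],
    token_final_scales=[[0.5, 0.5]],
    experts=toy_experts,
)
```

In this toy case num_tokens is 1, top_k is 2, and hidden_size is 2, so out has one row of two values.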