flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4
- flashinfer.fused_moe.cute_dsl_fused_moe_nvfp4(x: Tensor, x_sf: Tensor, token_selected_experts: Tensor, token_final_scales: Tensor, w1_weight: Tensor, w1_weight_sf: Tensor, w1_alpha: Tensor, fc2_input_scale: Tensor, w2_weight: Tensor, w2_weight_sf: Tensor, w2_alpha: Tensor, num_experts: int, top_k: int, num_local_experts: int | None = None, local_expert_offset: int = 0, output_dtype: dtype = torch.bfloat16, use_fused_finalize: bool = True, moe_output: Tensor | None = None, aux_stream: Stream | None = None, enable_pdl: bool = True) → Tensor
Run a fused MoE forward pass using the CuTe-DSL NVFP4 kernels.
Supported architectures: SM100, SM103. This is the simple functional API; for CUDA-graph support use
CuteDslMoEWrapper instead.

Auto-tuning is controlled by the autotune() context manager:

    with autotune(True):
        output = cute_dsl_fused_moe_nvfp4(...)
- Parameters:
x (torch.Tensor) – NVFP4-quantized input of shape [num_tokens, hidden_size // 2].
x_sf (torch.Tensor) – Scale factors for x.
token_selected_experts (torch.Tensor) – Expert assignments of shape [num_tokens, top_k].
token_final_scales (torch.Tensor) – Routing weights of shape [num_tokens, top_k].
w1_weight (torch.Tensor) – GEMM1 weights (gate + up fused).
w1_weight_sf (torch.Tensor) – Scale factors for w1_weight.
w1_alpha (torch.Tensor) – Per-expert global scale for GEMM1.
fc2_input_scale (torch.Tensor) – Global scale for GEMM2 input quantization.
w2_weight (torch.Tensor) – GEMM2 weights (down projection).
w2_weight_sf (torch.Tensor) – Scale factors for w2_weight.
w2_alpha (torch.Tensor) – Per-expert global scale for GEMM2.
num_experts (int) – Total number of experts.
top_k (int) – Number of experts routed to per token.
num_local_experts (Optional[int]) – Local experts for expert parallelism. Defaults to num_experts.
local_expert_offset (int) – Offset of local experts in the global expert space. Defaults to 0.
output_dtype (torch.dtype) – Output dtype. Defaults to torch.bfloat16.
use_fused_finalize (bool) – Whether to use the fused finalize path. Defaults to True.
moe_output (Optional[torch.Tensor]) – Pre-allocated output buffer. Allocated internally if None.
aux_stream (Optional[torch.cuda.Stream]) – Optional auxiliary CUDA stream used to overlap setup work with the main computation.
enable_pdl (bool) – Enable Programmatic Dependent Launch. Defaults to True.
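The interplay of num_experts, num_local_experts and local_expert_offset can be illustrated with a small pure-Python sketch. It assumes that each rank owns one contiguous block of global expert ids starting at local_expert_offset, which is one plausible reading of the parameter descriptions and not something this page states outright. Both helper names (local_expert_range, split_experts_evenly) are hypothetical and are not part of flashinfer.

```python
def local_expert_range(num_experts, num_local_experts=None, local_expert_offset=0):
    """Return the global expert ids owned by one rank.

    Assumption: ownership is a contiguous block
    [local_expert_offset, local_expert_offset + num_local_experts).
    """
    if num_local_experts is None:
        # Documented default: every expert is local (no expert parallelism).
        num_local_experts = num_experts
    end = local_expert_offset + num_local_experts
    if local_expert_offset < 0 or num_local_experts < 0 or end > num_experts:
        raise ValueError("local expert block falls outside [0, num_experts)")
    return range(local_expert_offset, end)


def split_experts_evenly(num_experts, ep_size):
    """Hypothetical helper: (num_local_experts, local_expert_offset) per rank."""
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return [(per_rank, rank * per_rank) for rank in range(ep_size)]
```

With 8 experts split over 4 ranks under this assumption, each rank would pass num_local_experts=2 and an offset of 0, 2, 4 or 6 respectively, while num_experts stays 8 on every rank.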
- Returns:
Output tensor of shape [num_tokens, hidden_size].
- Return type:
torch.Tensor
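To make the roles of token_selected_experts and token_final_scales concrete, the following pure-Python sketch shows the conventional top-k MoE combine step: each token's output row is the routing-weighted sum of the outputs of its selected experts. This is a conceptual reference only, assuming the usual MoE semantics. It does not model NVFP4 quantization, the gate/up and down GEMMs, or how the fused finalize path is actually implemented, and combine_expert_outputs and the expert_outputs layout are hypothetical names introduced for illustration.

```python
def combine_expert_outputs(expert_outputs, token_selected_experts, token_final_scales):
    """Weighted sum of per-expert outputs for every token.

    expert_outputs[e][t] is the hidden_size-long output of expert e for
    token t (hypothetical layout, chosen for readability).
    Returns a list of num_tokens rows, each of length hidden_size.
    """
    output = []
    for t, (experts, weights) in enumerate(
        zip(token_selected_experts, token_final_scales)
    ):
        hidden_size = len(expert_outputs[experts[0]][t])
        row = [0.0] * hidden_size
        # Accumulate the top_k selected experts, scaled by their routing weights.
        for e, w in zip(experts, weights):
            for j, value in enumerate(expert_outputs[e][t]):
                row[j] += w * value
        output.append(row)
    return output


# Toy problem: 2 tokens, 3 experts, top_k = 2, hidden_size = 2.
expert_outputs = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[10.0, 20.0], [30.0, 40.0]],
    2: [[100.0, 200.0], [300.0, 400.0]],
}
token_selected_experts = [[0, 1], [1, 2]]        # shape [num_tokens, top_k]
token_final_scales = [[0.5, 0.5], [0.25, 0.75]]  # shape [num_tokens, top_k]

result = combine_expert_outputs(
    expert_outputs, token_selected_experts, token_final_scales
)
print(result)  # [[5.5, 11.0], [232.5, 310.0]]
```

The result has one row per token and hidden_size columns, matching the documented output shape [num_tokens, hidden_size].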