flashinfer.fused_moe.trtllm_fp4_block_scale_moe

flashinfer.fused_moe.trtllm_fp4_block_scale_moe(routing_logits: Tensor, routing_bias: Tensor | None, hidden_states: Tensor, hidden_states_scale: Tensor | None, gemm1_weights: Tensor, gemm1_weights_scale: Tensor, gemm1_bias: Tensor | None, gemm1_alpha: Tensor | None, gemm1_beta: Tensor | None, gemm1_clamp_limit: Tensor | None, gemm2_weights: Tensor, gemm2_weights_scale: Tensor, gemm2_bias: Tensor | None, output1_scale_scalar: Tensor | None, output1_scale_gate_scalar: Tensor | None, output2_scale_scalar: Tensor | None, num_experts: int, top_k: int, n_group: int | None, topk_group: int | None, intermediate_size: int, local_expert_offset: int, local_num_experts: int, routed_scaling_factor: float | None, routing_method_type: int = 0, do_finalize: bool = True, enable_pdl: bool | None = None, gated_act_type: int = 0, output: Tensor | None = None, tune_max_num_tokens: int = 8192) → List[Tensor]

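Fused mixture-of-experts (MoE) operation with FP4 block-scaled weights. It covers expert routing, the FC1 GEMM with gated activation, and the FC2 GEMM.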

Parameters:

routing_logits (torch.Tensor) – shape [seq_len, num_experts] Input tensor of routing logits. Supports float32, bfloat16.
routing_bias (Optional[torch.Tensor]) – shape [num_experts] Tensor of routing bias. Can be None for some routing methods. Must have the same dtype as routing_logits.
hidden_states (torch.Tensor) – shape [seq_len, hidden_size // 2 if nvfp4 else hidden_size] Tensor of input hidden states. Supports bfloat16, mxfp8, and nvfp4 (packed into uint8).
hidden_states_scale (Optional[torch.Tensor]) – shape [seq_len, hidden_size // (32 if mxfp8, 16 if nvfp4)] Scale tensor of mxfp8 / nvfp4 hidden states. Dtype must be float8.
gemm1_weights (torch.Tensor) – shape [num_experts, 2 * intermediate_size, hidden_size // 2] Tensor of FC1 weights. Dtype must be uint8 (packed fp4)
gemm1_weights_scale (torch.Tensor) – shape [num_experts, 2 * intermediate_size, hidden_size // (32 if mxfp4 else 16)] Scale tensor of FC1 weights. Dtype must be float8.
gemm1_bias (Optional[torch.Tensor]) – shape [num_experts, 2 * intermediate_size] Tensor of FC1 biases. Dtype is float32.
gemm1_alpha (Optional[torch.Tensor]) – shape [num_experts] Tensor of swiglu alpha. Dtype is float32.
gemm1_beta (Optional[torch.Tensor]) – shape [num_experts] Tensor of swiglu beta. Dtype is float32.
gemm1_clamp_limit (Optional[torch.Tensor]) – shape [num_experts] Tensor of swiglu clamp limit. Dtype is float32.
gemm2_weights (torch.Tensor) – shape [num_experts, hidden_size, intermediate_size // 2] Tensor of FC2 weights. Dtype must be uint8 (packed fp4, two values per byte)
gemm2_weights_scale (torch.Tensor) – shape [num_experts, hidden_size, intermediate_size // (32 if mxfp4 else 16)] Scale tensor of FC2 weights. Dtype must be float8.
gemm2_bias (Optional[torch.Tensor]) – shape [num_experts, hidden_size] Tensor of FC2 biases. Dtype is float32.
output1_scale_scalar (Optional[torch.Tensor]) – shape [local_num_experts] Tensor of scaling factors for first layer activation output
output1_scale_gate_scalar (Optional[torch.Tensor]) – shape [local_num_experts] Tensor of scaling factors for first layer gate output
output2_scale_scalar (Optional[torch.Tensor]) – shape [local_num_experts] Tensor of scaling factors for second layer output
num_experts (int) – Total number of experts
top_k (int) – Number of experts to route to per token
n_group (Optional[int]) – Number of expert groups (can be None for some routing methods)
topk_group (Optional[int]) – Number of groups to consider for top-k routing (can be None for some routing methods)
intermediate_size (int) – Size of intermediate layer
local_expert_offset (int) – Offset of local experts in global expert space
local_num_experts (int) – Number of experts handled by this device
routed_scaling_factor (Optional[float]) – Scaling factor for routing (can be None for some routing methods)
routing_method_type (int) – Type of routing method to use (default: 0)
    - 0: Default (Softmax -> TopK)
    - 1: Renormalize (TopK -> Softmax)
    - 2: DeepSeekV3 (Sigmoid -> RoutingBiasAdd -> Top2 in group -> Top4 groups -> Top8 experts)
    - 3: Llama4 (Top1 -> Sigmoid)
    - 4: RenormalizeNaive (Softmax -> TopK -> Renormalize)
do_finalize (bool) – Whether to finalize the output (default: True)
enable_pdl (Optional[bool]) – Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
gated_act_type (int) – Type of gated activation function (default: 0)
    - 0: SwiGlu
    - 1: GeGlu
tune_max_num_tokens (int) – Maximum number of tokens for tuning. (default: 8192)
output (Optional[torch.Tensor]) – shape [seq_len, hidden_size] Optional inplace output tensor.
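
The format-dependent shapes listed above can be cross-checked with a small helper before any tensors are allocated. This is only a sketch: `expected_shapes` and the format labels `"bf16"`, `"mxfp8"`, `"nvfp4"`, and `"mxfp4"` are hypothetical names chosen for illustration and are not part of the flashinfer API. It assumes a scale block size of 32 for mxfp8 and mxfp4 and 16 for nvfp4, as the shape expressions above suggest.

```python
from typing import Dict, Optional, Tuple

Shape = Optional[Tuple[int, ...]]


def expected_shapes(
    seq_len: int,
    hidden_size: int,
    intermediate_size: int,
    num_experts: int,
    act_format: str = "nvfp4",     # hypothetical label: "bf16", "mxfp8" or "nvfp4"
    weight_format: str = "nvfp4",  # hypothetical label: "nvfp4" or "mxfp4"
) -> Dict[str, Shape]:
    """Compute the format-dependent shapes described in the parameter list."""
    if act_format not in ("bf16", "mxfp8", "nvfp4"):
        raise ValueError(f"unknown activation format: {act_format}")
    if weight_format not in ("nvfp4", "mxfp4"):
        raise ValueError(f"unknown weight format: {weight_format}")

    # Assumed block sizes: one scale per 32 elements for MX formats, per 16 for nvfp4.
    w_block = 32 if weight_format == "mxfp4" else 16
    if hidden_size % w_block or intermediate_size % w_block:
        raise ValueError("hidden_size and intermediate_size must be multiples of the block size")

    # nvfp4 activations pack two 4-bit values into each uint8 element.
    hidden_cols = hidden_size // 2 if act_format == "nvfp4" else hidden_size

    # bfloat16 activations carry no scale tensor.
    if act_format == "bf16":
        hidden_scale: Shape = None
    else:
        a_block = 32 if act_format == "mxfp8" else 16
        hidden_scale = (seq_len, hidden_size // a_block)

    return {
        "hidden_states": (seq_len, hidden_cols),
        "hidden_states_scale": hidden_scale,
        "gemm1_weights_scale": (num_experts, 2 * intermediate_size, hidden_size // w_block),
        "gemm2_weights_scale": (num_experts, hidden_size, intermediate_size // w_block),
    }
```

Comparing these tuples against `tensor.shape` before the call can surface a mismatch between the activation and weight formats early.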

Returns:

List of output tensors. If do_finalize=True, returns the final MoE output. Otherwise, returns the intermediate results (gemm2_output, expert_weights, expanded_idx_to_permuted_idx), which need further processing.
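
Because the returned list changes meaning with do_finalize, callers may find it convenient to name its entries. The wrapper below is a minimal sketch: `unpack_moe_result` is a hypothetical helper, and it assumes that the finalized output is the first element of the list and that the intermediate results come in the order given above.

```python
from typing import Any, Dict, List


def unpack_moe_result(result: List[Any], do_finalize: bool) -> Dict[str, Any]:
    """Give names to the entries of the list returned by the MoE call."""
    if do_finalize:
        # Assumption: the final MoE output is the first entry of the list.
        if not result:
            raise ValueError("expected at least one output tensor")
        return {"output": result[0]}

    # Without finalization, three intermediate results are expected.
    if len(result) != 3:
        raise ValueError(f"expected 3 intermediate tensors, got {len(result)}")
    gemm2_output, expert_weights, expanded_idx_to_permuted_idx = result
    return {
        "gemm2_output": gemm2_output,
        "expert_weights": expert_weights,
        "expanded_idx_to_permuted_idx": expanded_idx_to_permuted_idx,
    }
```

The helper only inspects the list length, so it works with tensors or any stand-in objects.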

Return type:

List[torch.Tensor]
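
To illustrate how routing_method_type 0 and 1 differ, the following pure-Python sketch routes a single token both ways and then keeps only the experts owned by the current device. It is a reference illustration, not the kernel's implementation: `route_token` and `local_experts` are hypothetical names, ties are broken by lowest expert index as an assumption, and the local range is assumed to be [local_expert_offset, local_expert_offset + local_num_experts).

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_indices(xs: List[float], k: int) -> List[int]:
    """Indices of the k largest values (ties go to the lower index, an assumption)."""
    return sorted(range(len(xs)), key=lambda i: (-xs[i], i))[:k]


def route_token(logits: List[float], top_k: int,
                routing_method_type: int = 0) -> Tuple[List[int], List[float]]:
    """Return (expert indices, expert weights) for one token."""
    if routing_method_type == 0:
        # Default: softmax over all experts, then pick the top-k probabilities.
        probs = softmax(logits)
        idx = topk_indices(probs, top_k)
        return idx, [probs[i] for i in idx]
    if routing_method_type == 1:
        # Renormalize: pick the top-k logits, then softmax over just those.
        idx = topk_indices(logits, top_k)
        return idx, softmax([logits[i] for i in idx])
    raise NotImplementedError("only methods 0 and 1 are sketched here")


def local_experts(idx: List[int], local_expert_offset: int,
                  local_num_experts: int) -> List[int]:
    """Keep the selected experts that fall in this device's assumed range."""
    end = local_expert_offset + local_num_experts
    return [e for e in idx if local_expert_offset <= e < end]
```

Both methods select the same experts for a given token. Method 1 yields weights that sum to 1 over the selected experts, whereas method 0 keeps the probabilities from the softmax over all experts, which generally sum to less than 1.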