flashinfer.fused_moe.trtllm_mxint4_block_scale_routed_moe

flashinfer.fused_moe.trtllm_mxint4_block_scale_routed_moe(topk_ids: Tensor, hidden_states: Tensor, gemm1_weights: Tensor, gemm1_weights_scale: Tensor, gemm1_alpha: Tensor | None, gemm1_beta: Tensor | None, gemm1_clamp_limit: Tensor | None, gemm2_weights: Tensor, gemm2_weights_scale: Tensor, num_experts: int, top_k: int, n_group: int | None, topk_group: int | None, intermediate_size: int, local_expert_offset: int, local_num_experts: int, routed_scaling_factor: float | None, routing_method_type: int = 0, do_finalize: bool = True, enable_pdl: bool | None = None, gemm1_lora_delta: Tensor | None = None, output: Tensor | None = None, tune_max_num_tokens: int = 8192) → List[Tensor]

MxInt4 block-scale MoE with pre-computed routing.

Same FC1/FC2 kernel and LoRA contract as trtllm_mxint4_block_scale_moe(), but the caller supplies pre-computed top-k routing instead of raw routing logits. This skips the routing kernel’s top-k computation and reuses the BF16-routed packed-int32 contract for topk_ids.
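
The packed layout that the topk_ids parameter describes, (expert_id << 16) | (weight_bf16.view(int16)), can be illustrated without a GPU. The following is a minimal pure-Python sketch of a single entry, assuming finite, non-negative routing weights and round-to-nearest-even conversion to bfloat16. The helper names are hypothetical and are not part of flashinfer.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x (round-to-nearest-even).

    Assumes x is finite; NaN and infinity handling is omitted in this sketch.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # bfloat16 is the upper half of the float32 pattern; add a bias so that
    # dropping the lower 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def pack_topk_entry(expert_id: int, weight: float) -> int:
    """Pack one routing entry: expert id in the high 16 bits, weight in the low 16."""
    # Hypothetical range check: keeps the packed value a non-negative int32.
    if not 0 <= expert_id < (1 << 15):
        raise ValueError("expert_id out of range for this sketch")
    return (expert_id << 16) | float_to_bf16_bits(weight)


def unpack_topk_entry(packed: int):
    """Inverse of pack_topk_entry: returns (expert_id, weight)."""
    return (packed >> 16) & 0xFFFF, bf16_bits_to_float(packed & 0xFFFF)


packed = pack_topk_entry(3, 0.5)
print(hex(packed))                # 0x33f00
print(unpack_topk_entry(packed))  # (3, 0.5)
```

In PyTorch, the same expression applied to whole tensors promotes the int16 view to int32 with sign extension, so a weight with its sign bit set would overwrite the expert-id bits. Routing weights are normally non-negative, in which case this does not arise.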

Parameters:

topk_ids (torch.Tensor) – [seq_len, top_k] int32 tensor of packed expert indices and weights: (expert_id << 16) | (weight_bf16.view(int16)).
hidden_states (torch.Tensor) – [seq_len, hidden_size] bfloat16 input activations.
gemm1_weights (torch.Tensor) – [num_experts, 2 * intermediate_size, hidden_size // 2] packed MXINT4 weights, uint8.
gemm1_weights_scale (torch.Tensor) – [num_experts, 2 * intermediate_size, hidden_size // 32] FC1 weight scales, bfloat16.
gemm1_alpha (Optional[torch.Tensor]) – [num_experts] SwiGLU alpha, float32.
gemm1_beta (Optional[torch.Tensor]) – [num_experts] SwiGLU beta, float32.
gemm1_clamp_limit (Optional[torch.Tensor]) – [num_experts] SwiGLU clamp limit, float32.
gemm2_weights (torch.Tensor) – [num_experts, hidden_size, intermediate_size // 2] packed MXINT4 weights, uint8.
gemm2_weights_scale (torch.Tensor) – [num_experts, hidden_size, intermediate_size // 32] FC2 weight scales, bfloat16.
num_experts (int) – Total number of experts.
top_k (int) – Number of experts to route to per token.
n_group (Optional[int]) – Number of expert groups.
topk_group (Optional[int]) – Number of groups to consider for top-k routing.
intermediate_size (int) – FC1/FC2 inner dimension.
local_expert_offset (int) – Offset of local experts in the global expert space.
local_num_experts (int) – Number of experts handled by this device.
routed_scaling_factor (Optional[float]) – Optional output scaling factor.
routing_method_type (int) –
Routing method (default 0). Selects the routing-kernel pipeline; matches flashinfer.tllm_enums.RoutingMethodType.
- 0 Default — Softmax → TopK.
- 1 Renormalize — TopK → Softmax.
- 2 DeepSeekV3 — Sigmoid → RoutingBiasAdd → Top-2 in group → Top-topk_group groups → Top-top_k experts from the selected groups.
- 3 Llama4 — Top-1 → Sigmoid.
- 4 RenormalizeNaive — Softmax → TopK → Renormalize (Qwen3 style).
- 5 TopK — TopK only (no softmax/sigmoid).
- 6 SigmoidRenorm — Sigmoid → TopK → Renormalize (divide by the sum of the top-K weights).
- 7 MiniMax2 — Sigmoid + Bias → TopK → ScaledSumNormalize (routeScale = 1.0, epsilon = 1e-20).
- 8 Sigmoid — Sigmoid → TopK (no renormalization).
- 9 Unspecified — reserved.
do_finalize (bool) – Whether to run the finalize stage (default True).
enable_pdl (Optional[bool]) – Whether to enable Programmatic Dependent Launch.
gemm1_lora_delta (Optional[torch.Tensor]) – Optional MoE LoRA delta of shape [num_tokens, top_k, 2 * intermediate_size], bfloat16, in concatenated gate/up layout. When set, it is added to the FC1 output before SwiGLU, and the post-activation buffer is appended to the returned list.
output (Optional[torch.Tensor]) – Optional in-place output tensor.
tune_max_num_tokens (int) – Maximum number of tokens for autotuning (default 8192).
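
The shapes listed above follow from two packing facts that can be read off the parameter descriptions: two 4-bit values share one uint8 (hence the // 2), and one scale covers a block of 32 values (hence the // 32). As a rough sanity check before calling the kernel, the expected shapes could be derived as in the sketch below. The function name is hypothetical, the divisibility check is an assumption rather than a documented requirement, and num_tokens is assumed to equal seq_len.

```python
def expected_moe_shapes(seq_len, num_experts, top_k, hidden_size, intermediate_size):
    """Return the documented tensor shapes as a dict of tuples (illustrative only)."""
    # Assumption: both dimensions must split into whole 32-element scale blocks.
    if hidden_size % 32 != 0 or intermediate_size % 32 != 0:
        raise ValueError("hidden_size and intermediate_size should be multiples of 32")
    return {
        "topk_ids": (seq_len, top_k),
        "hidden_states": (seq_len, hidden_size),
        # FC1 produces gate and up projections, so its output dim is doubled.
        "gemm1_weights": (num_experts, 2 * intermediate_size, hidden_size // 2),
        "gemm1_weights_scale": (num_experts, 2 * intermediate_size, hidden_size // 32),
        "gemm2_weights": (num_experts, hidden_size, intermediate_size // 2),
        "gemm2_weights_scale": (num_experts, hidden_size, intermediate_size // 32),
        # Assumption: num_tokens in the LoRA delta shape equals seq_len.
        "gemm1_lora_delta": (seq_len, top_k, 2 * intermediate_size),
    }


shapes = expected_moe_shapes(
    seq_len=4, num_experts=8, top_k=2, hidden_size=256, intermediate_size=128
)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

Comparing tuple(tensor.shape) against these entries before the call may surface layout mistakes earlier than a kernel-side error would.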

Returns:

The tensors in the returned list depend on do_finalize and gemm1_lora_delta:

| do_finalize | gemm1_lora_delta | Returned tensors |
| --- | --- | --- |
| `True` | `None` | `[output]` |
| `True` | `Tensor` | `[output, expanded_idx_to_permuted_idx, gemm1_activation_output]` |
| `False` | `None` | `[gemm2_output, expert_weights, expanded_idx_to_permuted_idx]` |
| `False` | `Tensor` | `[gemm2_output, expert_weights, expanded_idx_to_permuted_idx, gemm1_activation_output]` |
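
Because the length and order of the returned list vary with the two flags, callers may find it convenient to attach names to the entries. The helper below is a small sketch that mirrors the table above. The function name is hypothetical and not part of flashinfer, and the placeholder strings in the usage lines stand in for real tensors.

```python
from typing import Dict, Sequence


def name_moe_outputs(
    outputs: Sequence[object], do_finalize: bool, has_lora_delta: bool
) -> Dict[str, object]:
    """Map the returned list to names, following the documented return table."""
    if do_finalize:
        names = ["output"]
        if has_lora_delta:
            names += ["expanded_idx_to_permuted_idx", "gemm1_activation_output"]
    else:
        names = ["gemm2_output", "expert_weights", "expanded_idx_to_permuted_idx"]
        if has_lora_delta:
            names.append("gemm1_activation_output")
    # Fail loudly if the list does not match the combination of flags.
    if len(outputs) != len(names):
        raise ValueError(f"expected {len(names)} tensors, got {len(outputs)}")
    return dict(zip(names, outputs))


# Placeholder values stand in for the tensors returned by the kernel.
named = name_moe_outputs(["t0", "t1", "t2"], do_finalize=False, has_lora_delta=False)
print(named["expert_weights"])  # t1
```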

Return type:

List[torch.Tensor]