flashinfer.fused_moe.trtllm_fp8_block_scale_moe

flashinfer.fused_moe.trtllm_fp8_block_scale_moe(routing_logits: Tensor, routing_bias: Tensor | None, hidden_states: Tensor, hidden_states_scale: Tensor, gemm1_weights: Tensor, gemm1_weights_scale: Tensor, gemm2_weights: Tensor, gemm2_weights_scale: Tensor, num_experts: int, top_k: int, n_group: int | None, topk_group: int | None, intermediate_size: int, local_expert_offset: int, local_num_experts: int, routed_scaling_factor: float | None, routing_method_type: int = 0, use_shuffled_weight: bool = False, weight_layout: int = 0, do_finalize: bool = True, enable_pdl: bool | None = None, tune_max_num_tokens: int = 8192, fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8, activation_type: int = 3, norm_topk_prob: bool = True, routing_replay_out: Tensor | None = None, gemm1_alpha: Tensor | None = None, gemm1_beta: Tensor | None = None, gemm1_clamp_limit: Tensor | None = None) → List[Tensor] | Tensor

FP8 block-scaled MoE operation.

Parameters:
  • routing_logits (torch.Tensor) – [seq_len, num_experts] tensor of routing logits.

  • routing_bias (Optional[torch.Tensor]) – [num_experts] tensor of routing bias.

  • hidden_states (torch.Tensor) – [seq_len, hidden_size] tensor of input hidden states.

  • hidden_states_scale (torch.Tensor) – [hidden_size // 128, seq_len] tensor of hidden-states block scales.

  • gemm1_weights (torch.Tensor) – First-layer weights. [num_experts, M, hidden_size] when weight_layout == WeightLayout.MajorK (0), or [num_experts, M // 128, hidden_size, 128] when weight_layout == WeightLayout.BlockMajorK (2). M is 2 * intermediate_size for gated activations and intermediate_size for non-gated activations.

  • gemm1_weights_scale (torch.Tensor) – [num_experts, 2*intermediate_size // (32 if mxfp8 else 128), hidden_size // (32 if mxfp8 else 128)] first-layer block scales.

  • gemm2_weights (torch.Tensor) – Second-layer weights. [num_experts, hidden_size, intermediate_size] when weight_layout == WeightLayout.MajorK, or [num_experts, hidden_size // 128, intermediate_size, 128] when weight_layout == WeightLayout.BlockMajorK.

  • gemm2_weights_scale (torch.Tensor) – [num_experts, hidden_size // (32 if mxfp8 else 128), intermediate_size // (32 if mxfp8 else 128)] second-layer block scales.

  • num_experts (int) – Total number of experts.

  • top_k (int) – Number of experts to route to per token.

  • n_group (Optional[int]) – Number of expert groups.

  • topk_group (Optional[int]) – Number of groups to consider for top-k routing.

  • intermediate_size (int) – Size of the intermediate layer.

  • local_expert_offset (int) – Offset of local experts in the global expert space.

  • local_num_experts (int) – Number of experts handled by this device.

  • routed_scaling_factor (Optional[float]) – Scaling factor for routing.

  • routing_method_type (int) – Routing method (default 0). See trtllm_bf16_moe().

  • use_shuffled_weight (bool) – Whether to use the shuffled weight layout (default False).

  • weight_layout (int) –

    Weight layout for gemm1_weights / gemm2_weights; matches flashinfer.tllm_enums.WeightLayout. Allowed values for this function depend on fp8_quantization_type: DeepSeekFp8 accepts MajorK or BlockMajorK; MxFp8 requires MajorK. Default 0 (MajorK).

    • 0 MajorK — K-major, logical shape [Mn, K].

    • 1 MajorMn — M-major (A) / N-major (B), logical shape [K, Mn]. Not supported by this function.

    • 2 BlockMajorK — Blocked along K, logical shape [K / blockK, Mn, blockK] (blockK is fixed at 128 B). Only valid when fp8_quantization_type is DeepSeekFp8.

  • do_finalize (bool) – Whether to finalize the output (default True).

  • enable_pdl (Optional[bool]) – Whether to enable Programmatic Dependent Launch. None (default) lets the runtime auto-select on SM90+.

  • tune_max_num_tokens (int) – Maximum number of tokens for autotuning (default 8192).

  • fp8_quantization_type (Fp8QuantizationType) – FP8 quantization scheme (default Fp8QuantizationType.DeepSeekFp8).

  • activation_type (int) – Activation type (default 3 — Swiglu). 3 Swiglu; 4 Geglu; 6 Relu2 (non-gated); 7 Identity.

  • norm_topk_prob (bool) – Whether to normalize the top-k probabilities (default True).

  • routing_replay_out (Optional[torch.Tensor]) – Optional int16 tensor of shape (num_tokens_or_larger, top_k) used to capture the selected expert IDs during routing. Column order matches topk_indices. When None (default) the kernel skips the write entirely. The buffer may be larger than num_tokens for CUDA-graph pre-allocation; only rows [0, num_tokens) are written.

  • gemm1_alpha / gemm1_beta / gemm1_clamp_limit (Optional[torch.Tensor]) – Optional [local_num_experts] float32 per-expert SwiGLU OA parameters. They are currently supported only for Fp8QuantizationType.MxFp8 with ActivationType.Swiglu. Any subset can be provided: gemm1_alpha=None uses alpha=1.0, gemm1_beta=None uses beta=0.0, and gemm1_clamp_limit=None applies no clamp. Let the GEMM1 output be split into X1 (linear/up half) and X2 (gate half). If a clamp limit is provided, X1 = clamp(X1, -limit, limit) and X2 = clamp(X2, max=limit). The fused activation output is X2 * sigmoid(alpha * X2) * (X1 + beta). Pass raw values for MxFp8; no host-side scalar dequant-scale conversion is applied.

Returns:

Final MoE output when do_finalize is True, otherwise [gemm2_output, expert_weights, expanded_idx_to_permuted_idx].

Return type:

torch.Tensor or List[torch.Tensor]
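
The shape and layout rules listed under Parameters are easy to get wrong when preparing weights, so a small pre-flight check may help. The sketch below is a hypothetical helper pair (check_weight_layout and expected_major_k_shapes are not part of flashinfer). It assumes a gated activation, so M = 2 * intermediate_size, and covers only the MajorK layout; it uses plain integers and does not touch any tensors.

```python
from typing import Dict, Tuple

# weight_layout values as listed under Parameters.
MAJOR_K, MAJOR_MN, BLOCK_MAJOR_K = 0, 1, 2


def check_weight_layout(weight_layout: int, mxfp8: bool) -> None:
    """Raise ValueError for layout/quantization pairs the text rules out."""
    if weight_layout == MAJOR_MN:
        raise ValueError("MajorMn is not supported by this function")
    if weight_layout not in (MAJOR_K, BLOCK_MAJOR_K):
        raise ValueError(f"unknown weight_layout: {weight_layout}")
    if mxfp8 and weight_layout != MAJOR_K:
        raise ValueError("MxFp8 requires MajorK")


def expected_major_k_shapes(
    hidden_size: int,
    intermediate_size: int,
    num_experts: int,
    mxfp8: bool = False,
) -> Dict[str, Tuple[int, ...]]:
    """Documented MajorK shapes, assuming a gated activation."""
    block = 32 if mxfp8 else 128      # scale block size per the parameter list
    m = 2 * intermediate_size         # gated activations: M = 2 * intermediate_size
    return {
        "gemm1_weights": (num_experts, m, hidden_size),
        "gemm1_weights_scale": (num_experts, m // block, hidden_size // block),
        "gemm2_weights": (num_experts, hidden_size, intermediate_size),
        "gemm2_weights_scale": (
            num_experts,
            hidden_size // block,
            intermediate_size // block,
        ),
    }
```

Comparing tuple(t.shape) of each prepared tensor against the returned dictionary before the call gives a clearer error than a failure inside the kernel launch.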