flashinfer.fused_moe.bgmv_moe_expand

flashinfer.fused_moe.bgmv_moe_expand(y: Tensor, x: Tensor, w_ptr: Tensor, sorted_token_ids: Tensor, expert_ids: Tensor, topk_weights: Tensor, lora_indices: Tensor, slice_start_loc: Tensor, output_slices: List[int], lora_stride: int, *, finalize: bool = True) → None

MoE LoRA expand operation: projects the shrink output through the LoRA-B matrices.

With finalize=True (default), the projection of each (token, expert) pair is scaled by its routing weight and accumulated into that token's output row:

y[token, col_offset:col_offset+feat] += topk_weight * (x[slice, pair, :] @ lora_b[expert, lora_id])

(y is [num_tokens, total_feat_out] and must be zero-initialized, because results are accumulated into it with +=).
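The accumulation above can be modeled in a few lines of plain Python. This is only a sketch of the formula's semantics, not the kernel and not a call into FlashInfer. The name expand_finalize_reference is hypothetical, and lora_b is a nested list indexed as lora_b[slice][expert][lora_id] that stands in for what the real function reaches through w_ptr and lora_stride. It also assumes the adapter ID is looked up per token, and that pairs with lora_id < 0 contribute nothing in this mode too.

```python
def expand_finalize_reference(x, lora_b, sorted_token_ids, expert_ids,
                              topk_weights, lora_indices, slice_start_loc,
                              output_slices, num_tokens, total_feat_out):
    """Pure-Python model of the finalize=True formula (hypothetical helper)."""
    # y starts at zero because every pair is accumulated with +=.
    y = [[0.0] * total_feat_out for _ in range(num_tokens)]
    for s, (col_offset, feat) in enumerate(zip(slice_start_loc, output_slices)):
        for pair, token in enumerate(sorted_token_ids):
            lora_id = lora_indices[token]
            if lora_id < 0:  # assumption: no adapter means no contribution
                continue
            b = lora_b[s][expert_ids[pair]][lora_id]  # [rank][feat] matrix
            row = x[s][pair]                          # [rank] shrink output
            for j in range(feat):
                acc = sum(row[r] * b[r][j] for r in range(len(row)))
                y[token][col_offset + j] += topk_weights[pair] * acc
    return y


# One slice, rank 2, feat 2. Token 0 is routed to experts 0 and 1;
# token 1 has no adapter (lora_id = -1), so its row stays zero.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
y = expand_finalize_reference(
    x=[[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]],
    lora_b=[[[identity], [doubled]]],
    sorted_token_ids=[0, 0, 1],
    expert_ids=[0, 1, 0],
    topk_weights=[0.5, 0.5, 1.0],
    lora_indices=[0, -1],
    slice_start_loc=[0],
    output_slices=[2],
    num_tokens=2,
    total_feat_out=2,
)
print(y)
```

Token 0 receives 0.5 * [1, 2] from expert 0 plus 0.5 * [6, 8] from expert 1, which shows why the output must start at zero.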

With finalize=False (FC1 LoRA delta), the function writes a per-pair, unweighted result with a plain store, applying no topk_weight and no cross-expert combine:

y[pair, col_offset:col_offset+feat] = (x[slice, pair, :] @ lora_b[expert, lora_id])

(y is [num_pairs, total_feat_out]). Pairs with lora_id < 0 are skipped by an early return and their rows are never written, so the caller must zero-initialize y (for example with torch.zeros) to give those rows a defined value. topk_weights is ignored in this mode but must still be a valid float32 tensor of shape [num_pairs].
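The per-pair store can be modeled the same way in plain Python. Again, this is a sketch of the semantics under stated assumptions rather than the actual kernel: expand_no_finalize_reference is a hypothetical name, lora_b[slice][expert][lora_id] is a nested-list stand-in for the weights reached through w_ptr and lora_stride, and the adapter ID is assumed to be looked up per token.

```python
def expand_no_finalize_reference(x, lora_b, sorted_token_ids, expert_ids,
                                 lora_indices, slice_start_loc,
                                 output_slices, total_feat_out):
    """Pure-Python model of the finalize=False formula (hypothetical helper)."""
    num_pairs = len(sorted_token_ids)
    # The caller-side zero initialization defines the rows of skipped pairs.
    y = [[0.0] * total_feat_out for _ in range(num_pairs)]
    for s, (col_offset, feat) in enumerate(zip(slice_start_loc, output_slices)):
        for pair, token in enumerate(sorted_token_ids):
            lora_id = lora_indices[token]
            if lora_id < 0:  # skipped pair: its row is never written
                continue
            b = lora_b[s][expert_ids[pair]][lora_id]  # [rank][feat] matrix
            row = x[s][pair]                          # [rank] shrink output
            for j in range(feat):
                # Plain store: no routing weight, no accumulation.
                y[pair][col_offset + j] = sum(
                    row[r] * b[r][j] for r in range(len(row)))
    return y


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
y_pairs = expand_no_finalize_reference(
    x=[[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]],
    lora_b=[[[identity], [doubled]]],
    sorted_token_ids=[0, 0, 1],
    expert_ids=[0, 1, 0],
    lora_indices=[0, -1],   # token 1 has no adapter
    slice_start_loc=[0],
    output_slices=[2],
    total_feat_out=2,
)
print(y_pairs)
```

The third pair belongs to a token without an adapter, so its row keeps the zeros it was initialized with. If the buffer had been left uninitialized, that row would be undefined.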

Parameters:

y – Output buffer (zero-initialized). Shape [num_tokens, total_feat_out] when finalize=True, or [num_pairs, total_feat_out] when finalize=False. Float32.
x – Shrink output [num_slices, num_pairs, rank].
w_ptr – Pointer table [num_slices, num_experts] of int64.
sorted_token_ids – Token indices for each pair [num_pairs].
expert_ids – Expert indices for each pair [num_pairs].
topk_weights – Routing weights for each pair [num_pairs]. Float32. (Ignored when finalize=False.)
lora_indices – LoRA adapter ID for each token [num_tokens].
slice_start_loc – Column offset for each slice [num_slices]. Int64.
output_slices – Output feature dimension for each slice.
lora_stride – Stride between LoRA adapters in the weight tensor.
finalize – If True, apply the routing weights and combine per token; if False, store the unweighted per-pair result.
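As an illustration of how slice_start_loc and output_slices relate, the sketch below derives column offsets from the per-slice feature sizes. It assumes the slices are packed back to back in y, which this page does not state. The helper name contiguous_slice_offsets is hypothetical, and in real use the offsets would be passed as an int64 tensor rather than a Python list.

```python
def contiguous_slice_offsets(output_slices):
    """Return (offsets, total_feat_out) for slices packed back to back.

    Assumption: slice i starts where slice i-1 ends.
    """
    offsets, total = [], 0
    for feat in output_slices:
        offsets.append(total)  # column where this slice begins
        total += feat          # next slice starts right after it
    return offsets, total


# Example sizes only: three slices with 64, 16 and 16 output features.
offsets, total_feat_out = contiguous_slice_offsets([64, 16, 16])
print(offsets, total_feat_out)
```

Under that assumption, each slice writes to columns offsets[i] through offsets[i] + output_slices[i] - 1, matching the col_offset:col_offset+feat range in the formulas.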