flashinfer.activation.silu_and_mul_scaled_nvfp4_experts_quantize
- flashinfer.activation.silu_and_mul_scaled_nvfp4_experts_quantize(a, mask, a_global_sf)
Fused SiLU + mul + per-expert NVFP4 quantization with a per-row mask.
Used by mixture-of-experts pipelines to fuse the SiLU-gated activation of each expert with NVFP4 quantization, applying mask to skip rows that do not belong to the current expert.
- Parameters:
a (torch.Tensor) – Input tensor of shape [B, M, K] with dtype fp16/bf16.
mask (torch.Tensor) – Mask tensor applied before quantization (typically the expert-assignment mask).
a_global_sf (torch.Tensor) – Global scale factor of shape [1] with dtype float32.
- Returns:
(x_q, sf), where x_q has logical shape [M, K/2, B] with dtype FLOAT4_E2M1X2 (the implementation permutes the [B, M, K/2] physical layout so that the batch dim is last, matching the grouped-GEMM expectation) and sf is the 6D swizzled scale-factor tensor of logical shape [32, 4, padded_M // 128, 4, padded_K // 64, B] viewed as float8_e4m3fn. padded_M rounds M up to a multiple of 128 and padded_K rounds K // sf_vec_size up to a multiple of 4. Here sf_vec_size is fixed at 16 (NVFP4), matching flashinfer.quantization.nvfp4_quantize().
- Return type:
Tuple[torch.Tensor, torch.Tensor]
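The padding rules stated under Returns can be checked with plain integer arithmetic. The sketch below is illustrative only: round_up and padded_dims are hypothetical helper names, not part of FlashInfer, and it assumes K is divisible by sf_vec_size. It reproduces only the two rounding rules (M to a multiple of 128, K // sf_vec_size to a multiple of 4) and does not call the kernel or build the scale-factor tensor.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def padded_dims(m: int, k: int, sf_vec_size: int = 16) -> tuple:
    """Return (padded_M, padded_K) per the rules in the Returns section.

    Hypothetical helper for illustration; assumes k % sf_vec_size == 0.
    """
    padded_m = round_up(m, 128)               # M rounded up to a multiple of 128
    padded_k = round_up(k // sf_vec_size, 4)  # K // 16 rounded up to a multiple of 4
    return padded_m, padded_k


print(padded_dims(200, 4096))  # (256, 256)
print(padded_dims(128, 96))    # (128, 8)
```

In the second call, K // sf_vec_size is 6, which is not a multiple of 4, so padded_K becomes 8. M is already a multiple of 128 and is left unchanged.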