flashinfer.activation.silu_and_mul_scaled_nvfp4_experts_quantize

flashinfer.activation.silu_and_mul_scaled_nvfp4_experts_quantize(a, mask, a_global_sf)

Fused SiLU + mul + per-expert NVFP4 quantization with a per-row mask.

Used by mixture-of-experts pipelines to fuse each expert's SiLU-gated activation with NVFP4 quantization. The mask argument is applied so that rows that do not belong to the current expert are skipped.
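For orientation, the unfused activation step can be sketched in plain Python. This is a minimal reference that assumes the usual SiLU-gated form silu(gate) * up, with silu(x) = x * sigmoid(x). The names silu, silu_and_mul, gate, and up are illustrative and not part of the flashinfer API, and the sketch does not model how the kernel obtains the two operands from its input, the mask, or the quantization.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x), written to avoid overflow."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    # For negative x, exp(x) stays in (0, 1), so this branch cannot overflow.
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_and_mul(gate: List[float], up: List[float]) -> List[float]:
    """Elementwise silu(gate) * up for two equal-length rows (illustrative names)."""
    if len(gate) != len(up):
        raise ValueError("gate and up must have the same length")
    return [silu(g) * u for g, u in zip(gate, up)]


# One row with two elements: silu(0) * 2 and silu(1) * 2.
row = silu_and_mul([0.0, 1.0], [2.0, 2.0])
```

The fused kernel performs this step together with the quantization in a single pass, which avoids writing the intermediate fp16/bf16 activation back to memory.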

Parameters:
  • a (torch.Tensor) – Input tensor of shape [B, M, K] with dtype fp16/bf16.

  • mask (torch.Tensor) – Mask tensor applied before quantization (typically the expert-assignment mask).

  • a_global_sf (torch.Tensor) – Global scale factor of shape [1] with dtype float32.

Returns:

(x_q, sf), where:

  • x_q has logical shape [M, K/2, B] with dtype FLOAT4_E2M1X2. The implementation permutes the [B, M, K/2] physical layout so that the batch dimension is last, matching the grouped-GEMM expectation.

  • sf is the 6D swizzled scale-factor tensor of logical shape [32, 4, padded_M // 128, 4, padded_K // 64, B], viewed as float8_e4m3fn.

padded_M rounds M up to a multiple of 128, and padded_K rounds K // sf_vec_size up to a multiple of 4. Here sf_vec_size is fixed at 16 (NVFP4), matching flashinfer.quantization.nvfp4_quantize().
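The rounding rules for padded_M and padded_K, and the logical shape of x_q, can be checked with a few lines of integer arithmetic. The sketch below only restates the rules as documented here; round_up, padded_dims, and xq_logical_shape are hypothetical helper names, not flashinfer functions, and no GPU or flashinfer install is needed to run it.

```python
from typing import Tuple

SF_VEC_SIZE = 16  # fixed for NVFP4, per the description above


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def padded_dims(m: int, k: int) -> Tuple[int, int]:
    """Return (padded_M, padded_K) following the documented rounding rules."""
    padded_m = round_up(m, 128)               # M rounded up to a multiple of 128
    padded_k = round_up(k // SF_VEC_SIZE, 4)  # K // sf_vec_size rounded up to a multiple of 4
    return padded_m, padded_k


def xq_logical_shape(b: int, m: int, k: int) -> Tuple[int, int, int]:
    """Logical shape of x_q: two FP4 values per byte, batch dimension last."""
    return (m, k // 2, b)


print(padded_dims(200, 256))          # (256, 16)
print(xq_logical_shape(8, 200, 256))  # (200, 128, 8)
```

For example, with M = 200 and K = 256, M is padded up to 256 rows, and K // 16 = 16 is already a multiple of 4, so padded_K stays at 16.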

Return type:

Tuple[torch.Tensor, torch.Tensor]