flashinfer.quantization.block_scale_interleave
- flashinfer.quantization.block_scale_interleave(unswizzled_sf: Tensor) → Tensor
Swizzle a block-scale tensor for FP4 layouts.
Reorders an unswizzled FP4 block-scale tensor to optimize memory access patterns for FP4 GEMM/MoE kernels. The output is padded in the m dimension to a multiple of 128.
- Parameters:
unswizzled_sf (torch.Tensor) – Input scale-factor tensor with dtype uint8 or bfloat16.
- Returns:
1D flattened swizzled scale-factor buffer of shape (num_experts * expert_out_size,), where num_experts is unswizzled_sf.shape[0] for 3D inputs (and 1 otherwise) and expert_out_size is the padded swizzled size returned by _compute_swizzled_layout_sf_size. Note that this is not the same logical shape as unswizzled_sf; downstream FP4 GEMM/MoE kernels consume the flat buffer directly.
- Return type:
torch.Tensor
- Raises:
AssertionError – If the input dtype is not uint8 or bfloat16.
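Because the returned buffer is flat and padded, it can help to estimate its length before allocating or validating downstream buffers. The sketch below is a rough illustration of the return shape described above, not the library's implementation. The helper names round_up and expected_swizzled_numel are hypothetical and are not part of flashinfer. The padding of the m dimension to a multiple of 128 comes from the description above; the padding of the last dimension to a multiple of 4 and the treatment of the last two dimensions as (m, k) are assumptions that should be checked against _compute_swizzled_layout_sf_size.

```python
from typing import Tuple


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def expected_swizzled_numel(
    shape: Tuple[int, ...],
    row_multiple: int = 128,  # m padding, as stated in the docs above
    col_multiple: int = 4,    # ASSUMPTION: not stated in the docs above
) -> int:
    """Hypothetical helper: estimate the length of the flat swizzled buffer.

    `shape` is the shape of the unswizzled scale-factor tensor, either
    (m, k) or (num_experts, m, k).
    """
    if len(shape) == 3:
        num_experts, m, k = shape
    elif len(shape) == 2:
        num_experts = 1
        m, k = shape
    else:
        raise ValueError("expected a 2D or 3D shape")

    # Per-expert padded size: both dimensions are rounded up independently.
    expert_out_size = round_up(m, row_multiple) * round_up(k, col_multiple)
    return num_experts * expert_out_size


# A 2D input with 100 rows is padded to 128 rows.
print(expected_swizzled_numel((100, 8)))
# A 3D input repeats the padded per-expert size for each expert.
print(expected_swizzled_numel((4, 100, 8)))
```

If the assumptions hold, comparing this estimate with the numel() of the tensor returned by block_scale_interleave is a quick sanity check that a caller has not confused the flat buffer with the logical shape of unswizzled_sf.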