flashinfer.quantization.block_scale_interleave

flashinfer.quantization.block_scale_interleave(unswizzled_sf: Tensor) → Tensor

Swizzle a block-scale tensor for FP4 layouts.

Reorders an unswizzled FP4 block-scale tensor into the swizzled layout consumed by FP4 GEMM/MoE kernels, which optimizes their memory access patterns. The output is padded along the m dimension to a multiple of 128.

Parameters:

unswizzled_sf (torch.Tensor) – Input scale-factor tensor with dtype uint8 or bfloat16.

Returns:

1D flattened swizzled scale-factor buffer of shape (num_experts * expert_out_size,), where num_experts is unswizzled_sf.shape[0] for 3D inputs and 1 otherwise, and expert_out_size is the padded swizzled size per expert returned by _compute_swizzled_layout_sf_size. The result does not have the same logical shape as unswizzled_sf; downstream FP4 GEMM/MoE kernels consume the flat buffer directly.
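The shape rule above can be sketched in plain Python to estimate how long the returned buffer will be. This is an illustrative sketch, not the library's implementation: the helper names round_up and expected_swizzled_numel are hypothetical, the 128-row padding comes from the description above, and the padding of the last dimension to a multiple of 4 is an assumption about what _compute_swizzled_layout_sf_size does.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def expected_swizzled_numel(shape, row_multiple=128, col_multiple=4):
    """Estimate the element count of the flat swizzled buffer (hypothetical helper).

    row_multiple=128 is the documented padding of the m dimension.
    col_multiple=4 is an ASSUMPTION about the padding of the last dimension.
    """
    # 3D inputs are treated as one scale-factor matrix per expert.
    num_experts = shape[0] if len(shape) == 3 else 1
    m, k = shape[-2], shape[-1]
    expert_out_size = round_up(m, row_multiple) * round_up(k, col_multiple)
    return num_experts * expert_out_size


print(expected_swizzled_numel((100, 8)))     # 128 * 8 = 1024
print(expected_swizzled_numel((3, 100, 8)))  # 3 * 1024 = 3072
```

If the assumptions hold, the value should match block_scale_interleave(unswizzled_sf).numel() for an input of the same shape, which makes it a convenient sanity check before handing the buffer to a GEMM/MoE kernel.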

Return type:

torch.Tensor

Raises:

AssertionError – If the input dtype is not uint8 or bfloat16.