flashinfer.quantization.shuffle_matrix_sf_a

flashinfer.quantization.shuffle_matrix_sf_a(input_tensor: Tensor, epilogue_tile_m: int, num_elts_per_sf: int = 16)

CUDA implementation of TRT-LLM-gen shuffleMatrixSfA for scale factors (SF) stored in linear layout.

Unlike the upstream shuffleMatrixSfA, which both reads and writes the 128x4 layout, this routine expects input_tensor in the linear layout used by quantized NVFP4 checkpoints and writes its result in the 128x4 layout. It does not pad the input.

Parameters:
  • input_tensor (torch.Tensor) – Scale-factor tensor in linear layout.

  • epilogue_tile_m (int) – Epilogue tile size along the M dimension; determines the row permutation.

  • num_elts_per_sf (int) – Number of elements per scale-factor vector. Defaults to 16.

Returns:

Row-shuffled scale-factor tensor, re-interleaved into the 128x4 layout.

Return type:

torch.Tensor
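
Example:

The following is a minimal sketch of how the call might be prepared, not a reference implementation. It assumes that "linear layout" means one scale factor per num_elts_per_sf consecutive elements along K, stored row-major as (M, K // num_elts_per_sf). The helpers linear_sf_shape and shuffle_checkpoint_sf are hypothetical names introduced for illustration, and epilogue_tile_m=128 is an assumed value that should be matched to the target kernel.

```python
def linear_sf_shape(m: int, k: int, num_elts_per_sf: int = 16) -> tuple[int, int]:
    """Shape of a linear-layout scale-factor tensor for an (m, k) matrix.

    Assumption: one scale factor covers num_elts_per_sf consecutive
    elements along K, stored row-major as (m, k // num_elts_per_sf).
    """
    if k % num_elts_per_sf != 0:
        raise ValueError("k must be a multiple of num_elts_per_sf")
    return (m, k // num_elts_per_sf)


def shuffle_checkpoint_sf(sf_linear, epilogue_tile_m: int = 128):
    """Hypothetical wrapper around shuffle_matrix_sf_a.

    sf_linear is assumed to be a CUDA torch.Tensor of scale factors in
    linear layout. epilogue_tile_m=128 is an assumed default, not a value
    taken from the documentation. The import is deferred so that this
    module loads without flashinfer installed.
    """
    from flashinfer.quantization import shuffle_matrix_sf_a

    return shuffle_matrix_sf_a(sf_linear, epilogue_tile_m, num_elts_per_sf=16)


# Expected linear-layout shape for a 256 x 512 weight matrix with 16
# elements per scale factor.
print(linear_sf_shape(256, 512))  # (256, 32)
```

Checking the shape before the call helps catch a checkpoint whose scale factors are already in the 128x4 layout or are padded, neither of which this routine expects.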