flashinfer.quantization.shuffle_matrix_sf_a
- flashinfer.quantization.shuffle_matrix_sf_a(input_tensor: Tensor, epilogue_tile_m: int, num_elts_per_sf: int = 16)
CUDA implementation of TRT-LLM-gen shuffleMatrixSfA for linear-layout scale factors (SF).
Unlike upstream shuffleMatrixSfA (which both reads and writes the 128x4 layout), this routine expects input_tensor in the linear layout used by quantized NVFP4 checkpoints. No padding is added.
- Parameters:
input_tensor (torch.Tensor) – Scale-factor tensor in linear layout.
epilogue_tile_m (int) – Epilogue tile size along the M dimension; determines the row permutation.
num_elts_per_sf (int) – Number of elements per scale-factor vector. Defaults to 16.
- Returns:
Row-shuffled scale-factor tensor, re-interleaved into the 128x4 layout.
- Return type:
torch.Tensor
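To make the difference between the linear layout and the 128x4 layout concrete, the following pure-Python sketch maps a row-major scale-factor matrix into 128x4 tiles. It is an illustration under stated assumptions, not FlashInfer's implementation: the tile indexing follows the commonly documented 128x4 block-scale convention (128 rows by 4 scale factors per tile, with rows split into 4 groups of 32), the helper names `sf_shape`, `offset_128x4`, and `linear_to_128x4` are hypothetical, and the row permutation controlled by `epilogue_tile_m` is deliberately not modeled. Because the routine adds no padding, the sketch assumes the row count is a multiple of 128 and the scale-factor column count is a multiple of 4.

```python
from typing import List, Tuple


def sf_shape(m: int, k: int, num_elts_per_sf: int = 16) -> Tuple[int, int]:
    """Shape of the linear-layout scale-factor matrix for an m x k operand.

    Assumption: one scale factor covers num_elts_per_sf consecutive
    elements along K.
    """
    if k % num_elts_per_sf != 0:
        raise ValueError("k must be a multiple of num_elts_per_sf")
    return m, k // num_elts_per_sf


def offset_128x4(row: int, col: int, num_sf_cols: int) -> int:
    """Flat offset of scale factor (row, col) in the assumed 128x4 layout."""
    num_k_tiles = num_sf_cols // 4
    m_tile, row_in_tile = divmod(row, 128)   # which 128-row tile, row inside it
    k_tile, inner_k = divmod(col, 4)         # which 4-column tile, column inside it
    outer_m = row_in_tile % 32               # position within a 32-row group
    inner_m = row_in_tile // 32              # which of the 4 row groups
    tile_base = (m_tile * num_k_tiles + k_tile) * 512  # 128 * 4 entries per tile
    return tile_base + outer_m * 16 + inner_m * 4 + inner_k


def linear_to_128x4(sf_linear: List[List[int]]) -> List[int]:
    """Re-interleave a row-major SF matrix into the assumed 128x4 layout.

    Hypothetical helper: no padding and no epilogue-tile row shuffle.
    """
    rows, cols = len(sf_linear), len(sf_linear[0])
    if rows % 128 != 0 or cols % 4 != 0:
        raise ValueError("this sketch assumes rows % 128 == 0 and cols % 4 == 0")
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[offset_128x4(r, c, cols)] = sf_linear[r][c]
    return out
```

In this sketch, a 256 x 64 NVFP4 operand with `num_elts_per_sf=16` yields a 256 x 4 scale-factor matrix, which occupies two 128x4 tiles of 512 entries each. The actual function additionally permutes rows according to `epilogue_tile_m` and runs on the GPU, so its output is not expected to match this sketch element for element.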