flashinfer.fp4_quantization.shuffle_matrix_sf_a
- flashinfer.fp4_quantization.shuffle_matrix_sf_a(input_tensor: torch.Tensor, epilogue_tile_m: int, num_elts_per_sf: int = 16)
CUDA implementation of trtllm-gen's shuffleMatrixSfA, with one caveat. shuffleMatrixSfA expects its input in 128x4 layout, applies the same shuffling as shuffleMatrixA, and writes the result in 128x4 layout. This function instead expects its input in linear layout, because the scaling factors in NVFP4 checkpoints are quantized and stored in linear layout. This function does not add padding.
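To make the difference between the two layouts concrete, the following is a minimal pure-Python sketch of where a scale factor at (row, col) would sit in each layout. It is not FlashInfer code: the helper names `linear_offset` and `tiled_128x4_offset` are hypothetical, and the tiled indexing assumes that "128x4 layout" means the 128-row by 4-column tiled arrangement documented for block scaling factors in cuBLAS, which the description above does not spell out.

```python
def linear_offset(row: int, col: int, num_sf_cols: int) -> int:
    """Offset of scale factor (row, col) in a plain row-major (linear) layout."""
    return row * num_sf_cols + col


def tiled_128x4_offset(row: int, col: int, num_sf_cols: int) -> int:
    """Offset of scale factor (row, col) in an assumed 128x4 tiled layout.

    Assumption: each tile covers 128 rows x 4 columns (512 entries), tiles are
    stored row-major over the tile grid, and inside a tile the 128 rows are
    split into 4 groups of 32 whose entries are interleaved.
    """
    tiles_per_row = (num_sf_cols + 3) // 4  # number of tile columns, rounded up
    tile_index = (row // 128) * tiles_per_row + (col // 4)
    within_tile = (row % 32) * 16 + ((row % 128) // 32) * 4 + (col % 4)
    return tile_index * 512 + within_tile


if __name__ == "__main__":
    num_sf_cols = 8  # e.g. K = 128 elements per row with num_elts_per_sf = 16
    for row, col in [(0, 0), (1, 0), (32, 0), (0, 4), (128, 0)]:
        print((row, col),
              linear_offset(row, col, num_sf_cols),
              tiled_128x4_offset(row, col, num_sf_cols))
```

Under these assumptions, neighbouring rows that are adjacent in the linear layout end up 16 entries apart inside a tile, which is why the two layouts cannot be used interchangeably as input.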