flashinfer.fp4_quantization

This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.

Core Quantization Functions

fp4_quantize(input[, global_scale, ...])

Quantize input tensor to FP4 format.

nvfp4_quantize(a, a_global_sf[, sfLayout, ...])

Quantize input tensor to NVFP4 format.
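As a rough illustration of what block-scaled FP4 quantization involves, the sketch below quantizes a flat list of floats in pure Python. It is not FlashInfer's implementation and does not call the library. It assumes the NVFP4 convention of 16-element blocks and the E2M1 value set, keeps each block scale as a plain Python float instead of rounding it to FP8, and omits the global scale. The names `nearest_e2m1_code` and `quantize_blocks` are hypothetical.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale factor per 16 elements


def nearest_e2m1_code(x):
    """Return the 4-bit E2M1 code whose value is closest to x.

    Ties go to the smaller magnitude here; hardware rounding may differ.
    """
    mag = abs(x)
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return idx | (0x8 if x < 0 else 0x0)  # bit 3 is the sign bit


def quantize_blocks(values, block_size=BLOCK_SIZE):
    """Quantize a flat list of floats into (codes, per-block scales).

    Simplifications (assumptions of this sketch): scales stay as Python
    floats rather than FP8, and no global scale is applied.
    """
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of the block size")
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the largest magnitude in the block onto 6.0, the E2M1 maximum.
        scale = amax / 6.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(nearest_e2m1_code(v / scale) for v in block)
    return codes, scales


codes, scales = quantize_blocks([6.0, -3.0, 0.0, 1.0] + [0.0] * 12)
print(codes[:4], scales)
```

Scaling each block by its own maximum is what lets a 4-bit format cover tensors whose magnitude varies widely from block to block.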

nvfp4_block_scale_interleave(unswizzled_sf)

Swizzle block scale tensor for FP4 format.

e2m1_and_ufp8sf_scale_to_float(e2m1_tensor, ...)

Convert E2M1 format tensor and UFP8 scale factors to float tensor.
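To make the E2M1 encoding concrete, here is a minimal pure-Python decoding sketch. It does not call FlashInfer and is not the library's implementation. The helper names are hypothetical, the low-nibble-first packing order is an assumption, and the block scale is passed as an ordinary float rather than an FP8 value.

```python
# E2M1: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
# The low three bits of a code index this table of magnitudes.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_fp4_byte(byte):
    """Split a byte into two 4-bit codes.

    Assumption of this sketch: the first element sits in the low nibble.
    """
    return byte & 0xF, (byte >> 4) & 0xF


def dequantize_packed(packed, block_scale):
    """Decode packed FP4 bytes and multiply by a single block scale."""
    out = []
    for byte in packed:
        lo, hi = unpack_fp4_byte(byte)
        out.append(decode_e2m1(lo) * block_scale)
        out.append(decode_e2m1(hi) * block_scale)
    return out


print(dequantize_packed([0x72, 0xB0], block_scale=0.5))
```

A lookup table is enough here because E2M1 has only sixteen codes, so no bit-level floating-point arithmetic is needed.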

Matrix Shuffling Utilities

shuffle_matrix_a(input_tensor, epilogue_tile_m)

PyTorch equivalent of trtllm-gen shuffleMatrixA.

shuffle_matrix_sf_a(input_tensor, ...[, ...])

CUDA implementation of trtllm-gen shuffleMatrixSfA, with a caveat described in the function's documentation.

Types and Enums

SfLayout(value[, names, module, qualname, ...])

Layout of scale factors for NVFP4.