flashinfer.fp4_quantization
This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.
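To make the FP4 format concrete, here is a small illustrative decoder for the E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit) that underlies these operations. This is a standalone sketch for explanation only, not part of the flashinfer API; the function name is hypothetical.

```python
def e2m1_decode(code):
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit) to float.

    Illustrative helper, not a flashinfer function.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: the two smallest magnitudes, 0.0 and 0.5.
        return sign * 0.5 * man
    # Normal range: 2^(exp-1) * (1 + mantissa/2).
    return sign * (2.0 ** (exp - 1)) * (1.0 + 0.5 * man)

# The eight non-negative E2M1 magnitudes:
# [e2m1_decode(c) for c in range(8)] -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The largest representable magnitude is 6.0, which is why block scales are chosen so that each block's maximum maps to 6.0 before rounding.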
Core Quantization Functions
- Quantize an input tensor to FP4 format.
- Quantize an input tensor to NVFP4 format.
- Quantize a batched input tensor to NVFP4 format.
- Swizzle a block scale tensor for FP4 format.
- Convert an E2M1-format tensor and UFP8 scale factors to a float tensor.
- Quantize a batched input tensor to NVFP4 format with a mask.
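The core quantize/dequantize operations above can be summarized with a NumPy reference sketch. This assumes the commonly documented NVFP4 convention of one scale per 16-element block, with values rounded to the nearest E2M1 magnitude; the real kernels also store scales in FP8 (UFP8/E4M3), which this sketch omits. Function names here are illustrative, not flashinfer's.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_ref(x, block_size=16):
    """Reference NVFP4 quantization: per-block scale + round-to-nearest E2M1.

    Illustrative only; real kernels additionally quantize the scales to FP8.
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)
    # Scale each block so its max magnitude maps to 6.0, the largest E2M1 value.
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_VALUES[-1]
    scales = np.where(scales == 0, 1.0, scales)
    scaled = x / scales
    # Round each scaled element to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[idx]
    return q, scales

def nvfp4_dequantize_ref(q, scales):
    """Reference dequantization: multiply E2M1 values by their block scale."""
    return q * scales
```

Values that are exactly representable after scaling survive the round trip unchanged; everything else lands on the nearest of the eight E2M1 magnitudes.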
FP4 KV Cache Quantization
GPU-accelerated quantization and dequantization for KV cache data using a linear (non-swizzled) block scale layout.
Hardware requirements:
- nvfp4_kv_dequantize(): SM80+ (Ampere and later)
- nvfp4_kv_quantize(): SM100+ (Blackwell and later)
- nvfp4_kv_quantize(): GPU quantization to NVFP4 KV cache format with a linear block scale layout.
- nvfp4_kv_dequantize(): GPU dequantization of NVFP4 KV cache data with a linear block scale layout.
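The "linear block scale layout" mentioned above means scales are stored in plain row-major order, one per block along the head dimension, with no swizzling. The sketch below shows that layout for a [tokens, head_dim] KV slice, omitting the E2M1 rounding step to focus on where the scales live; the block size of 16 is an assumption based on the NVFP4 convention, and the function names are illustrative, not flashinfer's.

```python
import numpy as np

BLOCK = 16  # elements per scale block along the head dimension (assumed NVFP4 size)

def kv_quantize_linear(kv):
    """Scale a [tokens, head_dim] KV slice per 16-element block, storing the
    scales in a plain row-major ("linear") tensor of shape
    [tokens, head_dim // BLOCK]. E2M1 rounding is omitted for clarity."""
    tokens, head_dim = kv.shape
    blocks = kv.reshape(tokens, head_dim // BLOCK, BLOCK)
    scales = np.abs(blocks).max(axis=-1) / 6.0   # 6.0 = largest E2M1 magnitude
    scales = np.where(scales == 0, 1.0, scales)
    codes = blocks / scales[..., None]           # scaled values now lie in [-6, 6]
    return codes, scales

def kv_dequantize_linear(codes, scales):
    """Invert the per-block scaling; a linear layout needs only broadcasting,
    with no index permutation."""
    out = codes * scales[..., None]
    return out.reshape(out.shape[0], -1)
```

Because no rounding is applied in this sketch, the round trip reconstructs the input exactly; the real kernels add the E2M1 rounding from the core functions above.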
Matrix Shuffling Utilities
- PyTorch equivalent of the trtllm-gen shuffleMatrixA kernel.
- CUDA implementation of trtllm-gen shuffleMatrixSfA, with a caveat.