flashinfer.fp4_quantization¶
This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.
Core Quantization Functions¶
- Quantize an input tensor to FP4 format.
- Quantize an input tensor to NVFP4 format.
- Quantize a batched input tensor to NVFP4 format.
- Swizzle a block scale tensor for FP4 format.
- Convert an E2M1-format tensor and UFP8 scale factors to a float tensor.
- Quantize a batched input tensor to NVFP4 format with a mask.
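To make the operations above concrete, here is a pure-Python reference sketch of FP4 (E2M1) block quantization and dequantization. This is illustrative only, not the flashinfer API: the block size of 16 elements, the per-block amax scaling, and round-to-nearest are assumptions about the general NVFP4 scheme.

```python
# Illustrative sketch of E2M1 block quantization; NOT the flashinfer API.
# E2M1 (1 sign, 2 exponent, 1 mantissa bit) represents these magnitudes:
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats: per-block scale + nearest E2M1 value.

    Assumption: scale = amax / 6.0, since 6.0 is the largest E2M1 magnitude.
    Returns (scale, quantized values in E2M1 grid units).
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Round to the nearest representable E2M1 magnitude.
        q = min(E2M1_VALUES, key=lambda e: abs(e - mag))
        codes.append(-q if v < 0 else q)
    return scale, codes

def dequantize_block(scale, codes):
    """Inverse of quantize_block: multiply each code by the block scale."""
    return [scale * c for c in codes]
```

In the real kernels the per-block scales are themselves stored in a UFP8 format and laid out in the swizzled order described below; this sketch keeps them as plain floats to show only the rounding logic.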
Matrix Shuffling Utilities¶
- PyTorch equivalent of the trtllm-gen `shuffleMatrixA` kernel.
- CUDA implementation of the trtllm-gen `shuffleMatrixSfA` kernel, with a caveat.
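The shuffling utilities reorder weight rows (and the matching scale factors) into the layout the trtllm-gen epilogue expects. The actual permutation depends on the kernel's tile shape; the sketch below shows only the general mechanism, a row gather by a permutation and its inverse, with an arbitrary permutation standing in for the real one.

```python
# Illustrative row shuffle; the permutation here is an assumption, not the
# real trtllm-gen shuffleMatrixA layout.
def shuffle_rows(matrix, perm):
    """Gather rows of `matrix`: output row i is input row perm[i]."""
    return [matrix[p] for p in perm]

def invert_perm(perm):
    """Inverse permutation, to undo a shuffle."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv
```

Because the shuffle is a pure permutation, applying the inverse permutation recovers the original matrix, which is a useful sanity check when validating a layout.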
Types and Enums¶
- Layout of scale factors for NVFP4.
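A scale-factor layout mainly determines how the per-block scales are padded and tiled in memory. The sketch below computes the padded storage shape of the scale-factor tensor for an `[m, k]` input, under two assumptions that are not read from the flashinfer enum: a 16-element quantization block, and a 128x4 swizzle tile.

```python
# Sketch of scale-factor storage sizing; block size 16 and a 128x4 tile
# are assumptions about the NVFP4 swizzled layout, not flashinfer constants.
def sf_storage_shape(m, k, sf_block=16, tile_m=128, tile_k=4):
    """Padded (rows, cols) of the scale-factor tensor for an [m, k] input."""
    sf_cols = (k + sf_block - 1) // sf_block  # one scale per block of 16
    pad = lambda x, t: ((x + t - 1) // t) * t  # round up to a tile multiple
    return pad(m, tile_m), pad(sf_cols, tile_k)
```

For example, a 64x64 input needs 64/16 = 4 scale columns per row; padding to the assumed 128x4 tile gives a 128x4 scale-factor tensor.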