flashinfer.fp4_quantization¶
This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.
Core Quantization Functions¶
fp4_quantize: Quantize input tensor to FP4 format.
nvfp4_quantize: Quantize input tensor to NVFP4 format.
nvfp4_block_scale_interleave: Swizzle block scale tensor for FP4 format.
e2m1_and_ufp8sf_scale_to_float: Convert an E2M1-format tensor and UFP8 scale factors to a float tensor.
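To make the quantization step concrete, here is a minimal pure-Python sketch of the per-block E2M1 math these functions are built around. This is illustrative only, not the flashinfer implementation (which runs as fused CUDA kernels); the block size and rounding policy here are assumptions.

```python
# Representable magnitudes of the E2M1 format (2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, grid=E2M1_GRID):
    """Quantize one block of floats: pick a scale so the block's max magnitude
    maps to 6.0 (the largest E2M1 value), then round each scaled element to
    the nearest grid point. Returns (signed E2M1 values, scale)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate float values from E2M1 codes and the block scale."""
    return [c * scale for c in codes]
```

Values that already lie on the scaled E2M1 grid round-trip exactly; everything else is snapped to the nearest representable point.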
Matrix Shuffling Utilities¶
shuffle_matrix_a: PyTorch equivalent of the trtllm-gen shuffleMatrixA kernel.
shuffle_matrix_sf_a: CUDA implementation of trtllm-gen shuffleMatrixSfA, with a caveat.
Types and Enums¶
SfLayout: Layout of scale factors for NVFP4.
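As a rough guide to what the swizzled scale-factor layout implies for tensor shapes, the sketch below computes the padded scale-factor dimensions for an [m, k] input. The tile sizes (128 rows by 4 columns) and the 16-element scale vector are assumptions based on the common trtllm-gen swizzled layout, not values read from this module's source.

```python
def sf_shape(m, k, sf_vec_size=16, tile_rows=128, tile_cols=4):
    """For an [m, k] tensor, one scale factor covers sf_vec_size elements
    along k; a swizzled layout pads both dimensions up to whole tiles."""
    def pad(n, tile):
        # Round n up to the next multiple of tile.
        return ((n + tile - 1) // tile) * tile
    sf_cols = (k + sf_vec_size - 1) // sf_vec_size  # raw scale factors per row
    return pad(m, tile_rows), pad(sf_cols, tile_cols)
```

For example, a [100, 64] input with 16-element scale vectors needs 4 scale factors per row, and padding then yields a [128, 4] swizzled scale-factor tensor under these assumptions.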