flashinfer.fp4_quantization
This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.
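To make the FP4 format concrete, here is a small illustrative decoder for the E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit) that underlies these operations. This is a standalone sketch for explanation only, not part of the flashinfer API; the function name is hypothetical.

```python
def e2m1_decode(code):
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit) to float.

    Illustrative helper, not a flashinfer function.
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: the two smallest magnitudes, 0.0 and 0.5.
        return sign * 0.5 * man
    # Normal range: 2^(exp-1) * (1 + mantissa/2).
    return sign * (2.0 ** (exp - 1)) * (1.0 + 0.5 * man)

# The eight non-negative E2M1 magnitudes:
# [e2m1_decode(c) for c in range(8)] -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The largest representable magnitude is 6.0, which is why block scales are chosen so that each block's maximum maps to 6.0 before rounding.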
Core Quantization Functions
- Quantize an input tensor to FP4 format.
- Quantize an input tensor to NVFP4 format.
- Quantize a batched input tensor to NVFP4 format.
- Swizzle a block scale tensor for FP4 format.
- Convert an E2M1-format tensor and UFP8 scale factors to a float tensor.
- Quantize a batched input tensor to NVFP4 format with a mask.
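The core quantize/dequantize operations above can be summarized with a NumPy reference sketch. This assumes the commonly documented NVFP4 convention of one scale per 16-element block, with values rounded to the nearest E2M1 magnitude; the real kernels also store scales in FP8 (UFP8/E4M3), which this sketch omits. Function names here are illustrative, not flashinfer's.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_ref(x, block_size=16):
    """Reference NVFP4 quantization: per-block scale + round-to-nearest E2M1.

    Illustrative only; real kernels additionally quantize the scales to FP8.
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block_size)
    # Scale each block so its max magnitude maps to 6.0, the largest E2M1 value.
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_VALUES[-1]
    scales = np.where(scales == 0, 1.0, scales)
    scaled = x / scales
    # Round each scaled element to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[idx]
    return q, scales

def nvfp4_dequantize_ref(q, scales):
    """Reference dequantization: multiply E2M1 values by their block scale."""
    return q * scales
```

Values that are exactly representable after scaling survive the round trip unchanged; everything else lands on the nearest of the eight E2M1 magnitudes.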
FP4 KV Cache Quantization
GPU-accelerated quantization and dequantization for KV cache data using a linear (non-swizzled) block scale layout.
Hardware requirements:
- nvfp4_kv_dequantize(): SM80+ (Ampere and later)
- nvfp4_kv_quantize(): SM100+ (Blackwell and later)
- nvfp4_kv_quantize(): GPU quantization to NVFP4 KV cache format with a linear block scale layout.
- nvfp4_kv_dequantize(): GPU dequantization of NVFP4 KV cache data with a linear block scale layout.
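The "linear block scale layout" mentioned above means scales are stored in plain row-major order, one per block along the head dimension, with no swizzling. The sketch below shows that layout for a [tokens, head_dim] KV slice, omitting the E2M1 rounding step to focus on where the scales live; the block size of 16 is an assumption based on the NVFP4 convention, and the function names are illustrative, not flashinfer's.

```python
import numpy as np

BLOCK = 16  # elements per scale block along the head dimension (assumed NVFP4 size)

def kv_quantize_linear(kv):
    """Scale a [tokens, head_dim] KV slice per 16-element block, storing the
    scales in a plain row-major ("linear") tensor of shape
    [tokens, head_dim // BLOCK]. E2M1 rounding is omitted for clarity."""
    tokens, head_dim = kv.shape
    blocks = kv.reshape(tokens, head_dim // BLOCK, BLOCK)
    scales = np.abs(blocks).max(axis=-1) / 6.0   # 6.0 = largest E2M1 magnitude
    scales = np.where(scales == 0, 1.0, scales)
    codes = blocks / scales[..., None]           # scaled values now lie in [-6, 6]
    return codes, scales

def kv_dequantize_linear(codes, scales):
    """Invert the per-block scaling; a linear layout needs only broadcasting,
    with no index permutation."""
    out = codes * scales[..., None]
    return out.reshape(out.shape[0], -1)
```

Because no rounding is applied in this sketch, the round trip reconstructs the input exactly; the real kernels add the E2M1 rounding from the core functions above.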
Matrix Shuffling Utilities
- PyTorch equivalent of the trtllm-gen shuffleMatrixA kernel.
- CUDA implementation of trtllm-gen shuffleMatrixSfA, with a caveat.