flashinfer.sampling
Kernels for LLM sampling.

Fused GPU kernel for category sampling from probabilities.
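
Category (categorical) sampling draws token i with probability probs[i]. The CPU sketch below shows these semantics using inverse-CDF sampling. It assumes the caller passes a uniform number u in [0, 1), and `sample_from_probs_ref` is a hypothetical name, not a flashinfer function.

```python
from itertools import accumulate


def sample_from_probs_ref(probs: list[float], u: float) -> int:
    """Inverse-CDF categorical sampling with a caller-supplied uniform u in [0, 1).

    Hypothetical reference helper, not a flashinfer function.
    """
    target = u * sum(probs)  # scaling by the sum tolerates slightly unnormalized input
    for i, c in enumerate(accumulate(probs)):
        if c > target:
            return i
    # Floating-point round-off can leave target at or above the final cumulative
    # sum; fall back to the last token with non-zero probability.
    return max(i for i, p in enumerate(probs) if p > 0)


print(sample_from_probs_ref([0.1, 0.2, 0.7], 0.25))  # prints 1
```

Passing u explicitly keeps the sketch deterministic and easy to test.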

Fused GPU kernel for top-p sampling (nucleus sampling) from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.
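
The CPU sketch below shows how rejection can stand in for sorting in top-p sampling. It does not reproduce the fused kernel's actual algorithm. It assumes token j belongs to the nucleus exactly when the probability mass of strictly more probable tokens is below top_p. The names `top_p_rejection_sample` and `_sample_above_pivot` are hypothetical.

```python
import random


def _sample_above_pivot(probs: list[float], pivot: float, rng: random.Random) -> int:
    """Inverse-CDF draw restricted to tokens whose probability exceeds pivot."""
    eligible = [i for i, p in enumerate(probs) if p > pivot]
    target = rng.random() * sum(probs[i] for i in eligible)
    acc = 0.0
    for i in eligible:
        acc += probs[i]
        if acc > target:
            return i
    return eligible[-1]  # guard against floating-point round-off


def top_p_rejection_sample(probs: list[float], top_p: float, rng: random.Random) -> int:
    """Sorting-free top-p sampling by rejection (hypothetical CPU sketch).

    Token j is in the nucleus iff the mass of strictly more probable tokens is
    below top_p. A rejected draw proves that every token with probability
    <= probs[j] is outside the nucleus too, so probs[j] becomes the new pivot
    and we redraw from the smaller candidate set.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    pivot = 0.0
    while True:
        j = _sample_above_pivot(probs, pivot, rng)
        mass_above = sum(p for p in probs if p > probs[j])
        if mass_above < top_p:
            return j  # the most probable token always passes, so this terminates
        pivot = probs[j]
```

Each redraw is conditioned on the surviving candidates. As a result, accepted tokens are distributed in proportion to their original probabilities within the nucleus. This matches sort-then-truncate top-p, except at the cut-off, where this rule keeps every token tied at that probability.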

Fused GPU kernel for top-k sampling from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.
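
The same rejection idea carries over to top-k with a different acceptance test: a drawn token is kept if fewer than top_k tokens are strictly more probable. The CPU sketch below assumes this rule. `top_k_rejection_sample` is a hypothetical name and is not the kernel's code.

```python
import random


def top_k_rejection_sample(probs: list[float], top_k: int, rng: random.Random) -> int:
    """Sorting-free top-k sampling by rejection (hypothetical CPU sketch).

    Token j is in the top-k set iff fewer than top_k tokens are strictly more
    probable (tokens tied at the k-th value are all kept). After a rejection,
    every token with probability <= probs[j] is also outside the set, so
    probs[j] becomes the new pivot.
    """
    if top_k < 1:
        raise ValueError("top_k must be >= 1")
    pivot = 0.0
    while True:
        # Inverse-CDF draw restricted to tokens above the current pivot.
        eligible = [i for i, p in enumerate(probs) if p > pivot]
        target = rng.random() * sum(probs[i] for i in eligible)
        acc, j = 0.0, eligible[-1]  # default guards against round-off
        for i in eligible:
            acc += probs[i]
            if acc > target:
                j = i
                break
        n_above = sum(1 for p in probs if p > probs[j])
        if n_above < top_k:
            return j
        pivot = probs[j]
```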

Fused GPU kernel for min_p sampling from probabilities.
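
The sketch below uses the common definition of min-p sampling: keep only tokens whose probability is at least min_p times the largest probability, then sample among the survivors in proportion to their original probabilities. This is a minimal CPU sketch. The definition is assumed, and `min_p_sample_ref` and its explicit uniform u are hypothetical.

```python
def min_p_sample_ref(probs: list[float], min_p: float, u: float) -> int:
    """Hypothetical reference for min-p sampling with a caller-supplied uniform u.

    Tokens below min_p * max(probs) are dropped; the survivors are sampled in
    proportion to their original probabilities (inverse CDF).
    """
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    target = u * sum(kept)  # renormalization folded into the scaled target
    acc = 0.0
    for i, p in enumerate(kept):
        acc += p
        if acc > target:
            return i
    return max(i for i, p in enumerate(kept) if p > 0)  # round-off guard
```

With probs [0.5, 0.3, 0.15, 0.05] and min_p = 0.2, the threshold is 0.1, so the last token can never be drawn. The same u = 0.99 would select it if no filter were applied.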

Fused GPU kernel for top-k and top-p sampling from pre-softmax logits.
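
The sketch below writes the combined filter as a plain CPU reference. It applies a numerically stable softmax, then top-k, then top-p on the renormalized top-k distribution, and returns the distribution to sample from. Applying top-k before top-p is an assumption, as are the helper names. The explicit sort is used only for readability; the fused kernel avoids it.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_top_p_filter_from_logits(logits: list[float], top_k: int, top_p: float) -> list[float]:
    """Hypothetical reference: softmax, then top-k, then top-p (order assumed).

    Returns the filtered, renormalized distribution to sample from.
    """
    probs = softmax(logits)
    # Top-k: keep the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    survivors = order[:top_k]
    mass_k = sum(probs[i] for i in survivors)
    # Top-p on the top-k-renormalized distribution: shortest prefix of the
    # sorted survivors whose cumulative mass reaches top_p.
    nucleus, acc = set(), 0.0
    for i in survivors:
        nucleus.add(i)
        acc += probs[i] / mass_k
        if acc >= top_p:
            break
    total = sum(probs[i] for i in nucleus)
    return [probs[i] / total if i in nucleus else 0.0 for i in range(len(probs))]
```

For logits equal to log([0.5, 0.3, 0.15, 0.05]) with top_k = 3 and top_p = 0.8, only the first two tokens survive, with probabilities 0.625 and 0.375.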

Fused GPU kernel for top-k and top-p sampling from probabilities.

Fused GPU kernel for renormalizing probabilities by top-p thresholding.
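
Renormalization returns a filtered distribution instead of a sample. The sort-free sketch below assumes a token survives top-p thresholding when the mass of strictly more probable tokens is below top_p. `top_p_renorm_ref` is a hypothetical name, and its O(n^2) loop is meant only for illustration.

```python
def top_p_renorm_ref(probs: list[float], top_p: float) -> list[float]:
    """Hypothetical reference for top-p renormalization without sorting.

    Token i survives iff the probability mass of strictly more probable tokens
    is below top_p (top_p > 0 assumed); the rest are zeroed and the survivors
    are rescaled to sum to 1. Quadratic cost, intended only for illustration.
    """
    kept = [p if sum(q for q in probs if q > p) < top_p else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```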

Fused GPU kernel for renormalizing probabilities by top-k thresholding.
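
Top-k thresholding works the same way with a pivot: the k-th largest probability. The sketch below finds that pivot with the standard-library `heapq.nlargest` and keeps every token at or above it. Keeping all tokens tied at the pivot is an assumption of this sketch, and `top_k_renorm_ref` is a hypothetical name.

```python
import heapq


def top_k_renorm_ref(probs: list[float], top_k: int) -> list[float]:
    """Hypothetical reference for top-k renormalization.

    The pivot is the k-th largest probability; tokens at or above it are kept
    (so ties at the pivot may keep more than k tokens), the rest are zeroed,
    and the survivors are rescaled to sum to 1.
    """
    pivot = heapq.nlargest(top_k, probs)[-1]
    kept = [p if p >= pivot else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```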

Fused GPU kernel for masking logits by top-k thresholding.
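
Masking in logit space gives the same result as top-k renormalization in probability space, because softmax preserves order and exp(-inf) is 0. In the sketch below, `top_k_mask_logits_ref` is a hypothetical name, and keeping all logits tied at the pivot is an assumption.

```python
import heapq
import math


def top_k_mask_logits_ref(logits: list[float], top_k: int) -> list[float]:
    """Hypothetical reference: keep logits at or above the k-th largest value and
    set the rest to -inf, so a subsequent softmax gives them zero probability."""
    pivot = heapq.nlargest(top_k, logits)[-1]
    return [x if x >= pivot else -math.inf for x in logits]
```

For example, masking [1.0, 3.0, 2.0, -1.0] with top_k = 2 keeps 3.0 and 2.0 and replaces the other two entries with -inf.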

Fused GPU kernel for speculative sampling in sequence generation (proposed in the paper "Accelerating Large Language Model Decoding with Speculative Sampling"), where the draft model generates a sequence (chain) of tokens for each request.
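
In that paper's scheme, let p be the draft model's distribution and q the target model's. Each draft token x is accepted with probability min(1, q(x)/p(x)). The first rejected token is replaced by a sample from the normalized residual max(0, q - p), and the chain stops there. If every draft token is accepted, one bonus token is drawn from the target's next distribution. The sketch below is a minimal single-request CPU version of that rule. The function names and the list-of-lists layout are assumptions, not flashinfer's batched interface.

```python
import random


def _sample(probs: list[float], rng: random.Random) -> int:
    """Inverse-CDF draw from a non-negative, possibly unnormalized vector."""
    target = rng.random() * sum(probs)
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if acc > target:
            return i
    return max(i for i, p in enumerate(probs) if p > 0)  # round-off guard


def chain_speculative_sample_ref(draft_tokens, draft_probs, target_probs, rng):
    """Hypothetical reference for one request's chain of draft tokens.

    draft_probs[t] is the draft distribution that produced draft_tokens[t];
    target_probs has len(draft_tokens) + 1 rows, the last one used for the
    bonus token. Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for t, x in enumerate(draft_tokens):
        p, q = draft_probs[t], target_probs[t]
        # Accept x with probability min(1, q[x] / p[x]), written without division.
        if rng.random() * p[x] < q[x]:
            out.append(x)
            continue
        # Rejected: resample from the normalized residual max(0, q - p) and stop.
        residual = [max(0.0, qi - pi) for qi, pi in zip(q, p)]
        out.append(_sample(residual, rng))
        return out
    # Every draft token was accepted: draw a bonus token from the target.
    out.append(_sample(target_probs[len(draft_tokens)], rng))
    return out
```

If the draft and target distributions agree, every draft token is accepted. If the target assigns zero probability to a draft token, that token is always rejected and replaced.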