v0.2.2rc1
Hi everyone, KTransformers has been updated to v0.2.2. You can try it now by compiling from source on the GitHub repository; the release packages and Docker images are also being built and uploaded, so stay tuned! The main updates in this release include:
- #659 Simplified MMLU Test Script and Scores: Quantization may affect model capabilities. In our MMLU tests, the Marlin + Q4KM configuration scored 81, a slight drop from the original full-precision score of 81.6. These tests are still preliminary and should be taken as reference only; see the Benchmark Documentation for details. A minimal sketch of such an evaluation loop is shown below.
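For intuition, here is a minimal sketch of what an MMLU-style accuracy loop looks like against an OpenAI-compatible endpoint such as the one KTransformers can serve. This is not the #659 script: the URL, port, model name, and prompt format are all illustrative assumptions.

```python
# Illustrative MMLU-style accuracy loop (NOT the repo's test script).
# Assumes a local OpenAI-compatible endpoint; adjust URL/model as needed.
import json
import urllib.request

API_URL = "http://localhost:10002/v1/chat/completions"  # hypothetical port

def ask(question: str, choices: list[str]) -> str:
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ) + "\nAnswer with a single letter."
    body = json.dumps({
        "model": "DeepSeek-V3",  # assumed model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(API_URL, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
    return answer.strip()[:1].upper()  # take the first letter of the reply

def accuracy(samples):  # samples: [(question, choices, correct_letter), ...]
    hits = sum(ask(q, c) == gold for q, c, gold in samples)
    return hits / len(samples)
```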
- #643 FP8 Kernel for Enhanced Precision and Performance: Model quantization and the weight-loading method in v0.2.1 introduced some precision loss. v0.2.2 adds GPU-accelerated FP8 Triton kernels, offering higher precision while maintaining performance: the MMLU score for FP8 + Q4KM improved to 81.5 with negligible performance impact. We also provide corresponding weight-packing scripts, and further optimizations for flexible, efficient weight loading will follow. A rough sketch of the dequantization idea appears below.
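As a rough illustration of the technique (not the actual kernels shipped in v0.2.2), a per-group FP8-to-BF16 dequantization in Triton could look like the sketch below; the group size, flat tensor layout, and function names are assumptions.

```python
# Sketch of a per-group FP8 -> BF16 dequantization kernel in Triton.
# Illustrative only: the real v0.2.2 kernels and weight layout may differ.
import torch
import triton
import triton.language as tl

@triton.jit
def fp8_dequant_kernel(w_ptr, scale_ptr, out_ptr, n_elements,
                       GROUP: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    w = tl.load(w_ptr + offs, mask=mask).to(tl.float32)  # fp8 -> fp32
    s = tl.load(scale_ptr + offs // GROUP, mask=mask)    # one scale per group
    tl.store(out_ptr + offs, (w * s).to(tl.bfloat16), mask=mask)

def dequant_fp8(w: torch.Tensor, scales: torch.Tensor, group: int = 128):
    out = torch.empty_like(w, dtype=torch.bfloat16)
    n = w.numel()
    grid = (triton.cdiv(n, 1024),)
    fp8_dequant_kernel[grid](w, scales, out, n, GROUP=group, BLOCK=1024)
    return out
```

Doing the dequantization in a fused GPU kernel is what lets higher-precision FP8 weights be used without giving up throughput.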
- #657 Longer Context Support and Efficient FlashInfer MLA Operator:
  - Under 24 GB of VRAM, the supported context length increases from 8K tokens in v0.2.1 to up to 25K tokens (varying by use case), with further optimizations planned.
  - Optimized the DeepSeek-V3/R1 prefill phase so that VRAM usage now scales linearly with context length (see the arithmetic sketch after this list).
  - Added support for matrix absorption during prefill (trading some speed for reduced KV-cache VRAM usage) and for FlashInfer's MLA operator.
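For a sense of why long contexts fit in 24 GB, here is back-of-the-envelope arithmetic for the MLA KV cache, using DeepSeek-V3/R1's published dimensions (61 layers, kv_lora_rank 512, RoPE head dim 64). Treat it as a lower bound: weights and activations occupy VRAM too.

```python
# Back-of-the-envelope MLA KV-cache size for DeepSeek-V3/R1.
# With MLA, each token caches one compressed latent (kv_lora_rank)
# plus the decoupled RoPE key (qk_rope_head_dim) per layer.
def mla_kv_cache_gib(context_len: int,
                     n_layers: int = 61,
                     kv_lora_rank: int = 512,
                     rope_head_dim: int = 64,
                     bytes_per_elem: int = 2) -> float:  # bf16/fp16
    per_token = n_layers * (kv_lora_rank + rope_head_dim) * bytes_per_elem
    return context_len * per_token / 2**30

print(f"{mla_kv_cache_gib(25_000):.2f} GiB")  # ~1.64 GiB at 25K tokens
```

The fixed per-token cost is exactly the linear scaling with context length noted above.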
- Chunk Prefill Optimization will soon be merged into the main branch to improve VRAM efficiency and performance in long-context scenarios; a conceptual sketch of the idea follows.
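Conceptually, chunk prefill processes the prompt in fixed-size pieces, as in the sketch below. This assumes a HuggingFace-style model interface; the names are illustrative, not the upcoming KTransformers API.

```python
# Conceptual sketch of chunked prefill (illustrative, not the actual API).
# Feeding the prompt in fixed-size chunks bounds peak activation memory
# by CHUNK rather than the full prompt length; the KV cache still grows
# linearly as chunks are appended.
import torch

CHUNK = 2048  # assumed chunk size

@torch.no_grad()
def chunked_prefill(model, input_ids: torch.Tensor):
    past = None
    for start in range(0, input_ids.shape[1], CHUNK):
        out = model(input_ids[:, start:start + CHUNK],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
    # logits for the last prompt token seed the first decode step
    return out.logits[:, -1, :], past
```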
Feel free to explore these updates and share your feedback!