
Releases: kvcache-ai/ktransformers

v0.2.2rc2

01 Mar 14:47

Improve temperature argument support #721
Update to a newer torch version for Docker #732
Fix NUMA CPU distribution #685
Add torch support for MoE #684

v0.2.2rc1

25 Feb 16:32
9c71bcb

Hi everyone, KTransformers has been updated to v0.2.2. You can now try it by compiling from the GitHub repository source code. The release packages and Docker images are also being built and uploaded; stay tuned! The main updates in this release include:

  1. #659 Simplified MMLU Test Script and Scores: Quantization may affect model capabilities. In our MMLU tests, the Marlin quantization + Q4KM score dropped slightly to 81, compared with the original full-precision score of 81.6. Note that these tests are still preliminary and should be treated as reference only. For details, visit the Benchmark Documentation.
  2. #643 FP8 Kernel for Enhanced Precision and Performance: Model quantization and the weight loading method in v0.2.1 led to some precision loss. Version 0.2.2 introduces GPU-accelerated FP8 Triton kernels, offering higher precision while maintaining performance. The MMLU score for FP8+Q4KM improved to 81.5 with negligible performance impact. We also provide corresponding weight packing scripts (a rough dequantization sketch follows this list). Further optimizations for flexible and efficient weight loading will follow.
  3. #657 Longer Context Support and Efficient FlashInfer MLA Operator (a rough KV-cache sizing sketch also follows this list):
    • Under 24GB VRAM, the supported context length has increased from 8K (v0.2.1) to up to 25K tokens (varies by use case), with further optimizations planned.
    • Optimized DeepSeek-V3/R1 model prefill phase: VRAM usage now scales linearly with context length.
    • Added support for matrix absorption during prefill (trading some performance for reduced KV Cache VRAM usage) and FlashInfer's MLA operator.
    • Chunk Prefill Optimization will soon be merged to the main branch to improve VRAM efficiency and performance for long-context scenarios.
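
For item 2, here is a minimal PyTorch sketch of block-wise FP8 weight dequantization. The 128x128 scaling-block layout is an assumption based on DeepSeek-V3's public FP8 checkpoints, and the function name is ours; KTransformers' actual implementation is a Triton GPU kernel, so this only illustrates the reference math.

```python
# Reference math only: dequantize FP8 weights stored with per-128x128-block
# scales back to BF16 (assumed layout; needs a PyTorch build with float8 support).
import torch

BLOCK = 128  # assumed scaling-block size


def dequant_fp8_blockwise(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """w_fp8: (M, N) float8_e4m3fn weights; scale: (M//BLOCK, N//BLOCK) per-block scales."""
    m, n = w_fp8.shape
    w = w_fp8.to(torch.bfloat16).reshape(m // BLOCK, BLOCK, n // BLOCK, BLOCK)
    s = scale.to(torch.bfloat16).reshape(m // BLOCK, 1, n // BLOCK, 1)
    return (w * s).reshape(m, n)


# e.g. dequant_fp8_blockwise(torch.randn(256, 512).to(torch.float8_e4m3fn),
#                            torch.rand(2, 4))
```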
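
For item 3, a rough back-of-envelope for the MLA KV-cache footprint. The figures (61 layers, a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key per token per layer, FP16 cache) are our assumptions based on DeepSeek-V3/R1's published architecture, not numbers from this release.

```python
# Back-of-envelope MLA KV-cache sizing (assumed DeepSeek-V3/R1 figures, FP16 cache).
LAYERS = 61            # transformer layers in DeepSeek-V3/R1
KV_LORA_RANK = 512     # compressed KV latent dimension kept in the cache
ROPE_HEAD_DIM = 64     # decoupled RoPE key stored alongside the latent
BYTES_PER_ELEM = 2     # FP16/BF16

bytes_per_token = LAYERS * (KV_LORA_RANK + ROPE_HEAD_DIM) * BYTES_PER_ELEM  # ~69 KiB

for context in (8_000, 25_000):
    print(f"{context:>6} tokens -> {context * bytes_per_token / 2**30:.2f} GiB KV cache")
# ~0.52 GiB at 8K vs ~1.64 GiB at 25K: with MLA the cache itself stays small, so the
# 24GB budget is mostly spent on GPU-resident weights and prefill activations, which
# is what the linear-scaling prefill and matrix absorption changes above target.
```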

Feel free to explore these updates and share your feedback!

v0.2.1.post1

18 Feb 14:02
09f5c5e
  1. Fix a precision bug introduced in 0.2.1, add MMLU/MMLU-Pro tests, and fix the server #413

v0.2.1

15 Feb 08:35
65d73ea
  1. Updated documentation/README. #307 #316
  2. Added a multi-GPU configuration tutorial. #254
  3. Consolidated the installation guide. #307
  4. Introduced a new Triton MLA kernel for Linux. #294

v0.2.0

10 Feb 06:02
7527619
  1. Support DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs, with 382GB DRAM
  2. Support dual-socket systems

v0.1.4

30 Aug 13:52
022b893

Bug fixes

  1. Fix a bug where ktransformers could not offload a whole layer to the CPU.
  2. Update DeepseekV2's multi-GPU YAML examples to allocate layers evenly.
  3. Update the Dockerfile.
  4. Fix a bug where Qwen2-57B could not be loaded.
  5. Fix #66 by adding uvicorn to the requirements.

v0.1.3

29 Aug 01:36
233bbb8
  1. Support internlm2.5 with a 1M-token prompt under 24GB VRAM and 150GB DRAM (local_chat only)
  2. Decrease DeepseekV2's required VRAM from 20GB to 10GB.
  3. Fix bugs #51 #52 #56

v0.1.2

15 Aug 17:39
77a34c2
  1. Support Windows natively. #4
  2. Support multiple GPUs. #8
  3. Support llamafile as a linear backend.
  4. Support new models: Mixtral 8x7B and 8x22B.
  5. Support q2k, q3k, and q5k dequantization on GPU. #16
  6. Support GitHub Actions to create precompiled packages.
  7. Support shared memory across different operators.
  8. Fix some bugs when building from source. #23

v0.1.1

01 Aug 04:40
5e83bc0
  1. Support precompiled wheel packages for multiple CPU architectures.
  2. Precompiled wheel packages support multiple TORCH_CUDA_ARCH_LIST values, e.g. "8.0;8.6;8.7;8.9".
  3. Test and support Python 3.10.
  4. Add a Dockerfile to build the Docker image.
  5. Update README.md to cover Docker (in progress: uploading the Docker image).
  6. Update version to 0.1.1.

0.1.0

29 Jul 13:19
2562082
  1. Complete the submission information for PyPI.
  2. Support dynamically detecting the client's current environment: if a precompiled package can be installed, download and install it (adapted from flash-attn; a rough sketch of the idea follows).
  3. Modify the installation process in the README.
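
To illustrate item 2, below is a minimal sketch of the flash-attn-style environment probing that decides whether a prebuilt wheel matches the local machine. The tag format, function names, and fallback logic are hypothetical and do not reflect KTransformers' actual packaging code.

```python
# Illustrative only: compose a build tag from the local environment and decide
# whether a prebuilt wheel can be installed, in the flash-attn style.
import platform
import sys

import torch  # already a dependency of the project


def local_build_tag() -> str:
    """Return a tag such as 'cu121-torch2.3-cp310-linux_x86_64' (hypothetical format)."""
    cuda = torch.version.cuda  # e.g. "12.1", or None for CPU-only builds
    cuda_tag = "cu" + cuda.replace(".", "") if cuda else "cpu"
    torch_tag = "torch" + ".".join(torch.__version__.split("+")[0].split(".")[:2])
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    plat_tag = f"{platform.system().lower()}_{platform.machine()}"
    return f"{cuda_tag}-{torch_tag}-{py_tag}-{plat_tag}"


def choose_install(available_tags: set) -> str:
    """Fall back to compiling from source when no prebuilt wheel matches."""
    tag = local_build_tag()
    return f"download prebuilt wheel for {tag}" if tag in available_tags else "build from source"
```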