- Arxiv-CS-RAG: https://huggingface.co/spaces/bishmoy/Arxiv-CS-RAG
- Papers.cool:
- https://papers.cool/arxiv/cs.LG: Machine Learning
- https://papers.cool/arxiv/cs.DC: Distributed, Parallel, and Cluster Computing
- Find Related Papers
- Connect Papers: https://www.connectedpapers.com/
- Semantic Scholar: https://www.semanticscholar.org/
- SIGCOMM'25 review results notification: Tuesday, April 29, 2025
- OSDI'25 Notification to authors: Tuesday, March 25, 2025
- Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution
- Fast State Restoration in LLM Serving with HCache
- Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
- JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
- Stateful Large Language Model Serving with Pensieve
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
- BCP: A Unified Checkpointing System for Large Foundation Model Development
- Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- Optimizing RLHF Training for Large Language Models with Stage Fusion
- Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters
- FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
- Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
- COMPSO: Optimizing Gradient Compression for Distributed Training with Second-Order Optimizers
- WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
- Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
- EDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
- LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
- PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- NetLLM: Adapting Large Language Models for Networking
- Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- Alibaba HPN: A Data Center Network for Large Language Model Training
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
- ASADI: Accelerating Sparse Attention Using Diagonal-based In-Situ Computing
- Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- Enabling Large Dynamic Neural Network Training with Learning-based Memory Management
- LibPreemptible: Enabling Fast, Adaptive, and Hardware-Assisted User-Space Scheduling
- TinyTS: Memory-Efficient TinyML Model Compiler Framework on Microcontrollers
- GPU Scale-Model Simulation
- LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- Reducing Energy Bloat in Large Model Training
- ReCycle: Pipeline Adaptation for the Resilient Distributed Training of Large DNNs
- Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
- Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10
- Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- Power-aware Deep Learning Model Serving with μ-Serve
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- Model Selection for Latency-Critical Inference Serving
- Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation
- CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- Blox: A Modular Toolkit for Deep Learning Schedulers
- DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- GMorph: Accelerating Multi-DNN Inference via Model Fusion
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- ZKML: An Optimizing System for ML Inference in Zero-Knowledge Proofs
- 8-bit Transformer Inference and Fine-tuning for Edge Accelerators
- AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- Characterizing Power Management Opportunities for LLMs in the Cloud
- FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture
- Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning
- FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing
- NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve link []
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving link [PKU]
- Fairness in Serving Large Language Models link [Ion Stoica]
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models link [Serverless]
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- Llumnix: Dynamic Scheduling for Large Language Model Serving
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- Fairness in Serving Large Language Models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
- MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- to be updated
- (NSDI'24) MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs link [Training | ByteDance]
- (NSDI'24) DISTMM: Accelerating Distributed Multimodal Model Training link [Multimodal | Amazon]
- (NSDI'24) Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models link []
- (NSDI'24) Swing: Short-cutting Rings for Higher Bandwidth Allreduce link [Allreduce]
- (NSDI'24) Vulcan: Automatic Query Planning for Live ML Analytics link [Planning]
- (NSDI'24) CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters link [Communication]
- (NSDI'24) Towards Domain-Specific Network Transport for Distributed DNN Training link [Training | DNN]
- (MLSys'24) HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices paper [Inference | Parallelism | NUS]
- (MLSys'24) FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics paper [Inference | Tsinghua | SJTU]
- (MLSys'24) Vidur: A Large-Scale Simulation Framework for LLM Inference - [Inference | Simulation Framework | Microsoft]
- (MLSys'24) UniDM: A Unified Framework for Data Manipulation with Large Language Models paper [Inference | Memory | Long Context | Alibaba]
- (MLSys'24) SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models paper [Serving | MoE]
- (MLSys'24) Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference paper [Inference | KV Cache]
- (MLSys'24) Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache paper [Inference | KV Cache]
- (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale]
- (MLSys'24) S-LoRA: Serving Thousands of Concurrent LoRA Adapters paper code [Serving | LoRA | Stanford | Berkeley] (see the batched-LoRA sketch after this list)
- (MLSys'24) Punica: Multi-Tenant LoRA Serving paper code [Serving | LoRA | Tianqi Chen]
- (MLSys'24) AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration paper code [Quantization | MIT]
- (MLSys'24) Efficient Post-training Quantization with FP8 Formats paper [Quantization | Intel]
- (MLSys'24) Does Compressing Activations Help Model Parallel Training? paper [Quantization]
- (MLSys'24) Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving paper code [Quantization | Serving | SJTU | CMU]
- (MLSys'24) QMoE: Sub-1-Bit Compression of Trillion Parameter Models paper code [Quantization | MoE | Google]
- (MLSys'24) Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication - [Training | MoE | HKU]
- (MLSys'24) DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines - [Training | Diffusion | HKU]
- (MLSys'24) FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms paper code [MLsys | UIUC]
- (MLSys'24) ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time paper [Compiling | Batching | CMU]
- (MLSys'24) On Latency Predictors for Neural Architecture Search paper [Google]
- (MLSys'24) vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs paper [DNN Inference | PKU]
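The two LoRA-serving entries above (S-LoRA and Punica) both rely on the fact that every request shares one base weight matrix and differs only by a small low-rank adapter. The NumPy sketch below is a minimal illustration of that per-request low-rank update; the function name, shapes, and the naive Python loop are assumptions for clarity, not either system's actual batched kernel.

```python
# Minimal, illustrative LoRA serving sketch (assumptions only, not S-LoRA/Punica code).
import numpy as np

def lora_batched_forward(x, W, adapters, adapter_ids, scaling=1.0):
    """x: (batch, d_in); W: shared base weight (d_in, d_out);
    adapters: {adapter_id: (A of shape (d_in, r), B of shape (r, d_out))};
    adapter_ids: one adapter id per request in the batch."""
    y = x @ W                               # one shared GEMM for the whole batch
    for i, aid in enumerate(adapter_ids):   # per-request low-rank correction
        A, B = adapters[aid]
        y[i] += scaling * (x[i] @ A) @ B    # rank-r update: O(d*r) extra work per request
    return y

# Tiny usage example with random weights and two adapters.
d_in, d_out, rank = 512, 512, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)).astype(np.float32) * 0.02
adapters = {
    "adapter-0": (rng.standard_normal((d_in, rank)).astype(np.float32) * 0.02,
                  rng.standard_normal((rank, d_out)).astype(np.float32) * 0.02),
    "adapter-1": (rng.standard_normal((d_in, rank)).astype(np.float32) * 0.02,
                  rng.standard_normal((rank, d_out)).astype(np.float32) * 0.02),
}
x = rng.standard_normal((4, d_in)).astype(np.float32)
out = lora_batched_forward(x, W, adapters, ["adapter-0", "adapter-1", "adapter-0", "adapter-1"])
print(out.shape)  # (4, 512)
```

In the real systems, the Python loop is replaced by custom batched GPU kernels so that requests using different adapters can still be served in one forward pass over the shared base model.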
- (arxiv) RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation paper
- (NSDI'24) Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining paper
- (SIGCOMM'24) CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving paper
- (EuroSys'25) CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion paper code
- (EuroSys'25) Fast State Restoration in LLM Serving with HCache paper
- (OSDI'24) Parrot: Efficient Serving of LLM-based Applications with Semantic Variable paper
- (arxiv) RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation paper code
- (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale] (see the generic KV-reuse sketch below)
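The entries in this last list (RAGCache, CacheGen, CacheBlend, HCache, Parrot, Prompt Cache) all revolve around storing, compressing, or restoring KV caches so that previously processed prompt text does not have to be recomputed from scratch. As a rough orientation only, here is a minimal sketch of the generic prefix KV-cache reuse pattern these systems build on; the `KVCacheStore` class and the `model.prefill` / `model.concat_kv` hooks are hypothetical placeholders, not any listed system's API.

```python
# Minimal, illustrative prefix KV-cache reuse sketch (hypothetical interfaces).
import hashlib
from dataclasses import dataclass, field

@dataclass
class KVCacheStore:
    """Maps a hash of a token-id prefix to its previously computed K/V tensors."""
    entries: dict = field(default_factory=dict)

    @staticmethod
    def _key(tokens):
        # Hash the exact token prefix; real systems typically key on fixed-size blocks.
        return hashlib.sha256(repr(list(tokens)).encode()).hexdigest()

    def lookup_longest_prefix(self, tokens):
        """Return (num_cached_tokens, kv) for the longest cached prefix, else (0, None)."""
        for end in range(len(tokens), 0, -1):
            kv = self.entries.get(self._key(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

    def insert(self, tokens, kv):
        self.entries[self._key(tokens)] = kv

def prefill_with_reuse(model, tokens, store):
    """Prefill that computes attention only for the uncached suffix of the prompt.

    `model.prefill(suffix, past_kv)` and `model.concat_kv(a, b)` stand in for a
    real inference engine; they are assumptions of this sketch."""
    cached_len, past_kv = store.lookup_longest_prefix(tokens)
    suffix = tokens[cached_len:]
    if not suffix:
        return past_kv                            # entire prompt was already cached
    new_kv = model.prefill(suffix, past_kv)       # compute K/V only for new tokens
    full_kv = model.concat_kv(past_kv, new_kv) if past_kv is not None else new_kv
    store.insert(tokens, full_kv)                 # make the longer prefix reusable later
    return full_kv
```

The exact-prefix matching above is the simplest variant; the systems listed differ mainly in where the cache lives (GPU, host memory, SSD, or across the network) and in how far they relax the exact-prefix requirement, e.g. CacheBlend fuses cached KV of non-prefix retrieved chunks.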