- Arxiv-CS-RAG: https://huggingface.co/spaces/bishmoy/Arxiv-CS-RAG
- Papers.cool:
- https://papers.cool/arxiv/cs.LG: Machine Learning
- https://papers.cool/arxiv/cs.DC: Distributed, Parallel, and Cluster Computing
- Find Related Papers
- Connect Papers: https://www.connectedpapers.com/
- Semantic Scholar: https://www.semanticscholar.org/
- SIGCOMM'25 review results notification: Tuesday, April 29, 2025
- OSDI'25 Notification to authors: Tuesday, March 25, 2025
- Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution
- Fast State Restoration in LLM Serving with HCache
- Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
- JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
- Stateful Large Language Model Serving with Pensieve
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
- BCP: A Unified Checkpointing System for Large Foundation Model Development
- Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
- Optimizing RLHF Training for Large Language Models with Stage Fusion
- Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters
- FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
- Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
- COMPSO: Optimizing Gradient Compression for Distributed Training with Second-Order Optimizers
- WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
- Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
- EDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
- LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
- PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- NetLLM: Adapting Large Language Models for Networking
- Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- Alibaba HPN: A Data Center Network for Large Language Model Training
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
- ASADI: Accelerating Sparse Attention Using Diagonal-based In-Situ Computing
- Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- Enabling Large Dynamic Neural Network Training with Learning-based Memory Management
- LibPreemptible: Enabling Fast, Adaptive, and Hardware-Assisted User-Space Scheduling
- TinyTS: Memory-Efficient TinyML Model Compiler Framework on Microcontrollers
- GPU Scale-Model Simulation
- LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- Reducing Energy Bloat in Large Model Training
- ReCycle: Pipeline Adaptation for the Resilient Distributed Training of Large DNNs
- Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
- Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10
- Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- Power-aware Deep Learning Model Serving with μ-Serve
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- Model Selection for Latency-Critical Inference Serving
- Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation
- CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- Blox: A Modular Toolkit for Deep Learning Schedulers
- DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
- GMorph: Accelerating Multi-DNN Inference via Model Fusion
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- ZKML: An Optimizing System for ML Inference in Zero-Knowledge Proofs
- 8-bit Transformer Inference and Fine-tuning for Edge Accelerators
- AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- Characterizing Power Management Opportunities for LLMs in the Cloud
- FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture
- Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning
- FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing
- NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve link []
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving link [PKU]
- Fairness in Serving Large Language Models link [Ion Stoica]
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models link [Serverless]
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- Llumnix: Dynamic Scheduling for Large Language Model Serving
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- Fairness in Serving Large Language Models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
- MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- to be updated
- (NSDI'24) MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs link [Training | ByteDance]
- (NSDI'24) DISTMM: Accelerating Distributed Multimodal Model Training link [Multimodal | Amazon]
- (NSDI'24) Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models link []
- (NSDI'24) Swing: Short-cutting Rings for Higher Bandwidth Allreduce link [Allreduce]
- (NSDI'24) Vulcan: Automatic Query Planning for Live ML Analytics link [Planning]
- (NSDI'24) CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters link [Communication]
- (NSDI'24) Towards Domain-Specific Network Transport for Distributed DNN Training link [Training | DNN]
- (MLSys'24) HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices paper [Inference | Parallelism | NUS]
- (MLSys'24) FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics paper [Inference | Tsinghua | SJTU]
- (MLSys'24) Vidur: A Large-Scale Simulation Framework for LLM Inference - [Inference | Simulation Framework | Microsoft]
- (MLSys'24) UniDM: A Unified Framework for Data Manipulation with Large Language Models paper [Inference | Memory | Long Context | Alibaba]
- (MLSys'24) SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models paper [Serving | MoE]
- (MLSys'24) Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference paper [Inference | KV Cache]
- (MLSys'24) Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache paper [Inference | KV Cache]
- (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale]
- (MLSys'24) S-LoRA: Serving Thousands of Concurrent LoRA Adapters paper code [Serving | LoRA | Stanford | Berkeley] (see the batched-LoRA sketch after this list)
- (MLSys'24) Punica: Multi-Tenant LoRA Serving paper code [Serving | LoRA | Tianqi Chen]
- (MLSys'24) AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration paper code [Quantization | MIT]
- (MLSys'24) Efficient Post-training Quantization with FP8 Formats paper [Quantization | Intel]
- (MLSys'24) Does Compressing Activations Help Model Parallel Training? paper [Quantization]
- (MLSys'24) Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving paper code [Quantization | Serving | SJTU | CMU]
- (MLSys'24) QMoE: Sub-1-Bit Compression of Trillion Parameter Models paper code [Quantization | MoE | Google]
- (MLSys'24) Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication - [Training | MoE | HKU]
- (MLSys'24) DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines - [Training | Diffusion | HKU]
- (MLSys'24) FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms paper code [MLsys | UIUC]
- (MLSys'24) ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time paper [Compiling | Batching | CMU]
- (MLSys'24) On Latency Predictors for Neural Architecture Search paper [Google]
- (MLSys'24) vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs paper [DNN Inference | PKU]
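The two LoRA-serving entries above (S-LoRA and Punica) both rely on the fact that every request shares one base weight matrix and differs only by a small low-rank adapter. The NumPy sketch below is a minimal illustration of that per-request low-rank update; the function name, shapes, and the naive Python loop are assumptions for clarity, not either system's actual batched kernel.

```python
# Minimal, illustrative LoRA serving sketch (assumptions only, not S-LoRA/Punica code).
import numpy as np

def lora_batched_forward(x, W, adapters, adapter_ids, scaling=1.0):
    """x: (batch, d_in); W: shared base weight (d_in, d_out);
    adapters: {adapter_id: (A of shape (d_in, r), B of shape (r, d_out))};
    adapter_ids: one adapter id per request in the batch."""
    y = x @ W                               # one shared GEMM for the whole batch
    for i, aid in enumerate(adapter_ids):   # per-request low-rank correction
        A, B = adapters[aid]
        y[i] += scaling * (x[i] @ A) @ B    # rank-r update: O(d*r) extra work per request
    return y

# Tiny usage example with random weights and two adapters.
d_in, d_out, rank = 512, 512, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)).astype(np.float32) * 0.02
adapters = {
    "adapter-0": (rng.standard_normal((d_in, rank)).astype(np.float32) * 0.02,
                  rng.standard_normal((rank, d_out)).astype(np.float32) * 0.02),
    "adapter-1": (rng.standard_normal((d_in, rank)).astype(np.float32) * 0.02,
                  rng.standard_normal((rank, d_out)).astype(np.float32) * 0.02),
}
x = rng.standard_normal((4, d_in)).astype(np.float32)
out = lora_batched_forward(x, W, adapters, ["adapter-0", "adapter-1", "adapter-0", "adapter-1"])
print(out.shape)  # (4, 512)
```

In the real systems, the Python loop is replaced by custom batched GPU kernels so that requests using different adapters can still be served in one forward pass over the shared base model.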
- (arxiv) RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation paper
- (NSDI'24) Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining paper
- (SIGCOMM'24) CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving paper
- (EuroSys'25) CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion paper code
- (EuroSys'25) Fast State Restoration in LLM Serving with HCache paper
- (OSDI'24) Parrot: Efficient Serving of LLM-based Applications with Semantic Variable paper
- (arxiv) RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation paper code
- (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale] (see the generic KV-reuse sketch below)
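The entries in this last list (RAGCache, CacheGen, CacheBlend, HCache, Parrot, Prompt Cache) all revolve around storing, compressing, or restoring KV caches so that previously processed prompt text does not have to be recomputed from scratch. As a rough orientation only, here is a minimal sketch of the generic prefix KV-cache reuse pattern these systems build on; the `KVCacheStore` class and the `model.prefill` / `model.concat_kv` hooks are hypothetical placeholders, not any listed system's API.

```python
# Minimal, illustrative prefix KV-cache reuse sketch (hypothetical interfaces).
import hashlib
from dataclasses import dataclass, field

@dataclass
class KVCacheStore:
    """Maps a hash of a token-id prefix to its previously computed K/V tensors."""
    entries: dict = field(default_factory=dict)

    @staticmethod
    def _key(tokens):
        # Hash the exact token prefix; real systems typically key on fixed-size blocks.
        return hashlib.sha256(repr(list(tokens)).encode()).hexdigest()

    def lookup_longest_prefix(self, tokens):
        """Return (num_cached_tokens, kv) for the longest cached prefix, else (0, None)."""
        for end in range(len(tokens), 0, -1):
            kv = self.entries.get(self._key(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

    def insert(self, tokens, kv):
        self.entries[self._key(tokens)] = kv

def prefill_with_reuse(model, tokens, store):
    """Prefill that computes attention only for the uncached suffix of the prompt.

    `model.prefill(suffix, past_kv)` and `model.concat_kv(a, b)` stand in for a
    real inference engine; they are assumptions of this sketch."""
    cached_len, past_kv = store.lookup_longest_prefix(tokens)
    suffix = tokens[cached_len:]
    if not suffix:
        return past_kv                            # entire prompt was already cached
    new_kv = model.prefill(suffix, past_kv)       # compute K/V only for new tokens
    full_kv = model.concat_kv(past_kv, new_kv) if past_kv is not None else new_kv
    store.insert(tokens, full_kv)                 # make the longer prefix reusable later
    return full_kv
```

The exact-prefix matching above is the simplest variant; the systems listed differ mainly in where the cache lives (GPU, host memory, SSD, or across the network) and in how far they relax the exact-prefix requirement, e.g. CacheBlend fuses cached KV of non-prefix retrieved chunks.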