awesome-papers

Useful websites


TODO list

  • SIGCOMM'25 review results notification: Tuesday, April 29, 2025
  • OSDI'25 notification to authors: Tuesday, March 25, 2025

arXiv papers (updated daily)


  • Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution
  • Fast State Restoration in LLM Serving with HCache
  • Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters
  • JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs
  • Stateful Large Language Model Serving with Pensieve
  • CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
  • SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
  • BCP: A Unified Checkpointing System for Large Foundation Model Development

  • Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
  • Optimizing RLHF Training for Large Language Models with Stage Fusion
  • Minder: Faulty Machine Detection for Large-scale Distributed Model Training
  • Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters

Deep Neural Networks

  • FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
  • Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism
  • COMPSO: Optimizing Gradient Compression for Distributed Training with Second-Order Optimizers

Large Language Models

  • WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
  • MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
  • ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training

  • Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
  • EDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration
  • BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
  • DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
  • Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
  • LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
  • PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Networking and Training

  • NetLLM: Adapting Large Language Models for Networking
  • Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
  • Alibaba HPN: A Data Center Network for Large Language Model Training
  • Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

  • Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
  • ASADI: Accelerating Sparse Attention Using Diagonal-based In-Situ Computing
  • Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
  • Enabling Large Dynamic Neural Network Training with Learning-based Memory Management
  • LibPreemptible: Enabling Fast, Adaptive, and Hardware-Assisted User-Space Scheduling
  • TinyTS: Memory-Efficient TinyML Model Compiler Framework on Microcontrollers
  • GPU Scale-Model Simulation

ML Inference

  • LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
  • Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
  • Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation

ML Training

  • Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
  • Reducing Energy Bloat in Large Model Training
  • ReCycle: Pipeline Adaptation for the Resilient Distributed Training of Large DNNs

Other

  • Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
  • Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10
  • Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor

ML Inference

  • Power-aware Deep Learning Model Serving with μ-Serve
  • Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
  • Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

ML Training

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
  • Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
  • FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences

  • Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
  • Model Selection for Latency-Critical Inference Serving
  • Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation
  • CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
  • Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
  • HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
  • Blox: A Modular Toolkit for Deep Learning Schedulers
  • DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
  • GMorph: Accelerating Multi-DNN Inference via Model Fusion
  • ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
  • ZKML: An Optimizing System for ML Inference in Zero-Knowledge Proofs

  • 8-bit Transformer Inference and Fine-tuning for Edge Accelerators
  • AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
  • Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
  • Characterizing Power Management Opportunities for LLMs in the Cloud
  • FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture
  • Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning
  • FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing
  • NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
  • PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
  • SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
  • SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration

LLM Serving

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve link []
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving link [PKU]
  • Fairness in Serving Large Language Models link [Ion Stoica]
  • ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models link [Serverless]
  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
  • Llumnix: Dynamic Scheduling for Large Language Model Serving

ML Scheduling

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
  • USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
  • Fairness in Serving Large Language Models
  • MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
  • MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

  • to be updated
  • (NSDI'24) MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs link [Training | ByteDance]
  • (NSDI'24) DISTMM: Accelerating Distributed Multimodal Model Training link [Multimodal | Amazon]
  • (NSDI'24) Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models link []
  • (NSDI'24) Swing: Short-cutting Rings for Higher Bandwidth Allreduce link [Allreduce]
  • (NSDI'24) Vulcan: Automatic Query Planning for Live ML Analytics link [Planning]
  • (NSDI'24) CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters link [Communication]
  • (NSDI'24) Towards Domain-Specific Network Transport for Distributed DNN Training link [Training | DNN]

LLM - serving

  • (MLSys'24) HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices paper [Inference | Parallelism | NUS]
  • (MLSys'24) FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics paper [Inference | Tsinghua | SJTU]
  • (MLSys'24) Vidur: A Large-Scale Simulation Framework for LLM Inference - [Inference | Simulation Framework | Microsoft]
  • (MLSys'24) UniDM: A Unified Framework for Data Manipulation with Large Language Models paper [Inference | Memory | Long Context | Alibaba]
  • (MLSys'24) SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models paper [Serving | MoE]
  • (MLSys'24) Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference paper [Inference | KV Cache]
  • (MLSys'24) Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache paper [Inference | KV Cache]
  • (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale]
  • (MLSys'24) SLoRA: Scalable Serving of Thousands of LoRA Adapters paper code [Serving | LoRA | Stanford | Berkeley]
  • (MLSys'24) Punica: Multi-Tenant LoRA Serving paper code [Serving | LoRA | Tianqi Chen]

LLM - training and quantization

  • (MLSys'24) AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration paper code [Quantization | MIT]
  • (MLSys'24) Efficient Post-training Quantization with FP8 Formats paper [Quantization | Intel]
  • (MLSys'24) Does Compressing Activations Help Model Parallel Training? paper [Quantization]
  • (MLSys'24) Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving paper code [Quantization | Serving | SJTU | CMU]
  • (MLSys'24) QMoE: Sub-1-Bit Compression of Trillion Parameter Models paper code [Quantization | MoE | Google]
  • (MLSys'24) Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication - [Training | MoE | HKU]
  • (MLSys'24) DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines - [Training | Diffusion | HKU]

ML Serving

  • (MLSys'24) FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms paper code [MLsys | UIUC]
  • (MLSys'24) ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time paper [Compiling | Batching | CMU]
  • (MLSys'24) On Latency Predictors for Neural Architecture Search paper [Google]
  • (MLSys'24) vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs paper [DNN Inference | PKU]

Retrieval-Augmented Generation

  • (arxiv) RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation paper
  • (NSDI'24) Fast Vector Query Processing for Large Datasets Beyond GPU Memory with Reordered Pipelining paper
  • (SIGCOMM'24) CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving paper
  • (EuroSys'25) CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion paper code
  • (EuroSys'25) Fast State Restoration in LLM Serving with HCache paper
  • (OSDI'24) Parrot: Efficient Serving of LLM-based Applications with Semantic Variable paper
  • (arxiv) RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation paper code
  • (MLSys'24) Prompt Cache: Modular Attention Reuse for Low-Latency Inference paper [Inference | KV Cache | Yale]
