STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization

Overview

  • This is my second journal paper titled Speeding Up Local SGD with Straggler-Tolerant Synchronization.
  • The paper was published in IEEE Transactions on Signal Processing (TSP) in 2024.
  • About this paper:
    • Focuses on distributed/federated learning with synchronous local SGD.
    • Aims to improve robustness of federated systems against stragglers.
    • Proposes a novel local SGD strategy, STSyn, with the following key features:
      • Waits for only the $K$ fastest workers to finish their $U$ local updates, while keeping every worker computing throughout the round.
      • Utilizes all effective (completed) local updates from every worker, even with stragglers.
    • Provides rigorous convergence rates for nonconvex objectives under both homogeneous and heterogeneous data distributions.
    • Validates the algorithm through simulations and investigates the impact of system hyperparameters.

Key Ideas of STSyn

  • The system consists of $M$ workers, each performing $U$ local updates per round.
  • The server waits for the $K$-th fastest worker to finish $U$ updates.
  • Key Concept: No worker stops computing until the $K$-th fastest one completes $U$ updates.
  • Workers that have completed at least one update upload their models to the server for aggregation.

Algorithm Illustration

(Figure: timeline of one STSyn round with $M=4$ workers, $U=3$ local updates, and $K=3$.)

  • Example: $M=4$, $U=3$, $K=3$.
    • Workers 1, 2, and 3 are the fastest $K=3$ workers to complete $U=3$ updates in round 0.
    • Red arrows: Additional updates performed by the fastest $K-1=2$ workers.
    • Light blue arrows: Straggling updates that are cancelled.
    • All 4 workers upload their models, as each completes at least one update.

Pseudocode

Below is the pseudocode for the STSyn algorithm:

(Figure: pseudocode of the STSyn algorithm.)
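
Since the pseudocode is embedded as an image, the following is a minimal, timing-only Python sketch of one STSyn round, assuming i.i.d. exponential update times (the same assumption used in the analysis below). The function `stsyn_round` and its parameters are illustrative, not the repository's code; model updates and aggregation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def stsyn_round(M=4, U=3, K=3, rate=1.0):
    """Simulate the timing of one STSyn round (illustrative sketch).

    Each local update takes an i.i.d. Exp(rate) amount of time. The
    round ends when the K-th fastest worker finishes U updates; until
    then, every worker keeps computing additional updates.
    """
    # Cumulative finish times of each worker's successive updates.
    # Fast workers may exceed U updates, so draw a generous buffer.
    max_updates = 10 * U
    finish = np.cumsum(rng.exponential(1.0 / rate, size=(M, max_updates)), axis=1)

    # Round length: when the K-th fastest worker completes its U-th update.
    t_round = np.sort(finish[:, U - 1])[K - 1]

    # Updates each worker completed by t_round (in-progress ones are cancelled).
    updates = (finish <= t_round).sum(axis=1)

    # Every worker with at least one completed update uploads its model.
    uploaders = np.flatnonzero(updates >= 1)
    return t_round, updates, uploaders

t, u, up = stsyn_round()
print(f"round time {t:.2f}, updates per worker {u}, uploaders {up}")
```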

Analysis

Average Wall-Clock Time, Number of Local Updates, and Number of Uploading Workers

  • Assuming the time for a single local update at each worker is exponentially distributed, we provide closed-form expressions for the following quantities (a numerical sanity check follows this list):
    • The average wall-clock time per round.
    • The average number of local updates per worker per round.
    • The average number of uploading workers per round.
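
A quick Monte Carlo check of these three quantities can reuse the hypothetical `stsyn_round` sketch above; the paper's contribution is the closed-form expressions, which this snippet only approximates numerically under the same exponential assumption:

```python
import numpy as np

# Monte Carlo estimates of the three per-round averages, reusing the
# illustrative stsyn_round() sketch above (not the paper's derivation).
times, upd, nup = [], [], []
for _ in range(10_000):
    t, u, up = stsyn_round(M=4, U=3, K=3, rate=1.0)
    times.append(t)        # wall-clock length of the round
    upd.append(u.mean())   # local updates per worker this round
    nup.append(len(up))    # number of uploading workers

print(f"avg wall-clock time per round: {np.mean(times):.3f}")
print(f"avg local updates per worker : {np.mean(upd):.3f}")
print(f"avg uploading workers        : {np.mean(nup):.3f}")
```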

Convergence Analysis

  • Heterogeneous Data Distributions:

    • The average expected squared gradient norm for nonconvex objectives is upper bounded by:

      $$O\left(\frac{1}{\sqrt{K\bar{U} J}} + \frac{K}{\bar{U}^3 J}\right)$$

      where:

      • $K$: Number of workers the server waits for.
      • $\bar{U}$: Average local updates per worker per round.
      • $J$: Total number of communication rounds.
  • Homogeneous Data Distributions:

    • The same upper bound $O\left(\frac{1}{\sqrt{K\bar{U} J}} + \frac{K}{\bar{U}^3 J}\right)$ holds; data homogeneity does not change the order of the rate.
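
As a rough illustration of the role of $K$, the toy snippet below evaluates the two terms of the bound with constants dropped and $\bar{U}$ held fixed (in STSyn, $\bar{U}$ itself typically grows with $K$, so this is only directional):

```python
import numpy as np

# Order-of-magnitude look at the two terms of the convergence bound
# O(1/sqrt(K*U_bar*J) + K/(U_bar**3 * J)). Constants are dropped and
# U_bar is held fixed for illustration, although it depends on K in STSyn.
U_bar, J = 3.0, 1000
for K in (1, 2, 4, 8, 16):
    term1 = 1.0 / np.sqrt(K * U_bar * J)  # shrinks as K grows
    term2 = K / (U_bar**3 * J)            # grows linearly in K
    print(f"K={K:2d}  term1={term1:.4f}  term2={term2:.5f}")
```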

Simulation Results

(Figure: simulation results comparing STSyn with benchmark algorithms under i.i.d. and non-i.i.d. data.)

  • Numerical experiments validate that the STSyn algorithm achieves superior time and communication efficiency under both i.i.d. and non-i.i.d. data distributions among workers.

For more details, refer to the full paper on IEEE Xplore.

Codes

  • Comparison.py:
    • Compares STSyn with state-of-the-art algorithms on the CIFAR-10 dataset using a three-layer CNN.
  • impact_K:
    • Explores the impact of the hyperparameter $K$.
  • impact_U:
    • Explores the impact of the hyperparameter $U$.
