Efficient ML/DL implementations across multiple domains with K3s multi-node training setup

khaykingleb/research-playground

Research Playground

Inspired by ashleve/lightning-hydra-template and NVIDIA/NeMo

It's a library I created for efficient ML/DL research on various tasks.

Key features:

  • 🚀 Production-ready training pipelines
  • 🧠 Actual model implementations
  • ⚡️ Easy configuration management with Hydra
  • 📊 Experiment tracking with Weights & Biases
  • 🔧 Modular architecture for quick prototyping
  • 🐳 Docker support for reproducible environments
  • ☸️ Multi-GPU training with K3s and Terraform (soon)

Getting Started

  1. Install asdf to manage different tools' runtime versions.

  2. Update .env.example to suit your needs.

  3. Set up your training Hydra config in the configs/experiments/ folder.

  4. Choose whether to develop outside or inside a Docker container.

    • Outside of Docker (not recommended):

      make init-local
      poetry shell && python3 src train --experiment <experiment_name>
    • Inside Docker:

      make init && make build && make run
      python3 src train --experiment <experiment_name>
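
Before running either variant, it may help to confirm that the tools named in the steps above are on your PATH. The following is a minimal sketch, assuming a POSIX shell; the check_tools helper is hypothetical and not part of this repository, and the tool list is inferred from the commands shown above.

```shell
# Hypothetical helper (not part of this repository): report which of the
# given tools cannot be found on PATH.
check_tools() {
  missing=""
  for tool in "$@"; do
    # command -v exits non-zero when the tool is not found
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all tools found"
}

# Tool list is an assumption based on the steps above; trim it to the
# variant you use (poetry for local development, docker for the container).
check_tools asdf make docker poetry python3 || true
```

If a tool is reported as missing, install it before running make init-local or make init.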

Notes

  • Use make help to see all available commands.

  • Use python3 src --help to see all available CLI arguments.