Onboarding the federated learning into multi-cluster environment(ACM) #1

Open
yanmxa opened this issue Dec 17, 2024 · 4 comments

Comments

@yanmxa

yanmxa commented Dec 17, 2024

Containerize the Collaborator and Aggregator Client

Training a neural network often involves third-party packages like torch, which significantly increase the image size. In many cases, the image size can exceed 1GB.

For example:

299M    /Users/yanmeng/anaconda3/envs/fl/lib/python3.11/site-packages/torch
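The figure above looks like `du` output. As a rough cross-check, the sketch below sums file sizes for any installed package using only the standard library. The helper name `package_size_bytes` is hypothetical, and the result will differ slightly from `du`, which reports disk blocks rather than byte counts.

```python
import importlib.util
import os


def package_size_bytes(package_name: str) -> int:
    """Sum the sizes of all files belonging to an installed package."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ModuleNotFoundError(f"{package_name} is not installed")

    # Single-file module: report the size of that file, if it exists on disk.
    if not spec.submodule_search_locations:
        if spec.origin and os.path.isfile(spec.origin):
            return os.path.getsize(spec.origin)
        return 0  # built-in or frozen module with no file of its own

    total = 0
    for root in spec.submodule_search_locations:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                if not os.path.islink(path):  # skip symlinks to avoid double counting
                    total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # "json" is used only because it ships with Python; swap in "torch" or "sklearn".
    print(f"{package_size_bytes('json') / 1024:.0f} KiB")
```

Running it for both `torch` and `sklearn` in the same environment gives a quick comparison before building the image.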

To simplify the initial setup, I used the sklearn package to build a lightweight Logistic Regression model to get started.

Other issues when containerizing the application:

Federated Learning

  • Operand: Namespaced or Cluster
  • Samples: https://github.com/yanmxa/federated-learning/tree/main/controller/config/samples
  • Differences from the proposal
    • Layer configuration is disabled

    • Clients -> Client: Use a single "client" configuration instead of defining individual details for each client. This ensures scalability and flexibility by avoiding repetitive definitions.

    • Remove client replicas:

      • Each client should not have replicas. Clients are split based on data distribution, not replication.
      • For a single data source, there should be exactly one client (model). Creating unnecessary replicas wastes training resources and can degrade accuracy during aggregation.
    • Use Placement to schedule the client to specific clusters (selected by the key of the Label or ClusterClaim from the managed cluster)

    • Use the value of the Label or ClusterClaim from the managed cluster to locate the data metadata for the client
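To illustrate the point about client replicas, here is a minimal sketch of sample-weighted averaging (FedAvg style) in plain Python. It assumes the aggregator counts every reporting client separately, which may not hold for every framework; the function name `fed_avg` and the toy weights are made up for the example.

```python
def fed_avg(updates):
    """Sample-weighted average of client weights.

    updates: list of (weights, num_samples) tuples, one per reporting client.
    """
    total_samples = sum(num_samples for _, num_samples in updates)
    dim = len(updates[0][0])
    return [
        sum(weights[i] * num_samples for weights, num_samples in updates) / total_samples
        for i in range(dim)
    ]


# Two data sources, one client each (toy values, not real model weights).
client_a = ([1.0, 0.0], 100)
client_b = ([0.0, 1.0], 100)

balanced = fed_avg([client_a, client_b])

# A replica of client A trains on the same data and reports the same update,
# so data source A is counted twice in the average.
skewed = fed_avg([client_a, client_a, client_b])

print(balanced)  # [0.5, 0.5]
print(skewed)    # first weight is 2/3, second is 1/3
```

With one client per data source the two sources contribute equally; with a replica, source A pulls the global model toward itself while also doubling the training cost.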

@yanmxa yanmxa changed the title Onboarding the federated learning into multi-cluster environment(open-cluster-managment) Onboarding the federated learning into multi-cluster environment(ACM) Dec 17, 2024
@yanmxa

yanmxa commented Dec 17, 2024

TODO:

1. addon cluster -> fl - client

  1. controller -> ManifestWorkReplicaSet -> cluster namespace

  2. namespaced scope -> clusterset, placement

  3. result -> model save(s3, kubernetes ...)

@yanmxa

yanmxa commented Jan 6, 2025

Jan 5, 2025

  • The model with more parameters
  • Data Privacy -> Case
  • Detail the case before the demo

@yanmxa

yanmxa commented Feb 11, 2025

Feb 11, 2025

  1. Multilayer
Image
  • Does the customer require this design?
    I reviewed different Federated Learning frameworks; none of them currently supports this kind of architecture.

  • Gaps:

    • Federated Learning: Service B needs to function as both an aggregator and collaborator, but FL frameworks don’t support this dual functionality at the moment. This will require integrating these separate features, which may involve delving into the details and implementation of each FL framework.
    • OCM: The Placement API needs to be able to schedule workloads on managed clusters. Alternatively, we may need to treat the managed cluster as a hub cluster to support this functionality, similar to the global hub scenarios.
  2. Leaf nodes setup - is node -> hub FL process?
Image

Note: Currently, we lack a secure connection between the client and server. If we allow external nodes or devices to run the client or collaborator, we’ll need to enable a TLS connection to secure the communication.
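As a rough illustration of what "enable a TLS connection" means on the client side, the sketch below builds a verifying TLS context with Python's standard `ssl` module. It is not tied to any FL framework (those usually run over gRPC and expose their own TLS settings); the function name `build_client_tls_context` and the optional `ca_file` parameter are assumptions for the example.

```python
import ssl


def build_client_tls_context(ca_file=None):
    """Create a client-side TLS context that verifies the server certificate.

    ca_file: optional path to a private CA bundle (hypothetical parameter);
             when None, the system trust store is used.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse legacy protocol versions.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = build_client_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # certificate verification is on
print(context.check_hostname)                    # hostname checking is on
```

The important part is that the client verifies the server certificate and hostname by default; an external node would additionally need the CA that signed the server's certificate.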

  3. Reference other FL frameworks for integrating with the multi-cluster environment: Standardized Interface for FL Frameworks to Support Multi-Cluster Environments #3

  4. Lack of a secure connection

  5. Local model metrics: Add Observability and Metrics for Federated Learning API #4

  6. Add more frameworks: [GSoC 2025] Privacy-preserving and efficient AI model training across multi-clusters open-cluster-management-io/ocm#825

  7. Provide an interface to use a pre-trained model: eeb13ac
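For the multilayer gap above (Service B acting as both aggregator and collaborator), the following is only a conceptual sketch in plain Python, not how any existing FL framework implements it. It assumes sample-weighted averaging at every tier; all names and numbers are hypothetical.

```python
def weighted_average(updates):
    """Return (averaged_weights, total_samples) for a list of (weights, num_samples)."""
    total_samples = sum(num_samples for _, num_samples in updates)
    dim = len(updates[0][0])
    averaged = [
        sum(weights[i] * num_samples for weights, num_samples in updates) / total_samples
        for i in range(dim)
    ]
    return averaged, total_samples


# Leaf clients that report to Service B (toy values).
leaf_updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]

# Service B as aggregator: combine its leaves into one update.
service_b_update = weighted_average(leaf_updates)

# Service B as collaborator: send that update, with its total sample count,
# to the top-level aggregator alongside a directly connected client.
direct_client = ([5.0, 6.0], 60)
global_weights, global_samples = weighted_average([service_b_update, direct_client])

# Reference: aggregating all three clients in a single flat tier.
flat_weights, _ = weighted_average(leaf_updates + [direct_client])

print(global_weights, global_samples)
print(flat_weights)
```

Under this assumption the two-tier result matches flat aggregation as long as the intermediate tier forwards its total sample count, which is the piece of state a dual-role service would have to carry.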

@yanmxa

yanmxa commented Feb 25, 2025

Answer 7: Add pre-trained model

  1. Configure the init model
cat <<EOF | oc apply -f -
apiVersion: federation-ai.open-cluster-management.io/v1alpha1
kind: FederatedLearning
metadata:
  name: federated-learning-sample
spec:
  framework: flower
  server:
    ...
    storage:
      type: PersistentVolumeClaim
      name: model-pvc
      path: /data/models/model_round_1_2025-02-25-14-27-46.pth
      size: 2Gi
  client:
    ...
EOF
  2. Verification
Image
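When several checkpoints accumulate on the volume, a small helper can pick the one to use as the init model. This sketch assumes every checkpoint follows the same naming pattern as the path in the sample above (`model_round_<round>_<timestamp>.pth`); that pattern is inferred from a single example, and the helper names and the second path are hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the single example path; adjust if the real naming differs.
CHECKPOINT_PATTERN = re.compile(
    r"model_round_(\d+)_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.pth$"
)


def parse_checkpoint(path):
    """Return (round_number, saved_at) for a checkpoint path, or None if it does not match."""
    match = CHECKPOINT_PATTERN.search(path)
    if match is None:
        return None
    saved_at = datetime.strptime(match.group(2), "%Y-%m-%d-%H-%M-%S")
    return int(match.group(1)), saved_at


def latest_checkpoint(paths):
    """Pick the path with the highest round, breaking ties by timestamp."""
    parsed = [(parse_checkpoint(path), path) for path in paths]
    parsed = [(info, path) for info, path in parsed if info is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda item: item[0])[1]


example = "/data/models/model_round_1_2025-02-25-14-27-46.pth"
print(parse_checkpoint(example))
```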
