Onboarding the federated learning into multi-cluster environment(ACM) #1

yanmxa · 2024-12-17T02:40:11Z

Containerize the Collaborator and Aggregator Client

Training a neural network often involves third-party packages like torch, which significantly increase the image size. In many cases, the image size can exceed 1GB.

For example:

299M    /Users/yanmeng/anaconda3/envs/fl/lib/python3.11/site-packages/torch

To simplify the initial setup, I used the sklearn package to build a lightweight Logistic Regression model for the startup.
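A minimal sketch of what such a lightweight model could look like, assuming scikit-learn and NumPy are installed. The synthetic data and the `get_parameters`/`set_parameters` helpers are illustrative assumptions, not code from the repository; they only show that the state a client would exchange is a small coefficient matrix and an intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def get_parameters(model):
    """Return copies of the fitted weights a client could send to the aggregator."""
    return [model.coef_.copy(), model.intercept_.copy()]


def set_parameters(model, params):
    """Load (for example, aggregated) weights back into an already fitted model."""
    model.coef_ = params[0].copy()
    model.intercept_ = params[1].copy()
    return model


# Synthetic stand-in data (assumption): 200 samples, 4 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=200)
model.fit(X, y)
params = get_parameters(model)
```

For a binary problem with 4 features, the exchanged state is one `(1, 4)` array plus one `(1,)` array, which keeps both the image and the per-round payload small.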

Other issues when containerizing the application:

Federated Learning

Operand: Namespaced or Cluster
Samples: https://github.com/yanmxa/federated-learning/tree/main/controller/config/samples
Differences from the proposal:
- Layer configuration is disabled
- Clients -> Client: Use a single "client" configuration instead of defining individual details for each client. This ensures scalability and flexibility by avoiding repetitive definitions.
- Remove client replicas:
  - Each client should not have replicas. Clients are split based on data distribution, not replication.
  - For a single data source, there should be exactly one client (model). Creating unnecessary replicas wastes training resources and can degrade accuracy during aggregation.
- Use a Placement to schedule the client to specific clusters, selected by the key of a Label or ClusterClaim on the managed cluster.
- Use the value of that Label or ClusterClaim to locate the data metadata for the client.
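The last two items can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the controller's implementation: the `ManagedCluster` class, the `plan_clients` helper, and the label key are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedCluster:
    """Hypothetical stand-in for a managed cluster and its labels/claims."""
    name: str
    labels: dict = field(default_factory=dict)


def plan_clients(clusters, key):
    """Schedule exactly one client per cluster that carries `key`.

    The presence of the key selects the cluster; its value tells the
    client where to find its local data.
    """
    return {c.name: c.labels[key] for c in clusters if key in c.labels}


# Hypothetical label key used only for this example.
KEY = "federated-learning-sample.client-data"

clusters = [
    ManagedCluster("cluster1", {KEY: "/data/private/cluster1"}),
    ManagedCluster("cluster2", {KEY: "/data/private/cluster2"}),
    ManagedCluster("cluster3", {"env": "dev"}),  # no key, so no client is scheduled
]

plan = plan_clients(clusters, KEY)
```

Because the single client template is expanded per selected cluster, adding a data source only requires labeling another cluster, and no cluster ever receives more than one client.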


yanmxa · 2024-12-17T06:35:01Z

TODO:

~~1. addon cluster -> fl - client~~

controller -> manifestwork replicaset -> cluster namespace
namespaced scope -> clusterset, placement
result -> model save(s3, kubernetes ...)

yanmxa · 2025-01-06T06:53:52Z

Jan 5, 2025

The model with more parameters
Data Privacy -> Case
Detail the case before the demo

yanmxa · 2025-02-11T06:10:29Z

Feb 11, 2025

Multilayer

Does the customer require this design?
Review different Federated Learning frameworks—none of them currently support this kind of architecture.
Gaps:
- Federated Learning: Service B needs to function as both an aggregator and collaborator, but FL frameworks don’t support this dual functionality at the moment. This will require integrating these separate features, which may involve delving into the details and implementation of each FL framework.
- OCM: The Placement API needs to be able to schedule workloads on managed clusters. Alternatively, we may need to treat the managed cluster as a hub cluster to support this functionality, similar to the global hub scenarios.
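To make the first gap concrete, here is a rough sketch of the dual role Service B would play, written without any FL framework. It assumes sample-weighted averaging (FedAvg-style) as the aggregation rule; the function name and the numbers are hypothetical.

```python
def weighted_average(updates):
    """Aggregate (weights, num_samples) pairs into one (weights, num_samples) pair.

    Returning the summed sample count lets the result be forwarded upward
    as if it were a single collaborator update.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    return avg, total


# Service B acts as an aggregator for its leaf nodes...
leaf_updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
service_b_update = weighted_average(leaf_updates)

# ...and as a collaborator towards the top-level aggregator.
other_update = ([0.0, 0.0], 40)
global_update = weighted_average([service_b_update, other_update])

# Flat aggregation over all three sources, for comparison.
flat_update = weighted_average(leaf_updates + [other_update])
```

With sample-count weighting, the two-level result matches the flat one, so the middle layer does not change the outcome in this simple case. The open question in the gap above is getting an existing framework to run both roles in one service.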

Leaf nodes setup - is node -> hub FL process?

Note: Currently, we lack a secure connection between the client and server. If we allow external nodes or devices to run the client or collaborator, we’ll need to enable a TLS connection to secure the communication.
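As a starting point for the client side, a TLS context could be built with Python's standard `ssl` module as sketched below. The helper name and its parameters are assumptions for illustration; how the context is handed to the FL framework's transport depends on the framework and is not shown.

```python
import ssl


def client_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context that verifies the server certificate.

    ca_file: optional CA bundle for a private CA (system defaults otherwise).
    cert_file/key_file: optional client certificate for mutual TLS.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = client_tls_context()
```

Passing a client certificate enables mutual TLS, which would also let the server authenticate external collaborators rather than only encrypting the channel.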

Reference other FL frameworks when integrating with the multi-cluster environment: Standardized Interface for FL Frameworks to Support Multi-Cluster Environments #3
Lack of a secure connection
Local Model Metrics: Add Observability and Metrics for Federated Learning API #4
Add more frameworks: [GSoC 2025] Privacy-preserving and efficient AI model training across multi-clusters open-cluster-management-io/ocm#825
Provide an interface to use a pretrained model: eeb13ac

yanmxa · 2025-02-25T14:41:04Z

Answer 7: Add pre-trained model

Configure the init model

cat <<EOF | oc apply -f -
apiVersion: federation-ai.open-cluster-management.io/v1alpha1
kind: FederatedLearning
metadata:
  name: federated-learning-sample
spec:
  framework: flower
  server:
    ...
    storage:
      type: PersistentVolumeClaim
      name: model-pvc
      path: /data/models/model_round_1_2025-02-25-14-27-46.pth
      size: 2Gi
  client:
    ...
EOF
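The `path` above suggests checkpoints are named `model_round_<round>_<timestamp>.pth`. Assuming that pattern holds (it is inferred from this single file name, not from documentation), a small helper could pick the most recent checkpoint to use as the initial model. The helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the example path; treat it as an assumption.
CHECKPOINT_RE = re.compile(
    r"^model_round_(\d+)_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.pth$"
)


def parse_checkpoint(path):
    """Return (round, timestamp) for a checkpoint path, or None if it does not match."""
    match = CHECKPOINT_RE.match(PurePosixPath(path).name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def latest_checkpoint(paths):
    """Pick the checkpoint with the highest round, then the latest timestamp."""
    parsed = [(parse_checkpoint(p), p) for p in paths]
    parsed = [(meta, p) for meta, p in parsed if meta is not None]
    return max(parsed)[1] if parsed else None


example = "/data/models/model_round_1_2025-02-25-14-27-46.pth"
info = parse_checkpoint(example)
```

The zero-padded timestamp format sorts correctly as a plain string, so no date parsing is needed for the comparison.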

Verification

yanmxa changed the title ~~Onboarding the federated learning into multi-cluster environment(open-cluster-managment)~~ Onboarding the federated learning into multi-cluster environment(ACM) Dec 17, 2024
