Commit

correct with grammar check
chesterxgchen committed Feb 23, 2025
1 parent 56e825d commit 6716ec1
Showing 34 changed files with 355 additions and 719 deletions.
Original file line number Diff line number Diff line change
@@ -4,42 +4,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recap: Runing Federated Learning Applications\n",
"\n",
"\n",
"In this chapter, we will explore the process of running federated learning applications. We will start by setting up the environment and preparing the data, followed by training a classifier using PyTorch. We will then convert deep learning models to federated learning, customize server and client logic, and setup track experiments. Finally, we will delve into the job structure and configurations, including running a simulator, and conclude with a recap of the covered topics.\n",
"# Running Federated Learning Applications\n",
"\n",
"In this chapter, we will explore the process of running federated learning applications. We will start by setting up the environment and preparing the data, followed by training a classifier using PyTorch. We will then convert deep learning models to federated learning, customize server and client logic, and set up experiment tracking. Finally, we will delve into the job structure and configurations, including running a simulator, and conclude with a recap of the covered topics.\n",
"\n",
"1. **Running federated learning job**\n",
" * [Installation, prepare data](../01.1_running_federated_learning_job/setup.ipynb)\n",
" * [traing classifier with pytorch](../01.1_running_federated_learning_job/runing_pytorch_fl_job.ipynb)\n",
"\n",
"2. **From stand-alone-deep learning to Federated Learning**\n",
"\n",
" * [Convert deep learning with pytorch to federated leraning](../01.2_convert_deep_learning_to_federated_learning/convert_dl_to_fl.ipynb)\n",
"\n",
" * [Training classifier with PyTorch](../01.1_running_federated_learning_job/runing_pytorch_fl_job.ipynb)\n",
"\n",
"2. **How to Customize the Federated Algorithms**\n",
"2. **From stand-alone deep learning to Federated Learning**\n",
" * [Convert deep learning with PyTorch to federated learning](../01.2_convert_deep_learning_to_federated_learning/convert_dl_to_fl.ipynb)\n",
"\n",
" * [customize server logics](../01.3_customize_server_logics/customize_server_logics.ipynb)\n",
"3. **How to Customize the Federated Algorithms**\n",
" * [Customize server logic](../01.3_customize_server_logics/customize_server_logics.ipynb)\n",
"\n",
"4. **How to make adjustments to different traing parameters** \n",
"\n",
" * [customize client logics](../01.4_customize_client_training/customize_client_training.ipynb)\n",
"4. **How to make adjustments to different training parameters** \n",
" * [Customize client logic](../01.4_customize_client_training/customize_client_training.ipynb)\n",
"\n",
"5. **Tracking the training metrics** \n",
"\n",
" * [experiment tracking](../01.5_experiment_tracking/experiment_tracking.ipynb )\n",
" * [Experiment tracking](../01.5_experiment_tracking/experiment_tracking.ipynb)\n",
"\n",
"6. **Job structure and configurations**\n",
" * [Job structure & configuration](../01.6_job_structure_and_configuration/understanding_fl_job.ipynb)\n",
"\n",
" * [job structure & configuration ](../01.6_job_structure_and_configuration/understanding_fl_job.ipynb)\n",
"\n",
" \n",
"7. [Logging](../01.7_logging/logging.ipynb)\n",
"\n",
"\n",
"Let's get started with [Installation & data preparation](.././01.1_running_federated_learning_job/setup.ipynb)\n"
"Let's get started with [Installation & data preparation](../01.1_running_federated_learning_job/setup.ipynb)"
]
},
{
Original file line number Diff line number Diff line change
@@ -5,28 +5,32 @@
"id": "7a5c3d67-a6ea-4f59-84d2-effc3ef016e1",
"metadata": {},
"source": [
"# Runing Federated Learning Job with PyTorch\n",
" # Running Federated Learning Job with PyTorch\n",
"\n",
"We have installed the NVIDIA FLARE, dependencies, download the data, look at the data split in [previous step](01.1.1_setup.ipynb), now we are going to look at the training. "
"We have installed NVIDIA FLARE and its dependencies, downloaded the data, and looked at the data split in the [previous step](01.1.1_setup.ipynb). Now we are going to look at the training.\n"
]
},
{
"cell_type": "markdown",
"id": "eb3f04b0",
"metadata": {},
"source": [
"## Run Federated Learning Training code\n",
" ## Run Federated Learning Training Code\n",
"\n",
"The training code essentially consists of three files:\n",
"- `fl_job.py`: the main job flow\n",
"- `client.py`: the client-side training code\n",
"- `network.py`: the network model definition\n",
"\n",
"The training code essentially consists of `fl_job.py` code, `client.py` the client-side training code, and `nn.py` network model. We use the `FedAvg` algorithm and workflow.\n",
"We use the `FedAvg` algorithm and workflow.\n",
"\n",
"```markdown\n",
"## Run Federated Learning Training\n",
"\n",
"The training code consists of three main scripts: `fl_job.py` for the overall job flow, `client.py` for the client-side training, and `network.py` for the network model. We use the built-in `FedAvg` algorithm for the server side worklow.\n",
"The training code consists of three main scripts: `fl_job.py` for the overall job flow, `client.py` for the client-side training, and `network.py` for the network model. We use the built-in `FedAvg` algorithm for the server-side workflow.\n",
"```\n",
"\n",
"to run the training in simulator we can simply execute the fl_job.py"
"To run the training in the simulator, we can simply execute `fl_job.py`.\n"
]
},
{
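The `FedAvg` algorithm mentioned in this cell combines client models by weighted averaging. The following is a minimal sketch of that idea in plain Python, not NVIDIA FLARE's built-in implementation. The function name `fed_avg`, the parameter layout (a dict of flat float lists), and the use of per-client step counts as weights are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter layout: layer name -> flat list of floats.
Params = Dict[str, List[float]]


def fed_avg(updates: List[Tuple[Params, int]]) -> Params:
    """Return the weighted average of client parameters.

    Each update is (params, weight); the weight is assumed to be the
    client's contribution size, e.g. its number of local steps.
    """
    total_weight = sum(weight for _, weight in updates)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")

    first_params, _ = updates[0]
    averaged = {name: [0.0] * len(values) for name, values in first_params.items()}

    for params, weight in updates:
        share = weight / total_weight
        for name, values in params.items():
            for i, value in enumerate(values):
                averaged[name][i] += share * value
    return averaged


if __name__ == "__main__":
    client_a = ({"fc.weight": [1.0, 2.0]}, 100)
    client_b = ({"fc.weight": [3.0, 6.0]}, 300)
    print(fed_avg([client_a, client_b]))
```

A client that contributes more steps pulls the average further toward its own parameters, which is the design choice that distinguishes weighted averaging from a plain mean.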
@@ -62,18 +66,18 @@
"id": "e823b5d4",
"metadata": {},
"source": [
"## 3. Access the logs and results\n",
" ## 3. Access the Logs and Results\n",
"\n",
"You can find the running logs and results inside the simulator's workspace:\n",
"You can find the running logs and results inside the simulator's workspace.\n",
"\n",
"noticed the \"fl_job.py\", we used the code \n",
"Notice that in `fl_job.py`, we used the code:\n",
"\n",
"```\n",
"```python\n",
"job.simulator_run(\"/tmp/nvflare/jobs/workdir\")\n",
"\n",
"```\n",
"\n",
"The \"/tmp/nvflare/jobs/workdir\" is the workspace directory of simulator\n"
"The \"/tmp/nvflare/jobs/workdir\" is the workspace directory of the simulator.\n",
"\n"
]
},
{
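To see what the simulator wrote, one option is to walk the workspace directory and print every file in it. The sketch below uses only the Python standard library; the helper name `list_workspace_files` is hypothetical, and nothing is assumed about which files or subfolders the simulator actually creates.

```python
from pathlib import Path
from typing import List


def list_workspace_files(workspace: str) -> List[str]:
    """Return all files under `workspace` as sorted, relative POSIX-style paths."""
    root = Path(workspace)
    if not root.is_dir():
        # The job has not been run yet, or the path is wrong.
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Workspace path taken from the job.simulator_run(...) call in fl_job.py.
    for relative_path in list_workspace_files("/tmp/nvflare/jobs/workdir"):
        print(relative_path)
```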
@@ -92,7 +96,7 @@
"id": "d4b62108",
"metadata": {},
"source": [
"We have successfully train a federated image classification model with pytorch. Next we need to take closer look the training codes and job structure. Let's go to [converting deep learning to federated learning](../01.1.2_convert_deep_learning_to_federated_learning/convert_dl_to_fl.ipynb)\n"
"We have successfully trained a federated image classification model with PyTorch. Next, we need to take a closer look at the training code and job structure. Let's go to [converting deep learning to federated learning](../01.1.2_convert_deep_learning_to_federated_learning/convert_dl_to_fl.ipynb)."
]
}
],
Original file line number Diff line number Diff line change
@@ -7,9 +7,10 @@
"source": [
"# Setup and Preparation\n",
"\n",
"This example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629))\n",
"This is an example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629))\n",
"and [PyTorch](https://pytorch.org/) as the deep learning training framework.\n",
"\n",
"\n",
"We will use the train script [cifar10_fl.py](src/cifar10_fl.py) and network [net.py](src/net.py) from the src directory.\n",
"\n",
"The dataset will be [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and will load its data within the client train code."
@@ -99,7 +100,7 @@
"id": "bcba6293",
"metadata": {},
"source": [
"The program just take a root dataset_path and download the training and test dataset to the given root directory from torchvision dataset. Let run the code. "
" The program just takes a root dataset_path and downloads the training and test datasets to the given root directory from the torchvision dataset. Let's run the code."
]
},
{
@@ -137,8 +138,8 @@
"source": [
"### Split the data\n",
"\n",
"In real-world scenarios, the data will be distributed among different clients/sides. Since we are simulating the real-world data, we need to split the data into different clients/sites. How to split the data, \n",
"depending on the type of problem or type of data. For simplicity, in this example we assume all clients will have the same data for horizontal federated learning cases.\n",
"In real-world scenarios, the data will be distributed among different clients/sites. Since we are simulating real-world data, we need to split the data into different clients/sites. How to split the data\n",
"depends on the type of problem or type of data. For simplicity, in this example we assume all clients will have the same data for horizontal federated learning cases.\n",
"Thus we do not do a data split, but rather point all clients to the same data location.\n",
"\n",
"\n",
@@ -148,6 +149,7 @@
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
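The two options discussed in this cell, pointing every client at the same data or splitting the data among clients, can be sketched as follows. This is an illustration only: the helper names, the site names `site-1` and `site-2`, and the dataset path are hypothetical, and the even random split is just one of many possible strategies.

```python
import random
from typing import Dict, List


def shared_data_paths(site_names: List[str], dataset_path: str) -> Dict[str, str]:
    """No split: every site reads from the same dataset location."""
    return {site: dataset_path for site in site_names}


def iid_index_split(num_samples: int, site_names: List[str], seed: int = 0) -> Dict[str, List[int]]:
    """Shuffle the sample indices and deal them out evenly across sites."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    indices = list(range(num_samples))
    rng.shuffle(indices)
    num_sites = len(site_names)
    return {site: sorted(indices[i::num_sites]) for i, site in enumerate(site_names)}


if __name__ == "__main__":
    sites = ["site-1", "site-2"]  # hypothetical site names
    print(shared_data_paths(sites, "/tmp/nvflare/data"))  # hypothetical path
    split = iid_index_split(10, sites)
    print({site: len(idx) for site, idx in split.items()})
```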
@@ -156,7 +158,7 @@
"id": "316bae55",
"metadata": {},
"source": [
"Next Step, we will start to run training using simulation: [run pytorch federated learning job](../01.1_running_federated_learning_job/runing_pytorch_fl_job.ipynb)\n"
"Next step, we will start to run training using simulation: [run pytorch federated learning job](../01.1_running_federated_learning_job/runing_pytorch_fl_job.ipynb)\n"
]
},
{
Original file line number Diff line number Diff line change
@@ -7,9 +7,10 @@
"source": [
"# PyTorch Deep Learning to Federated Learning Conversion\n",
"\n",
"One common question frequently heard from data scientists is how do I wrote a federated learning ? If I already have training code already for deep learning? how do I write an federated learning training code for the same problem?\n",
"\n",
"In this section, we will look at the classification training code we ran earlier and see how to convert the existing the pytorch training script to federated Learning client training code\n",
"One common question frequently heard from data scientists is \"how do I write federated learning code? If I already have training code for deep learning, how do I write federated learning training code for the same problem?\"\n",
"\n",
"In this section, we will look at the classification training code we ran earlier and see how to convert the existing PyTorch training script to federated learning client training code.\n",
"\n",
"\n",
"## Orginal Deep learning Training Script"
@@ -61,21 +62,24 @@
"\n",
"we call \n",
"\n",
"```\n",
"```python\n",
"flare.init()\n",
"```\n",
"\n",
"Once the flare is initialized, we will recieve some system metadata for example\n",
"```\n",
"\n",
"```python\n",
"\n",
" sys_info = flare.system_info()\n",
" client_name = sys_info[\"site_name\"]\n",
"\n",
"```\n",
"We can get current client's \"identity\". \n",
"\n",
"Next we need to extends the trainig beyond local iterations. Image the Federated Learning is like the following for-loop: \n",
"Next we need to extend the training beyond local iterations. Imagine the Federated Learning is like the following for-loop:\n",
"\n",
"```python\n",
"\n",
"```\n",
"rounds = 5\n",
"for current_round in ranage (rounds):\n",
" \n",
@@ -94,33 +98,34 @@
"For each round: we need to receive and evaluate the global model. \n",
"\n",
"\n",
"**Step-4** Recive global model \n",
"**Step-4** Receive global model\n",
"\n",
"```python\n",
"\n",
"```\n",
" input_model = flare.receive()\n",
" round=input_model.current_round\n",
"\n",
" # update model based on global model\n",
" model.load_state_dict(input_model.params)\n",
"```\n",
"\n",
"**Step-5** Eveluate Global Model\n",
"**Step-5** Evaluate Global Model\n",
"\n",
"Since the local model is being updated with global model, the training procedure calculates the loss which evaluates the model\n",
"\n",
" Since the local model is being updated with global model, the training procedue caclate the loss which evaluate the model \n",
"\n",
"**Step-6** Send the local trained model back to aggregator\n",
"\n",
" we take the newly trained local model parameters as well as metadata, sned it back to aggregator. \n",
"We take the newly trained local model parameters as well as metadata, send it back to aggregator.\n",
"\n",
"```\n",
"```python\n",
"\n",
" output_model = flare.FLModel( params=model.cpu().state_dict(), meta={\"NUM_STEPS_CURRENT_ROUND\": steps},)\n",
"\n",
" flare.send(output_model)\n",
"```\n",
"\n",
"\n",
"With above steps, just a few lines of code changes, no code structural changes, we converted the pytorch deep learning code to federated learning with NVIDIA FLARE\n",
"With above steps, just a few lines of code changes, no code structural changes, we converted the PyTorch deep learning code to federated learning with NVIDIA FLARE.\n",
"\n",
"The complete code can be found at client.py"
]
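Steps 4 to 6 in this cell can be pictured end to end with a small in-memory simulation. This is only a sketch under stated assumptions: `Model`, `local_train`, `client_round`, `aggregate`, and `run_simulation` are hypothetical stand-ins that do not call NVIDIA FLARE, the site names and target values are made up, and the step-weighted average is one plausible way an aggregator could use the `NUM_STEPS_CURRENT_ROUND` metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Params = Dict[str, float]


@dataclass
class Model:
    """Hypothetical stand-in for the model object exchanged each round."""
    params: Params
    meta: Dict[str, int] = field(default_factory=dict)
    current_round: Optional[int] = None


def local_train(params: Params, target: float, steps: int, lr: float = 0.5) -> Params:
    """Toy training: move every parameter toward this client's target value."""
    updated = dict(params)
    for _ in range(steps):
        updated = {name: value - lr * (value - target) for name, value in updated.items()}
    return updated


def client_round(input_model: Model, target: float, steps: int) -> Model:
    """Receive the global model, train locally, return params plus metadata."""
    new_params = local_train(input_model.params, target, steps)
    return Model(params=new_params, meta={"NUM_STEPS_CURRENT_ROUND": steps})


def aggregate(results: List[Model]) -> Params:
    """Average client params, weighted by the number of local steps."""
    total_steps = sum(m.meta["NUM_STEPS_CURRENT_ROUND"] for m in results)
    return {
        name: sum(m.params[name] * m.meta["NUM_STEPS_CURRENT_ROUND"] for m in results) / total_steps
        for name in results[0].params
    }


def run_simulation(rounds: int = 5) -> Params:
    global_params: Params = {"w": 0.0}
    client_targets = {"site-1": 1.0, "site-2": 3.0}  # made-up local optima
    for current_round in range(rounds):
        results = [
            client_round(Model(global_params, current_round=current_round), target, steps=1)
            for target in client_targets.values()
        ]
        global_params = aggregate(results)
    return global_params


if __name__ == "__main__":
    print(run_simulation(rounds=5))
```

With two clients pulling toward 1.0 and 3.0, the global parameter drifts toward their midpoint over the rounds, which mirrors how repeated receive, train, and send cycles let the global model reflect all sites.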
@@ -140,19 +145,19 @@
"id": "7f1824bf",
"metadata": {},
"source": [
"Now, we converted the client pytorch training script to federated learning code. Lets look further to handle multi-task client code\n",
"Now, we converted the client PyTorch training script to federated learning code. Let's look further to handle multi-task client code.\n",
"\n",
"\n",
"## Multi-Task Client Scripts\n",
"\n",
"So far, the client only handles traing, regardless what tasks the server issues to the clients. What if there are many tasks ? Client should take different actions based on the different tasks. Also, in previous version, we did not evaluate the global model. We are also to handle all these in this section. \n",
"So far, the client only handles training, regardless of what tasks the server issues to the clients. What if there are many tasks? Client should take different actions based on the different tasks. Also, in the previous version, we did not evaluate the global model. We are going to handle all these in this section.\n",
"\n",
"\n",
"In Flare's Client API, by detault, we will issue three different tasks: \"train\", \"evaluate\" and \"submit_model\"\n",
"In Flare's Client API, by default, we will issue three different tasks: \"train\", \"evaluate\" and \"submit_model\"\n",
"\n",
"These three tasks can be checked by \n",
"\n",
"```\n",
"```python\n",
"\n",
"flare.is_train()\n",
"\n",
@@ -162,7 +167,7 @@
"\n",
"```\n",
"\n",
"So we need to motify our existing training code to have both training and evaluation logics\n",
"So we need to modify our existing training code to have both training and evaluation logics.\n",
"\n",
"### Training logics changes\n",
"\n",
@@ -171,23 +176,25 @@
"\n",
"evaluate the local model: \n",
"\n",
"```\n",
"```python\n",
" # (5.2) evaluation on local trained model to save best model\n",
" local_accuracy = evaluate(net.state_dict())\n",
"\n",
"\n",
"```\n",
"\n",
"evalute the global model received \n",
"evaluate the global model received:\n",
"\n",
"```python\n",
"\n",
"```\n",
" # (5.3) evaluate on received model for model selection\n",
" accuracy = evaluate(input_model.params)\n",
"```\n",
"\n",
"Then add the global model accuracy into the metrics parameter of the FLModel before send it back to server. \n",
"\n",
"```\n",
"```python\n",
"\n",
" output_model = flare.FLModel(\n",
" params=net.cpu().state_dict(),\n",
" metrics={\"accuracy\": accuracy},\n",
@@ -201,7 +208,7 @@
Note: the evaluate()">
">Note: the evaluate() function will be discussed next\n",
"\n",
"\n",
"```\n",
"```python\n",
" \n",
"\n",
" # (5.2) evaluation on local trained model to save best model\n",
@@ -235,7 +242,7 @@
"The return value is accuracy percentage. \n",
"\n",
"\n",
"```\n",
"```python\n",
"\n",
" # wraps evaluation logic into a method to re-use for\n",
" # evaluation on both trained and received model\n",
@@ -270,7 +277,8 @@
"\n",
"The overall logics becomes\n",
"\n",
"```\n",
"```python\n",
"\n",
"if flare.is_training(): \n",
" traing and evaluate metrics\n",
" send model and merics back\n",