diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/04.2.0_introduction.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/04.2.0_introduction.ipynb deleted file mode 100644 index d973d21d6a..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/04.2.0_introduction.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "d3ef5590-c631-4aed-b892-999dd4a3eb5f", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/introduction.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/introduction.ipynb new file mode 100644 index 0000000000..09602efb77 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.0_introduction/introduction.ipynb @@ -0,0 +1,106 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1e333915-6759-4d82-9f83-2bccc42ca047", + "metadata": {}, + "source": [ + "# Introduction: Federated Language Models - from NLP to LLM" + ] + }, + { + "cell_type": "markdown", + "id": "c96f5007-8c62-4d61-bd97-6b31a3f5b0db", + "metadata": {}, + "source": [ + "In this chapter, we will explore the federated learning applications on language models.\n", + "\n", + "Natural Language Processing (NLP) is a subfield of artificial intelligence, focuses on enabling computers to process and analyze natural language data. Recently, Large Language Models (LLMs) have emerged as a transformative force in the field of NLP, enabling AI to understand, generate, and interact with human language at an unprecedented scale. Models such as BERT and GPT is able to leverage vast amounts of text data and deep learning techniques to perform various linguistic tasks, including text generation, translation, summarization, and question-answering.\n", + "\n", + "The development of LLMs relies on robust training schemes that enable these models to capture linguistic structures, contextual dependencies, and semantic meanings. Common training methodologies include unsupervised pretraining on large text corpora, followed by further fine-tuning using supervised (supervised finetuning - SFT) or reinforcement learning (reinforcement learning from human feedback - RLHF) approaches, refining their capabilities for practical applications with human interactions.\n", + "\n", + "Further, when adapting to a particular downstream task, instead of making updates to all model parameters as SFT/RLHF which can be computationally expensive and memory-intensive, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a more efficient approach. 
Techniques such as Low-Rank Adaptation (LoRA), P-Tuning, and Adapter Layers enable fine-tuning by updating only a small subset of parameters, significantly reducing computational costs while maintaining performance. " + ] + }, + { + "cell_type": "markdown", + "id": "4fb8b73a-5058-44fe-8ecc-e355ef7178fb", + "metadata": {}, + "source": [ + "In the following sections, we will start with federated learning using a smaller-scale BERT model, then we extend our study to more recent open-source LLMs and their SFT and PEFT in a federated finetuning scheme. And finally to address a major challenge in federated LLM training - communication efficiency, we further visit potential solutions including quantization and streaming, and we will conclude with a recap of the covered topics." + ] + }, + { + "cell_type": "markdown", + "id": "9bffcb04-6839-4463-b4a5-e1792024adce", + "metadata": {}, + "source": [ + "8.1. **Federated BERT**\n", + "\n", + "Task-specific model training with BERT in a federated setting\n", + "* [Federated NLP with BERT Model](../08.1_fed_bert/federated_nlp_with_bert.ipynb)\n", + "\n", + "8.2. **Federated LLM Training with SFT**\n", + "\n", + "Supervised Fine-Tuning and its role in adapting LLMs in federated learning\n", + "* [Federated LLM Tuning with SFT](../08.2_llm_sft/LLM_SFT.ipynb)\n", + "\n", + "8.3. **Federated LLM Training with PEFT**\n", + "\n", + "Importance of PEFT in adapting LLMs for specific tasks, which can be achieve in a federated setting\n", + "* [Federated LLM Tuning with PEFT](../08.3_llm_peft/LLM_PEFT.ipynb)\n", + "\n", + "8.4. **Model Transmission with Quantization**\n", + "\n", + "One major hurdle of adapting LLMs in federated learning is the significant communication burden when performing federated SFT. To reduce the message size, quantization method can be applied as filters.\n", + "* [Model Quantization for Transmission](../08.4_llm_quantization/LLM_quantization.ipynb)\n", + "\n", + "8.5 **Model Transmission with Streaming**\n", + "\n", + "While quantization reduced communication cost, system memory requirement is still high for prepareing the message on either side. Therefore, we enabled streaming capabilities for more efficient and robust model communication.\n", + "* [Message Streaming for Model Transmission](../08.5_llm_streaming/LLM_streaming.ipynb)\n", + "\n", + "8.6. 
**Recap**\n", + "\n", + "[Recap](../08.6_recap/recap.ipynb) for federated LLM applications and features" + ] + }, + { + "cell_type": "markdown", + "id": "f8864f21-ce74-4adf-8b5b-240879424424", + "metadata": {}, + "source": [ + "Let's get started with [Federated NLP with BERT Model](../08.1_fed_bert/federated_nlp_with_bert.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ceaab09-f41e-41e4-8ecd-a784328b468a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/ner_model_test.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/ner_model_test.py new file mode 100644 index 0000000000..3bed4fdf20 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/ner_model_test.py @@ -0,0 +1,96 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os + +import pandas as pd +import torch +from seqeval.metrics import classification_report +from src.data_sequence import DataSequence +from src.nlp_models import BertModel, GPTModel +from torch.utils.data import DataLoader + +os.environ["TOKENIZERS_PARALLELISM"] = "False" + + +def data_split_args_parser(): + parser = argparse.ArgumentParser(description="Perform model testing by loading the best global model") + parser.add_argument("--data_path", type=str, help="Path to data file") + parser.add_argument("--model_path", type=str, help="Path to workspace server folder") + parser.add_argument("--num_labels", type=int, help="Number of labels for the candidate dataset") + parser.add_argument("--model_name", type=str, default="bert-base-uncased", help="Model name") + return parser + + +if __name__ == "__main__": + parser = data_split_args_parser() + args = parser.parse_args() + device = torch.device("cuda") + + model_path = args.model_path + data_path = args.data_path + num_labels = args.num_labels + model_name = args.model_name + ignore_token = -100 + + df_test = pd.read_csv(os.path.join(data_path, "test.csv")) + # label and id conversion + labels = [] + for x in df_test["labels"].values: + labels.extend(x.split(" ")) + unique_labels = set(labels) + labels_to_ids = {k: v for v, k in enumerate(sorted(unique_labels))} + ids_to_labels = {v: k for v, k in enumerate(sorted(unique_labels))} + + # model + if model_name == "bert-base-uncased": + model = BertModel(model_name=model_name, num_labels=num_labels).to(device) + elif model_name == "gpt2": + model = GPTModel(model_name=model_name, num_labels=num_labels).to(device) + else: + raise ValueError("model not supported") + model_weights = torch.load(os.path.join(model_path, "best_FL_global_model.pt")) + model.load_state_dict(state_dict=model_weights["model"]) + tokenizer = model.tokenizer + + # data + test_dataset = DataSequence(df_test, labels_to_ids, tokenizer=tokenizer, ignore_token=ignore_token) + test_loader = DataLoader(test_dataset, num_workers=4, batch_size=64, shuffle=False) + + # validate + model.eval() + with torch.no_grad(): + total_acc_test, total_loss_test, test_total = 0, 0, 0 + test_y_pred, test_y_true = [], [] + for test_data, test_label in test_loader: + test_label = test_label.to(device) + test_total += test_label.shape[0] + mask = test_data["attention_mask"].squeeze(1).to(device) + input_id = test_data["input_ids"].squeeze(1).to(device) + loss, logits = model(input_id, mask, test_label) + + for i in range(logits.shape[0]): + # remove pad tokens + logits_clean = logits[i][test_label[i] != ignore_token] + label_clean = test_label[i][test_label[i] != ignore_token] + # calcluate acc and store prediciton and true labels + predictions = logits_clean.argmax(dim=1) + acc = (predictions == label_clean).float().mean() + total_acc_test += acc.item() + test_y_pred.append([ids_to_labels[x.item()] for x in predictions]) + test_y_true.append([ids_to_labels[x.item()] for x in label_clean]) + # metric summary + summary = classification_report(y_true=test_y_true, y_pred=test_y_pred, zero_division=0) + print(summary) diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/nlp_fl_job.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/nlp_fl_job.py new file mode 100644 index 0000000000..79bfee4606 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/nlp_fl_job.py @@ -0,0 +1,84 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from src.nlp_models import BertModel, GPTModel + +from nvflare.app_common.widgets.intime_model_selector import IntimeModelSelector +from nvflare.app_common.workflows.fedavg import FedAvg +from nvflare.app_opt.pt.job_config.model import PTModel +from nvflare.job_config.api import FedJob +from nvflare.job_config.script_runner import ScriptRunner + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name", + type=str, + default="Bert", + help="Which model to choose, either Bert or GPT", + ) + return parser.parse_args() + + +def main(): + args = define_parser() + model_name = args.model_name + + # Create the FedJob + if model_name.lower() == "bert": + num_clients = 4 + job = FedJob(name="Bert", min_clients=num_clients) + train_model_name = "bert-base-uncased" + model = PTModel(BertModel(num_labels=3, model_name=train_model_name)) + output_path = "Bert" + elif model_name.lower() == "gpt": + num_clients = 2 + job = FedJob(name="GPT", min_clients=num_clients) + train_model_name = "gpt2" + model = PTModel(GPTModel(num_labels=3, model_name=train_model_name)) + output_path = "GPT" + else: + raise ValueError(f"Invalid model_name: {model_name}, only Bert and GPT are supported.") + + # Local training parameters + num_rounds = 5 + dataset_path = f"/tmp/nvflare/dataset/nlp_ner/{num_clients}_split" + train_script = "src/nlp_fl.py" + train_args = f"--dataset_path {dataset_path} --model_name {train_model_name}" + + # Define the controller workflow and send to server + controller = FedAvg( + num_clients=num_clients, + num_rounds=num_rounds, + ) + job.to_server(controller) + + # Define the initial global model and send to server + job.to_server(model) + job.to(IntimeModelSelector(key_metric="eval_acc"), "server") + + # Add executor to clients + executor = ScriptRunner(script=train_script, script_args=train_args) + job.to_clients(executor) + + # Export job config and run the job + job.export_job("/tmp/nvflare/workspace/jobs/") + job.simulator_run(f"/tmp/nvflare/workspace/works/{output_path}", n_clients=num_clients, gpu="0") + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/prepare_data.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/prepare_data.sh new file mode 100755 index 0000000000..53eba5f43f --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/prepare_data.sh @@ -0,0 +1,4 @@ +#!/usr/bin/env bash +DATASET_ROOT=${1} +echo "4-client" +python3 code/utils/data_split.py --data_path 
${DATASET_ROOT} --num_clients 4 --random_seed 0 --site_name_prefix 'site-' \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/requirements.txt b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/requirements.txt new file mode 100644 index 0000000000..b7340693d5 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/requirements.txt @@ -0,0 +1,6 @@ +torch +torchvision +tensorboard +transformers +pandas +seqeval diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/data_sequence.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/data_sequence.py new file mode 100644 index 0000000000..f488affc73 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/data_sequence.py @@ -0,0 +1,83 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch + + +def align_label( + texts_encoded, + labels_raw, + labels_to_ids, + ignore_token, +): + # generate label id vector for the network + # mark the tokens to be ignored + labels_aligned = [] + # single sentence each time, so always use 0 index + # get the index mapping from token to word + # this can be dependent on the specific tokenizer + word_ids = texts_encoded.word_ids(batch_index=0) + previous_word_idx = None + for word_idx in word_ids: + if word_idx is None: + # set None the ignore tokens + labels_aligned.append(ignore_token) + elif word_idx != previous_word_idx: + # only label the first token of a word + labels_aligned.append(labels_to_ids[labels_raw[word_idx]]) + else: + labels_aligned.append(ignore_token) + previous_word_idx = word_idx + return labels_aligned + + +class DataSequence(torch.utils.data.Dataset): + def __init__(self, df, labels_to_ids, tokenizer, ignore_token=-100, max_length=150): + # Raw texts and corresponding labels + texts_batch_raw = [i.split(" ") for i in df["text"].values.tolist()] + labels_batch_raw = [i.split(" ") for i in df["labels"].values.tolist()] + # Iterate through all cases + self.texts = [] + self.labels = [] + for batch_idx in range(len(texts_batch_raw)): + texts_raw = texts_batch_raw[batch_idx] + labels_raw = labels_batch_raw[batch_idx] + # Encode texts with tokenizer + texts_encoded = tokenizer.encode_plus( + texts_raw, + padding="max_length", + max_length=max_length, + add_special_tokens=True, + truncation=True, + is_split_into_words=True, + return_attention_mask=True, + return_tensors="pt", + ) + labels_aligned = align_label(texts_encoded, labels_raw, labels_to_ids, ignore_token) + self.texts.append(texts_encoded) + self.labels.append(labels_aligned) + + def __len__(self): + return len(self.labels) + + def get_batch_data(self, idx): + return self.texts[idx] + + def get_batch_labels(self, idx): + return torch.LongTensor(self.labels[idx]) + + def __getitem__(self, idx): + batch_data = self.get_batch_data(idx) + batch_labels = self.get_batch_labels(idx) + return batch_data, batch_labels diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_fl.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_fl.py new file mode 100644 index 0000000000..81eb856a3d --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_fl.py @@ -0,0 +1,193 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os + +import pandas as pd +import torch +from data_sequence import DataSequence +from nlp_models import BertModel, GPTModel +from seqeval.metrics import classification_report +from torch.optim import AdamW +from torch.utils.data import DataLoader +from torch.utils.tensorboard import SummaryWriter + +# import nvflare client API +import nvflare.client as flare + +# (optional) We change to use GPU to speed things up. +# if you want to use CPU, change DEVICE="cpu" +DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, nargs="?") + parser.add_argument("--batch_size", type=int, default=16, nargs="?") + parser.add_argument("--learning_rate", type=float, default=1e-5, nargs="?") + parser.add_argument("--num_workers", type=int, default=1, nargs="?") + parser.add_argument("--local_epochs", type=int, default=1, nargs="?") + parser.add_argument("--model_name", type=str, default="bert-base-uncased", nargs="?") + parser.add_argument("--num_labels", type=int, default=3, nargs="?") + parser.add_argument("--ignore_token", type=int, default=-100, nargs="?") + return parser.parse_args() + + +def get_labels(df_train, num_labels): + labels = [] + for x in df_train["labels"].values: + labels.extend(x.split(" ")) + unique_labels = set(labels) + # check label length + if len(unique_labels) != num_labels: + raise ValueError(f"num_labels {num_labels} need to align with dataset, actual data {len(unique_labels)}!") + labels_to_ids = {k: v for v, k in enumerate(sorted(unique_labels))} + ids_to_labels = {v: k for v, k in enumerate(sorted(unique_labels))} + return labels_to_ids, ids_to_labels + + +def main(): + # define local parameters + args = define_parser() + dataset_path = args.dataset_path + batch_size = args.batch_size + lr = args.learning_rate + num_workers = args.num_workers + local_epochs = args.local_epochs + model_name = args.model_name + num_labels = args.num_labels + ignore_token = args.ignore_token + + # Initializes NVFlare client API and get site_name from flare + flare.init() + site_name = flare.get_site_name() + + # load data + df_train = pd.read_csv(os.path.join(dataset_path, site_name + "_train.csv")) + df_valid = pd.read_csv(os.path.join(dataset_path, site_name + "_val.csv")) + labels_to_ids, ids_to_labels = get_labels(df_train, num_labels) + + # training components + writer = SummaryWriter("./") + if model_name == "bert-base-uncased": + model = BertModel(model_name=model_name, num_labels=num_labels) + elif model_name == "gpt2": + model = GPTModel(model_name=model_name, num_labels=num_labels) + else: + raise ValueError(f"Model {model_name} not supported!") + tokenizer = model.tokenizer + train_dataset = DataSequence(df_train, labels_to_ids, tokenizer=tokenizer, ignore_token=ignore_token) + valid_dataset = DataSequence(df_valid, labels_to_ids, tokenizer=tokenizer, ignore_token=ignore_token) + train_loader = DataLoader(train_dataset, num_workers=num_workers, batch_size=batch_size, shuffle=True) + valid_loader = DataLoader(valid_dataset, num_workers=num_workers, batch_size=batch_size, shuffle=False) + print(f"Training Size: {len(train_loader.dataset)}, Validation Size: {len(valid_loader.dataset)}") + optimizer = AdamW(model.parameters(), lr=lr) + local_model_file = "local_model.pt" + best_global_model_file = "best_global_model_file.pt" + best_acc = 0.0 + + # Train federated rounds + # start with 
global model at the beginning of each round + while flare.is_running(): + # receive FLModel from NVFlare + global_model = flare.receive() + curr_round = global_model.current_round + epoch_global = local_epochs * curr_round + print(f"({site_name}) current_round={curr_round + 1}/{global_model.total_rounds}") + + # load global model from NVFlare + model.load_state_dict(global_model.params) + model.to(DEVICE) + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(tb_id): + model.eval() + with torch.no_grad(): + total_acc_val, total_loss_val, val_total = 0, 0, 0 + val_y_pred, val_y_true = [], [] + for val_data, val_label in valid_loader: + val_label = val_label.to(DEVICE) + val_total += val_label.shape[0] + mask = val_data["attention_mask"].squeeze(1).to(DEVICE) + input_id = val_data["input_ids"].squeeze(1).to(DEVICE) + # Inference + loss, logits = model(input_id, mask, val_label) + # Add items for metric computation + for i in range(logits.shape[0]): + # remove pad tokens + logits_clean = logits[i][val_label[i] != ignore_token] + label_clean = val_label[i][val_label[i] != ignore_token] + # calcluate acc and store prediciton and true labels + predictions = logits_clean.argmax(dim=1) + acc = (predictions == label_clean).float().mean() + total_acc_val += acc.item() + val_y_pred.append([ids_to_labels[x.item()] for x in predictions]) + val_y_true.append([ids_to_labels[x.item()] for x in label_clean]) + # compute metric + metric_dict = classification_report( + y_true=val_y_true, y_pred=val_y_pred, output_dict=True, zero_division=0 + ) + # tensorboard record id prefix, add to record if provided + writer.add_scalar(tb_id + "_precision", metric_dict["macro avg"]["precision"], epoch_global) + writer.add_scalar(tb_id + "_recall", metric_dict["macro avg"]["recall"], epoch_global) + writer.add_scalar(tb_id + "_f1-score", metric_dict["macro avg"]["f1-score"], epoch_global) + return metric_dict["macro avg"]["f1-score"] + + # evaluate on received global model + val_acc = evaluate("global_val_acc") + if val_acc > best_acc: + best_acc = val_acc + torch.save(model.state_dict(), best_global_model_file) + + # train local model + epoch_len = len(train_loader) + for epoch in range(local_epochs): + model.train() + print(f"Local epoch {site_name}: {epoch + 1}/{local_epochs} (lr={lr})") + + for i, batch_data in enumerate(train_loader): + mask = batch_data[0]["attention_mask"].squeeze(1).to(DEVICE) + input_id = batch_data[0]["input_ids"].squeeze(1).to(DEVICE) + train_label = batch_data[1].to(DEVICE) + # model output + loss, logits = model(input_id, mask, train_label) + # optimize + optimizer.zero_grad() + loss.backward() + optimizer.step() + # record loss + current_step = epoch_len * epoch_global + i + writer.add_scalar("train_loss", loss.item(), current_step) + + # evaluation on local trained model + val_acc_local = evaluate("local_val_acc") + torch.save(model.state_dict(), local_model_file) + + # construct trained FL model + output_model = flare.FLModel( + params=model.cpu().state_dict(), + metrics={"eval_acc": val_acc_local}, + meta={"NUM_STEPS_CURRENT_ROUND": epoch_len * local_epochs}, + ) + + # send model back to NVFlare + flare.send(output_model) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_models.py 
b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_models.py new file mode 100755 index 0000000000..82e97e662d --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/src/nlp_models.py @@ -0,0 +1,52 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +from transformers import AutoModelForTokenClassification, AutoTokenizer + + +class BertModel(torch.nn.Module): + def __init__(self, model_name, num_labels): + super(BertModel, self).__init__() + self.num_labels = num_labels + self.model_name = model_name + self.model = AutoModelForTokenClassification.from_pretrained( + self.model_name, num_labels=self.num_labels, output_attentions=False, output_hidden_states=False + ) + self.tokenizer = AutoTokenizer.from_pretrained(self.model_name) + + def forward(self, input_id, mask, label): + output = self.model(input_ids=input_id, attention_mask=mask, labels=label, return_dict=False) + return output + + +class GPTModel(torch.nn.Module): + def __init__(self, model_name, num_labels): + super(GPTModel, self).__init__() + self.num_labels = num_labels + self.model_name = model_name + self.model = AutoModelForTokenClassification.from_pretrained( + self.model_name, + num_labels=self.num_labels, + output_attentions=False, + output_hidden_states=False, + ) + self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, add_prefix_space=True) + self.tokenizer.pad_token = self.tokenizer.eos_token + self.model.config.pad_token_id = self.model.config.eos_token_id + self.model.resize_token_embeddings(len(self.tokenizer)) + + def forward(self, input_id, mask, label): + output = self.model(input_ids=input_id, attention_mask=mask, labels=label, return_dict=False) + return output diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/test_global_model.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/test_global_model.sh new file mode 100755 index 0000000000..49815811f2 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/test_global_model.sh @@ -0,0 +1,4 @@ +#!/usr/bin/env bash +DATASET_ROOT=${1} +echo "BERT" +python3 ner_model_test.py --model_path "/tmp/nvflare/workspace/works/Bert/server/simulate_job/app_server/" --model_name "bert-base-uncased" --data_path ${DATASET_ROOT} --num_labels 3 \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/utils/data_split.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/utils/data_split.py 
new file mode 100644 index 0000000000..7cdd0c33b7 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/code/utils/data_split.py @@ -0,0 +1,66 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import pandas as pd + + +def data_split_args_parser(): + parser = argparse.ArgumentParser(description="Generate data split for dataset") + parser.add_argument("--data_path", type=str, help="Path to data file") + parser.add_argument("--num_clients", type=int, help="Total number of clients") + parser.add_argument("--random_seed", type=int, help="Random seed") + parser.add_argument("--site_name_prefix", type=str, default="site-", help="Site name prefix") + return parser + + +def split_df_by_num(df, num=1): + df_len = df.shape[0] + df_1_len = num + idx = list(range(df_len)) + np.random.shuffle(idx) + df_1 = df.iloc[idx[:df_1_len]] + df_2 = df.iloc[idx[df_1_len:]] + df_1.reset_index(drop=True, inplace=True) + df_2.reset_index(drop=True, inplace=True) + return df_1, df_2 + + +def main(): + parser = data_split_args_parser() + args = parser.parse_args() + num_clients = args.num_clients + data_path = args.data_path + site_name_prefix = args.site_name_prefix + np.random.seed(args.random_seed) + for mode in ["train", "dev"]: + saved_name = "val" if mode == "dev" else mode + df = pd.read_csv(os.path.join(data_path, mode + ".csv")) + client_size = int(df.shape[0] / num_clients) + os.makedirs(f"{data_path}/{num_clients}_split", exist_ok=True) + for i in range(num_clients): + if i != num_clients - 1: + client_df, df = split_df_by_num(df, client_size) + else: + client_df = df + print(df.shape, client_df.shape) + # split into train, test, val + client_df.to_csv(f"{data_path}/{num_clients}_split/{site_name_prefix}{i + 1}_{saved_name}.csv") + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/federated_nlp_with_bert.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/federated_nlp_with_bert.ipynb new file mode 100644 index 0000000000..236c1a8243 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/federated_nlp_with_bert.ipynb @@ -0,0 +1,234 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "59e85865-d59b-4809-9ae0-ca5260df37bc", + "metadata": {}, + "source": [ + "# Federated NLP with BERT Model" + ] + }, + { + "cell_type": "markdown", + "id": "466606e1-7f8f-45e4-bed1-c136d89258c1", + "metadata": {}, + "source": [ + "## Introduction \n", + "In this example, we show how to use [NVIDIA FLARE](https://nvidia.github.io/NVFlare) for a Natural Language Processing (NLP) task using [BERT](https://github.com/google-research/bert) model from 
[Hugging Face](https://huggingface.co/). We select [BERT-base-uncased](https://huggingface.co/bert-base-uncased) as our base model. " + ] + }, + { + "cell_type": "markdown", + "id": "022ca051-add4-474b-9637-24c16040d7b6", + "metadata": {}, + "source": [ + "## Setup\n", + "Install required packages for training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee1d1f87-502a-4c30-aa40-f55ae65a1da7", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r code/requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "734aa5da-3e4e-4c22-a0d2-b8ee6b6be142", + "metadata": {}, + "source": [ + "## Download Data \n", + "The raw data can be accessed from [official page](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/). \n", + "In this example, we use the preprocessed csv-files from the reference repo above, which can be downloaded [here](https://drive.google.com/drive/folders/13wROtEAnMgWpLMIGHB5CY1BQ1Xe2XqhG). \n", + "\n", + "In the following, we download three files `train.csv`, `dev.csv`, and `test.csv` and save them to `/tmp/nvflare/dataset/nlp_ner`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "efd478cb-1565-4283-a4bb-87f15585932a", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "mkdir -p /tmp/nvflare/dataset/nlp_ner\n", + "wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YWGBElsqj5ENsW0PtYwMlk_ShBt8MsLD' -O /tmp/nvflare/dataset/nlp_ner/dev.csv\n", + "wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=12kXGQPW-do-F7T-TLGycl0DCw6eQIaZc' -O /tmp/nvflare/dataset/nlp_ner/test.csv\n", + "wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1fjsf0jFKWu_-bbx236oB6e7DqOqGmw3y' -O /tmp/nvflare/dataset/nlp_ner/train.csv" + ] + }, + { + "cell_type": "markdown", + "id": "6fd3f038-9b4c-416e-abeb-132625e7fefa", + "metadata": {}, + "source": [ + "## Data Preprocessing \n", + "We then use the preprocessed data to generate random splits for both 4-client and 2-client experiments. \n", + "Please modify the `DATASET_ROOT` below to point to folder containing the four downloaded csv-files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "459c5dc6-c423-4d80-9577-add2d062c326", + "metadata": {}, + "outputs": [], + "source": [ + "! code/prepare_data.sh /tmp/nvflare/dataset/nlp_ner" + ] + }, + { + "cell_type": "markdown", + "id": "5f9186c7-1eb7-4eec-bfcd-ec64d12ceaf7", + "metadata": {}, + "source": [ + "The expected output is\n", + "```\n", + "4-client\n", + "(7594, 5) (2531, 5)\n", + "(5063, 5) (2531, 5)\n", + "(2532, 5) (2531, 5)\n", + "(2532, 5) (2532, 5)\n", + "(950, 5) (316, 5)\n", + "(634, 5) (316, 5)\n", + "(318, 5) (316, 5)\n", + "(318, 5) (318, 5)\n", + "```\n", + "The task here is to categorize each word in the text into three classes specified by the label. For example, the sentence \n", + "`Recent progress has resulted in part of the gene mutated in Duchenne and the milder Becker muscular dystrophies being cloned and has suggested that the gene itself extends over 1 , 000 to 2 , 000 kilobases ( kb ) .` into label vector `O O O O O O O O O O O B I I I I I I O O O O O O O O O O O O O O O O O O O O O O O`. `B` marks the beginning of an entity, `I` marks each entity word, and `O` represents other words.\n", + "Let's take a closer look at the word-label correspondence:\n", + "![data sample](./figs/sample.png)\n", + "As shown above, the task is to capture the keywords related to medical findings." 
+ ] + }, + { + "cell_type": "markdown", + "id": "62476dd2-97e7-48ad-908f-6df02f48f86e", + "metadata": {}, + "source": [ + "## Run automated experiments\n", + "We run the federated training on 4 clients for BERT model using NVFlare Simulator via [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html). To save time, we only run 5 rounds of fedrated training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f48f1d5f-e656-4f71-b925-94035c60ace0", + "metadata": {}, + "outputs": [], + "source": [ + "%cd code\n", + "! python nlp_fl_job.py --model_name Bert\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "id": "0683f68f-63bc-463a-bae2-f17d51fe735c", + "metadata": {}, + "source": [ + "## Results\n", + "### Validation curve on each site\n", + "In this example, each client computes their validation scores using their own\n", + "validation set. We recorded the loss, F1 score, precision, and recall. \n", + "The curves can be viewed with TensorBoard, each training for 50 epochs (50 FL rounds, 1 local epoch per round).\n", + "\n", + "For BERT model, the TensorBoard curves can be visualized:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74a719a9-5dfb-4a8e-a540-08fb05492495", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext tensorboard\n", + "%tensorboard --logdir /tmp/nvflare/workspace/works/Bert/" + ] + }, + { + "cell_type": "markdown", + "id": "f1a42d7c-641f-45f3-92fc-367264cae669", + "metadata": {}, + "source": [ + "### Testing score\n", + "The testing score is computed for the global model over the testing set.\n", + "We provide a script for performing validation on testing data. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c142aa0-4502-4108-9b4d-462050d37a64", + "metadata": {}, + "outputs": [], + "source": [ + "%cd code\n", + "! sh test_global_model.sh /tmp/nvflare/dataset/nlp_ner\n", + "%cd .." + ] + }, + { + "cell_type": "markdown", + "id": "1e32473e-2338-4176-b25d-ec7976f8440d", + "metadata": {}, + "source": [ + "The test results are:\n", + "```\n", + "BERT\n", + " precision recall f1-score support\n", + "\n", + " _ 0.83 0.92 0.87 1255\n", + "\n", + " micro avg 0.83 0.92 0.87 1255\n", + " macro avg 0.83 0.92 0.87 1255\n", + "weighted avg 0.83 0.92 0.87 1255\n", + "```\n", + "Note that training is not deterministic so the numbers can have some variations." + ] + }, + { + "cell_type": "markdown", + "id": "8076c89e-39b2-44f3-903c-fc9ff8446d67", + "metadata": {}, + "source": [ + "In this section, we showed how to train a BERT model with standard Pytorch training loop. 
Now let's move on to the next section [LLM Supervised Fine-Tuning (SFT)](../08.2_llm_sft/LLM_SFT.ipynb) where we will see how to utilize existing Trainer scripts via HuggingFace APIs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26111bcd-ed0a-4298-9956-b1e821409197", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/figs/sample.png b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/figs/sample.png new file mode 100644 index 0000000000..eeef9338cb Binary files /dev/null and b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_fed_bert/figs/sample.png differ diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_llm_p_tuning/LLM_prompt_tuning.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_llm_p_tuning/LLM_prompt_tuning.ipynb deleted file mode 100644 index 1b04a051ae..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.1_llm_p_tuning/LLM_prompt_tuning.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "f94cdcee-04ee-4a9e-8182-fe9eeb15671a", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_peft/LLM_PEFT.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_peft/LLM_PEFT.ipynb deleted file mode 100644 index dcf490f775..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_peft/LLM_PEFT.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "78e7b1bd-f0c9-4329-9338-848df841a899", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - 
"nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/LLM_SFT.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/LLM_SFT.ipynb new file mode 100644 index 0000000000..6158429c0e --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/LLM_SFT.ipynb @@ -0,0 +1,231 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2da4b68c-e68b-4245-b5a6-fba66d3af819", + "metadata": {}, + "source": [ + "# LLM Supervised Fine-Tuning (SFT) via HuggingFace Trainer APIs\n", + "In this section, we illustrate how to use [NVIDIA FLARE](https://nvidia.github.io/NVFlare) for Large Language Models (LLMs) SFT task. Unlike the last section [Federated NLP with BERT Model](../08.1_fed_bert/federated_nlp_with_bert.ipynb) where we showed standard Pytorch training logic, it illustrates how to adapt a local training script with [HuggingFace](https://huggingface.co/) Trainer to NVFlare, which is widely used in LLM training.\n", + "\n", + "We show supervised fine-tuning (SFT) using the [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer) from [HuggingFace](https://huggingface.co/), together with the [Llama-3.2-1B model](https://huggingface.co/meta-llama/Llama-3.2-1B) to showcase the functionality of federated SFT, allowing HuggingFace models to be trained and adapted to federated application with NVFlare. All other models from HuggingFace can be easily adapted following the same steps.\n", + "\n", + "We conducted these experiments on a single 48GB RTX 6000 Ada GPU. " + ] + }, + { + "cell_type": "markdown", + "id": "4b50353e-1ad9-419c-8712-187a49879978", + "metadata": {}, + "source": [ + "## Setup\n", + "To use Llama-3.2-1B model, please request access to the model here https://huggingface.co/meta-llama/Llama-3.2-1B and login with an access token using huggingface-cli. Git LFS is also necessary for downloads, please follow the steps in this [link](https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md).\n", + "\n", + "Install required packages for training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aad7de64-ce02-45c6-8718-1bd6c08c91d4", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "a814dedb-6d93-4782-a9b5-68644b901184", + "metadata": {}, + "source": [ + "## Data Preparation\n", + "We use one dataset to illustrate the SFT. We download and preprocess [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2924146-d635-4e4d-bdf6-50b87a035de6", + "metadata": {}, + "outputs": [], + "source": [ + "! git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k /tmp/nvflare/dataset/llm/dolly" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "056d9797-3c18-4310-9f4e-26d345623ec9", + "metadata": {}, + "outputs": [], + "source": [ + "! 
python utils/preprocess_dolly.py --training_file /tmp/nvflare/dataset/llm/dolly/databricks-dolly-15k.jsonl --output_dir /tmp/nvflare/dataset/llm/dolly" + ] + }, + { + "cell_type": "markdown", + "id": "c77f5eff-f88c-42b9-aa61-604f68b8b5ec", + "metadata": {}, + "source": [ + "## Adaptation of Centralized Training Script to Federated\n", + "To illustrate the adaptation process, we use a single dataset with three training epochs. \n", + "### One-call training\n", + "Centralized trainings, as the baseline for comparison with other results, are done with the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32ac443b-25cf-4b45-8876-41ff91bd1ee0", + "metadata": {}, + "outputs": [], + "source": [ + "! python utils/hf_sft_peft.py --output_path /tmp/nvflare/workspace/llm/dolly_cen_sft --train_mode SFT" + ] + }, + { + "cell_type": "markdown", + "id": "35043ab0-fdbc-4cf2-b79d-d532724d9364", + "metadata": {}, + "source": [ + "### Adaptation Step 1: iterative training\n", + "To adapt the centralized training script to federated application, we first need to \"break\" the single call to `trainer.train()` into iterative calls, one for each round of training.\n", + "For this purpose, we provided `utils/hf_sft_peft_iter.py` as an example, which is a modified version of `utils/hf_sft_peft.py`.\n", + "Their differences are highlighted below:\n", + "\n", + "![diff](./figs/diff.png)\n", + "\n", + "Note that the `trainer.train()` call is replaced by a `for` loop, and the three training epochs becomes three rounds, one epoch per round. \n", + "\n", + "This setting (1 epoch per round) is for simplicity of this example. In practice, we can set the number of rounds and local epoch per round according to the needs: e.g. 2 rounds with 2 epochs per round will result in 4 training epochs in total.\n", + "\n", + "At the beginning of each round, we intentionally load a fixed model weights saved at the beginning, over-writing the previous round's saved model weights, then call `trainer.train(resume_from_checkpoint=True)` with `trainer.args.num_train_epochs` incremented by 1 so that previous logging results are not overwritten. \n", + "\n", + "The purpose of doing so is to tell if the intended weights are succesfully loaded at each round. Without using a fixed starting model, even if the model weights are not properly loaded, the training loss curve will still follow the one-call result, which is not what we want to see. \n", + "\n", + "If the intended model weights (serving as the starting point for each round, the \"global model\" for FL use case) is properly loaded, then we shall observe a \"zig-zag\" pattern in the training loss curve. This is because the model weights are reset to the same starting point at the beginning of each round, in contrast to the one-shot centralized training, where the model weights are updated continuously, and the training loss curve should follow an overall decreasing trend.\n", + "\n", + "To run iterative training, we use the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99d9a8ba-4540-43d5-9cec-e867acc4d300", + "metadata": {}, + "outputs": [], + "source": [ + "! python utils/hf_sft_peft_iter.py --output_path /tmp/nvflare/workspace/llm/dolly_cen_sft_iter --train_mode SFT" + ] + }, + { + "cell_type": "markdown", + "id": "0671750e-ea31-4310-974a-ecadb380ff49", + "metadata": {}, + "source": [ + "We can observe the SFT curves with tensorboard shown below. 
As expected, we can see the \"zig-zag\" pattern in the iterative training loss curve." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa413644-630c-4c4b-be72-828202b2a29b", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext tensorboard\n", + "%tensorboard --logdir /tmp/nvflare/workspace/llm" + ] + }, + { + "cell_type": "markdown", + "id": "d7f97580-cc0e-4389-b28a-0b3b07de38a1", + "metadata": {}, + "source": [ + "### Adaptation Step 2: federated with NVFlare\n", + "Once we have the iterative training script ready with \"starting model\" loading capability, it can be easily adapted to a NVFlare trainer by using [Client API](../../hello-world/ml-to-fl/pt/README.md).\n", + "\n", + "The major code modifications are for receiving and returning the global model (replacing the constant one used by iterative training), as shown below:\n", + "\n", + "![diff](./figs/diff_fl_1.png)\n", + "![diff](./figs/diff_fl_2.png)" + ] + }, + { + "cell_type": "markdown", + "id": "7d3c82c6-d6d7-4773-8b00-132510c68adc", + "metadata": {}, + "source": [ + "### Federated Training Results\n", + "We run the federated training on a single client using NVFlare Simulator via [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f5f3351-b1ca-4da3-8416-3e8ad1a75dc6", + "metadata": {}, + "outputs": [], + "source": [ + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft --train_mode SFT " + ] + }, + { + "cell_type": "markdown", + "id": "e6eac7cf-f780-4426-b136-eee693b9b485", + "metadata": {}, + "source": [ + "The SFT curves are shown below. With some training randomness, the two SFT training loss curves (centralized v.s. federated) align with each other. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b3ecf1f-2729-4183-9aa0-23b91645cab6", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext tensorboard\n", + "%tensorboard --logdir /tmp/nvflare/workspace/llm" + ] + }, + { + "cell_type": "markdown", + "id": "0181a317-07a7-4001-bdd3-894c40a1f293", + "metadata": {}, + "source": [ + "Now let's move on to the next section of [LLM Parameter-Efficient Fine-Tuning (PEFT)](../08.3_llm_peft/LLM_PEFT.ipynb)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "762d64f6-cc13-405b-aea5-90cce58a0171", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff.png b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff.png new file mode 100644 index 0000000000..e5d1b5f980 Binary files /dev/null and b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff.png differ diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_1.png b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_1.png new file mode 100644 index 0000000000..4de6721e17 Binary files /dev/null and b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_1.png differ diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_2.png b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_2.png new file mode 100644 index 0000000000..45499b307c Binary files /dev/null and b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/figs/diff_fl_2.png differ diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/requirements.txt b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/requirements.txt new file mode 100644 index 0000000000..4a58cb756e --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/requirements.txt @@ -0,0 +1,7 @@ +torch==2.5.1 +datasets +tensorboard +transformers==4.48.0 +peft==0.14.0 +trl==0.13.0 +bitsandbytes diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/sft_job.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/sft_job.py new file mode 100644 index 0000000000..ad73a4da8c --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/sft_job.py @@ -0,0 +1,194 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +from nvflare import FedJob, FilterType +from nvflare.app_common.widgets.intime_model_selector import IntimeModelSelector +from nvflare.app_common.workflows.fedavg import FedAvg +from nvflare.app_opt.pt.file_model_persistor import PTFileModelPersistor +from nvflare.app_opt.pt.quantization.dequantizor import ModelDequantizor +from nvflare.app_opt.pt.quantization.quantizor import ModelQuantizor +from nvflare.job_config.script_runner import ScriptRunner + + +def main(): + args = define_parser() + train_script = "src/hf_sft_peft_fl.py" + client_ids = args.client_ids + num_clients = len(client_ids) + + if args.threads: + num_threads = args.threads + else: + num_threads = num_clients + + if num_threads < num_clients: + print("The number of threads smaller than the number of clients, runner clean-up will be performed.") + clean_up = 1 + else: + clean_up = 0 + + num_rounds = args.num_rounds + workspace_dir = args.workspace_dir + job_dir = args.job_dir + model_name_or_path = args.model_name_or_path + train_mode = args.train_mode + message_mode = args.message_mode + + # Create the FedJob + if train_mode.lower() == "sft": + job = FedJob(name="llm_hf_sft", min_clients=num_clients) + output_path = "sft" + elif train_mode.lower() == "peft": + job = FedJob(name="llm_hf_peft", min_clients=num_clients) + output_path = "peft" + else: + raise ValueError(f"Invalid train_mode: {train_mode}, only SFT and PEFT are supported.") + + # Define the FedAvg controller workflow and send to server + controller = FedAvg( + num_clients=num_clients, + num_rounds=num_rounds, + ) + job.to(controller, "server") + + if args.quantize_mode: + # If using quantization, add quantize filters. 
+ quantizor = ModelQuantizor(quantization_type=args.quantize_mode) + dequantizor = ModelDequantizor() + job.to(quantizor, "server", tasks=["train"], filter_type=FilterType.TASK_DATA) + job.to(dequantizor, "server", tasks=["train"], filter_type=FilterType.TASK_RESULT) + + # Define the model persistor and send to server + # First send the model to the server + job.to("src/hf_sft_model.py", "server") + # Then send the model persistor to the server + model_args = {"path": "src.hf_sft_model.CausalLMModel", "args": {"model_name_or_path": model_name_or_path}} + job.to(PTFileModelPersistor(model=model_args), "server", id="persistor") + + # Add model selection widget and send to server + job.to(IntimeModelSelector(key_metric="eval_loss", negate_key_metric=True), "server", id="model_selector") + + # Send ScriptRunner to all clients + for i in range(num_clients): + client_id = client_ids[i] + site_name = f"site-{client_id}" + data_path_train = os.path.join(args.data_path, client_id, "training.jsonl") + data_path_valid = os.path.join(args.data_path, client_id, "validation.jsonl") + + script_args = f"--model_name_or_path {model_name_or_path} --data_path_train {data_path_train} --data_path_valid {data_path_valid} --output_path {output_path} --train_mode {train_mode} --message_mode {message_mode} --clean_up {clean_up}" + if message_mode == "tensor": + params_exchange_format = "pytorch" + elif message_mode == "numpy": + params_exchange_format = "numpy" + else: + raise ValueError(f"Invalid message_mode: {message_mode}, only numpy and tensor are supported.") + + runner = ScriptRunner( + script=train_script, + script_args=script_args, + params_exchange_format=params_exchange_format, + launch_external_process=False, + ) + job.to(runner, site_name, tasks=["train"]) + + if args.quantize_mode: + job.to(quantizor, site_name, tasks=["train"], filter_type=FilterType.TASK_RESULT) + job.to(dequantizor, site_name, tasks=["train"], filter_type=FilterType.TASK_DATA) + + # Export the job + print("job_dir=", job_dir) + job.export_job(job_dir) + + # Run the job + print("workspace_dir=", workspace_dir) + print("num_threads=", num_threads) + job.simulator_run(workspace_dir, threads=num_threads, gpu=args.gpu) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--client_ids", + nargs="+", + type=str, + default="", + help="Clinet IDs, used to get the data path for each client", + ) + parser.add_argument( + "--num_rounds", + type=int, + default=3, + help="Number of rounds, default to 3", + ) + parser.add_argument( + "--workspace_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/workdir", + help="work directory, default to '/tmp/nvflare/jobs/llm_hf/workdir'", + ) + parser.add_argument( + "--job_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/jobdir", + help="directory for job export, default to '/tmp/nvflare/jobs/llm_hf/jobdir'", + ) + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + help="model name or path", + ) + parser.add_argument( + "--data_path", + type=str, + default="", + help="root directory for training and validation data", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--quantize_mode", + type=str, + default=None, + help="quantization mode, default to None (no quantization)", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + 
parser.add_argument( + "--threads", + type=int, + help="number of threads to use for FL simulation, default to the number of clients", + ) + parser.add_argument( + "--gpu", + type=str, + default="0", + help="gpu assignments for simulating clients, comma separated, default to single gpu", + ) + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_model.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_model.py new file mode 100755 index 0000000000..fd84c3f06a --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_model.py @@ -0,0 +1,28 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +from transformers import AutoModelForCausalLM + + +class CausalLMModel(torch.nn.Module): + def __init__(self, model_name_or_path): + super(CausalLMModel, self).__init__() + self.model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + ) + + def forward(self, input_id): + output = self.model(input_ids=input_id, return_dict=False) + return output diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_peft_fl.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_peft_fl.py new file mode 100755 index 0000000000..96667151bc --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/src/hf_sft_peft_fl.py @@ -0,0 +1,250 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
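+
+# Overview (summary comment; the code below is authoritative): in each federated
+# round the NVFlare Client API receives the current global model, evaluates it
+# locally, resumes HuggingFace Trainer training from the last local checkpoint
+# with the global weights swapped in, and sends the updated weights (full model
+# for SFT, LoRA adapter for PEFT) together with the evaluation loss back to the
+# server.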
+ +import argparse +import copy +import os + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, set_peft_model_state_dict, utils +from transformers import AutoModelForCausalLM, trainer_utils +from trl import SFTConfig, SFTTrainer + +import nvflare.client as flare + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. ### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="./dataset/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="./dataset/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_federated/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + parser.add_argument("--local_epoch", type=int, default=1) + parser.add_argument("--clean_up", type=int, default=0) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=args.local_epoch, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + 
save_total_limit=2, + # safetensors has some issues in saving lm_head.weight, disable it for now + save_safetensors=False, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # initializes NVFlare client API + flare.init() + + # Train federated rounds + # start with global model at the beginning of each round + while flare.is_running(): + # receives FLModel from NVFlare + input_model = flare.receive() + curr_round = input_model.current_round + print(f"current_round={curr_round}") + + # Update the key name received from global model if using model def file + global_model = copy.deepcopy(input_model.params) + for key in list(global_model.keys()): + global_model[key.replace("model.", "", 1)] = global_model.pop(key) + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights, mode): + # Special load func for PEFT + if train_mode: + set_peft_model_state_dict(trainer.model, input_weights) + else: + trainer.model.load_state_dict(input_weights) + metric_score = trainer.evaluate() + print(f"Evaluation metric score: {metric_score}") + return metric_score + + # evaluate on received global model + eval_loss = evaluate(global_model, train_mode) + eval_loss = float(eval_loss["eval_loss"]) + + # Load global model and previous training states + # Since we perform iterative training by using "resume" functionality + # we need to replace the resume weights with global weights every round + if curr_round == 0: + # First round, start from pretrained model + trainer.train() + else: + # replace local resume weights with global weights + resume_from_checkpoint_folder = trainer_utils.get_last_checkpoint(trainer.args.output_dir) + if train_mode: + # PEFT model small, directly save via torch.save + resume_model_file_path = os.path.join(resume_from_checkpoint_folder, utils.WEIGHTS_NAME) + torch.save(global_model, resume_model_file_path) + else: + # SFT model can be large, save via HF API + # Disable safetensor for now + trainer.model.save_pretrained(resume_from_checkpoint_folder, safe_serialization=False) + # increment num_train_epochs so that the trainer will continue training + if args.clean_up: + # runner got cleaned up, set num_train_epochs with curr_round + trainer.args.num_train_epochs = (curr_round + 1) * args.local_epoch + else: + # runner still alive, increment num_train_epochs with local_epoch + trainer.args.num_train_epochs += args.local_epoch + print(f"Increment num_train_epochs to {trainer.args.num_train_epochs}") + # continue training + trainer.train(resume_from_checkpoint=True) + + # compose output model to send back to server + if train_mode: + # PEFT, load PEFT part from trainer model + out_param = get_peft_model_state_dict(trainer.model) + else: + # SFT, load whole model state_dict + out_param = trainer.model.state_dict() + + # update the key name sent to global model + if not train_mode: + for key in list(out_param.keys()): + out_param["model." 
+ key] = out_param.pop(key).cpu() + + if args.message_mode.lower() == "numpy": + # cast out_param to float32 preparing for communication with numpy + # otherwise do nothing + out_param = {k: v.to(torch.float32) for k, v in out_param.items()} + + # construct trained FL model + output_model = flare.FLModel( + params=out_param, + metrics={"eval_loss": eval_loss}, + meta={"NUM_STEPS_CURRENT_ROUND": trainer.train_dataset.num_rows}, + ) + # send model back to NVFlare + flare.send(output_model) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft.py new file mode 100755 index 0000000000..1fc7d53826 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft.py @@ -0,0 +1,154 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model +from transformers import AutoModelForCausalLM +from trl import SFTConfig, SFTTrainer + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. 
### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_centralized/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=3, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # Evaluate + trainer.evaluate() + + # Train + trainer.train() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft_iter.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft_iter.py new file mode 100755 index 0000000000..5584b63a88 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/hf_sft_peft_iter.py @@ -0,0 +1,193 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, set_peft_model_state_dict, utils +from transformers import AutoModelForCausalLM, trainer_utils +from trl import SFTConfig, SFTTrainer + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. ### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_centralized/llama-3.2-1b-dolly-sft-iter", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=1, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + 
gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + # safetensors has some issues in saving lm_head.weight, disable it for now + save_safetensors=False, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # Save base model state_dict, which will be used as the starting + # weights for each round - to show the weights are loaded correctly + initial_model_path = os.path.join(args.output_path, "model_dict_base.pt") + if train_mode: + params = get_peft_model_state_dict(model) + else: + params = model.state_dict() + torch.save(params, initial_model_path) + + # Train iteratively by using "resume" functionality + # and replace the resume weights every round + for curr_round in range(3): + print(f"current_round={curr_round}") + + # Load and Evaluate model file + state_dict_replace = torch.load(initial_model_path, map_location="cpu", weights_only=True) + if train_mode: + set_peft_model_state_dict(trainer.model, state_dict_replace) + else: + trainer.model.load_state_dict(state_dict_replace) + trainer.evaluate() + + # Train + if curr_round == 0: + # First round, start from pretrained model + trainer.train() + else: + # replace local resume weights with global weights + resume_from_checkpoint_folder = trainer_utils.get_last_checkpoint(trainer.args.output_dir) + if train_mode: + # PEFT model small, directly save via torch.save + resume_model_file_path = os.path.join(resume_from_checkpoint_folder, utils.WEIGHTS_NAME) + torch.save(state_dict_replace, resume_model_file_path) + else: + # SFT model can be large, save via HF API + # Disable safetensor for now + trainer.model.save_pretrained(resume_from_checkpoint_folder, safe_serialization=False) + # increment num_train_epochs so that the trainer will continue training + trainer.args.num_train_epochs += 1 + # continue training + trainer.train(resume_from_checkpoint=True) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/preprocess_dolly.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/preprocess_dolly.py new file mode 100755 index 0000000000..5c02ba9506 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.2_llm_sft/utils/preprocess_dolly.py @@ -0,0 +1,93 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
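+
+# Summary (see the code below for details): this script reads the
+# databricks-dolly-15k jsonl file (fields: instruction, context, response),
+# shuffles it with a fixed seed, and writes training/validation/testing.jsonl
+# splits in which each line is a {"input": ..., "output": ...} record consumed
+# by the SFT/PEFT training scripts.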
+ +import argparse +import json +import os + +import numpy as np +import pandas as pd + + +def data_args(): + parser = argparse.ArgumentParser(description="Preprocess data to train and validation files in jsonl format") + parser.add_argument("--training_file", type=str, required=True, help="Path to training set") + parser.add_argument("--validation_file", type=str, help="Path to validation set, if given, append to training data") + parser.add_argument("--validation_ratio", type=float, default=0.1, help="Ratio of validation set, defult to 10%") + parser.add_argument("--testing_ratio", type=float, default=0.1, help="Ratio of testing set, defult to 10%") + parser.add_argument("--output_dir", type=str, required=True, help="Path to output folder") + args = parser.parse_args() + return args + + +def split_to_jsonl(data, output_dir, validation_ratio, testing_ratio): + print("Preprocessing data to NeMo_SFT jsonl format...") + output_path_tra = os.path.join(output_dir, "training.jsonl") + output_path_val = os.path.join(output_dir, "validation.jsonl") + output_path_tst = os.path.join(output_dir, "testing.jsonl") + + data_ct = len(data) + val_threshold = int(data_ct * validation_ratio) + test_threshold = int(data_ct * testing_ratio) + + with open(output_path_val, "w") as g, open(output_path_tst, "w") as h, open(output_path_tra, "w") as i: + for index, item in data.iterrows(): + context = item["context"].strip() + if context != "": + # Randomize context and instruction order. + context_first = np.random.randint(0, 2) == 0 + if context_first: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{context}\n\n{instruction}" + output = item["response"] + else: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{instruction}\n\n{context}" + output = item["response"] + else: + input = item["instruction"] + output = item["response"] + # write to jsonl file according to index + if index < val_threshold: + h.write(json.dumps({"input": input, "output": output}) + "\n") + elif index < val_threshold + test_threshold: + g.write(json.dumps({"input": input, "output": output}) + "\n") + else: + i.write(json.dumps({"input": input, "output": output}) + "\n") + print(f"{index + 1} out of {data_ct} Data was successfully preprocessed and saved.") + + +def main(): + args = data_args() + # load training data + path_to_train = args.training_file + train = pd.read_json(path_to_train, lines=True) + # load validation data if provided and append to training data + if args.validation_file: + path_to_val = args.validation_file + val = pd.read_json(path_to_val, lines=True) + train = pd.concat([train, val]) + # randomize the order of the data + data_full = train.sample(frac=1, random_state=0).reset_index(drop=True) + # split data into training, validation and testing + val_ratio = args.validation_ratio + test_ratio = args.testing_ratio + output_dir = args.output_dir + split_to_jsonl(data_full, output_dir, val_ratio, test_ratio) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/LLM_PEFT.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/LLM_PEFT.ipynb new file mode 100644 index 0000000000..acdbef0911 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/LLM_PEFT.ipynb @@ -0,0 +1,162 @@ +{ 
+ "cells": [ + { + "cell_type": "markdown", + "id": "4ef8a52f-f0bd-493c-ac70-32d5f7e5b87e", + "metadata": {}, + "source": [ + "# LLM Parameter-Efficient Fine-Tuning (PEFT) via HuggingFace Trainer APIs\n", + "Similar to last section [LLM Supervised Fine-Tuning (SFT)](../08.2_llm_sft/LLM_SFT.ipynb), in this section, we illustrate how to use [NVIDIA FLARE](https://nvidia.github.io/NVFlare) for Large Language Models (LLMs) PEFT task with [HuggingFace](https://huggingface.co/) Trainer APIs with [PEFT library](https://github.com/huggingface/peft).\n", + "\n", + "We use the same model of the [Llama-3.2-1B model](https://huggingface.co/meta-llama/Llama-3.2-1B) to showcase the functionality of federated PEFT. For PEFT, we used LoRA method, other PEFT methods (e.g. p-tuning, prompt-tuning) can be easily adapted as well by modifying the configs following [PEFT](https://github.com/huggingface/peft) examples.\n", + "\n", + "We conducted these experiments on a single 48GB RTX 6000 Ada GPU. \n", + "\n", + "To use Llama-3.2-1B model, please request access to the model here https://huggingface.co/meta-llama/Llama-3.2-1B and login with an access token using huggingface-cli.\n", + "\n", + "## Setup\n", + "Git LFS is also necessary for downloads, please follow the steps in this [link](https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md).\n", + "\n", + "Install required packages for training:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32803fa2-ac9f-4e9e-b5d0-5ad5bac52bd8", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "8437dc31-3073-4502-af79-7b0e981312a6", + "metadata": {}, + "source": [ + "## Data Preparation\n", + "In this example, we use two datasets to illustrate the PEFT training.\n", + "\n", + "We download and preprocess three data sets: [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), and [Oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb6c216f-057f-4225-bcec-1839e6139c4d", + "metadata": {}, + "outputs": [], + "source": [ + "! git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k /tmp/nvflare/dataset/llm/dolly\n", + "! git clone https://huggingface.co/datasets/OpenAssistant/oasst1 /tmp/nvflare/dataset/llm/oasst1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ec5c533-b8a5-4380-be49-2eaf717c1712", + "metadata": {}, + "outputs": [], + "source": [ + "! python utils/preprocess_dolly.py --training_file /tmp/nvflare/dataset/llm/dolly/databricks-dolly-15k.jsonl --output_dir /tmp/nvflare/dataset/llm/dolly\n", + "! python utils/preprocess_oasst1.py --training_file /tmp/nvflare/dataset/llm/oasst1/data/train-00000-of-00001-b42a775f407cee45.parquet --validation_file /tmp/nvflare/dataset/llm/oasst1/data/validation-00000-of-00001-134b8fd0c89408b6.parquet --output_dir /tmp/nvflare/dataset/llm/oasst1" + ] + }, + { + "cell_type": "markdown", + "id": "c258adcb-1de4-4ecf-a733-7a91b9ab1dd7", + "metadata": {}, + "source": [ + "## Centralized Baseline\n", + "We run three centralized baselines as" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "acfcc12d-999d-494f-9c7b-bb747975b4f4", + "metadata": {}, + "outputs": [], + "source": [ + "! python utils/hf_sft_peft.py --output_path /tmp/nvflare/workspace/llm/dolly_cen_peft --train_mode PEFT\n", + "! 
+ {
+ "cell_type": "markdown",
+ "id": "c258adcb-1de4-4ecf-a733-7a91b9ab1dd7",
+ "metadata": {},
+ "source": [
+ "## Centralized Baseline\n",
+ "We run two centralized baselines, one per dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "acfcc12d-999d-494f-9c7b-bb747975b4f4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "! python utils/hf_sft_peft.py --output_path /tmp/nvflare/workspace/llm/dolly_cen_peft --train_mode PEFT\n",
+ "! python utils/hf_sft_peft.py --data_path_train /tmp/nvflare/dataset/llm/oasst1/training.jsonl --data_path_valid /tmp/nvflare/dataset/llm/oasst1/validation.jsonl --output_path /tmp/nvflare/workspace/llm/oasst_cen_peft --train_mode PEFT"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f55dd56b-2ca1-4c80-8b7c-68334079528c",
+ "metadata": {},
+ "source": [
+ "### Federated Training Results\n",
+ "We run the federated training with two clients (one per dataset) using the NVFlare Simulator via the [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75cc3483-311a-4858-bbb2-c3f236bb5421",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "! python peft_job.py --client_ids dolly oasst1 --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/all_fl_peft --job_dir /tmp/nvflare/workspace/jobs/llm_fl_peft --train_mode PEFT --threads 2 "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0da8848c-b1e6-4090-adfe-8398e144cd14",
+ "metadata": {},
+ "source": [
+ "The PEFT training curves are shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07c0f887-8e43-4206-9e76-101c060854fc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%load_ext tensorboard\n",
+ "%tensorboard --logdir /tmp/nvflare/workspace/llm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b91d614e-2f12-45dc-9997-825dd696f421",
+ "metadata": {},
+ "source": [
+ "With both SFT and PEFT examples covered, let's move on to the next section, [LLM Quantization](../08.4_llm_quantization/LLM_quantization.ipynb), where we will see how to make message transmission more efficient."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5f05d5a6-df1c-4e23-a928-c0542bca05b9",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/peft_job.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/peft_job.py
new file mode 100644
index 0000000000..5e6fd99f4e
--- /dev/null
+++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/peft_job.py
@@ -0,0 +1,194 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
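+
+# Example invocation (a sketch; the two-client PEFT run and the paths below
+# follow the accompanying LLM_PEFT.ipynb notebook):
+#
+#   python peft_job.py --client_ids dolly oasst1 \
+#       --data_path /tmp/nvflare/dataset/llm/ \
+#       --workspace_dir /tmp/nvflare/workspace/llm/all_fl_peft \
+#       --job_dir /tmp/nvflare/workspace/jobs/llm_fl_peft \
+#       --train_mode PEFT --threads 2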
+ +import argparse +import os + +from nvflare import FedJob, FilterType +from nvflare.app_common.widgets.intime_model_selector import IntimeModelSelector +from nvflare.app_common.workflows.fedavg import FedAvg +from nvflare.app_opt.pt.file_model_persistor import PTFileModelPersistor +from nvflare.app_opt.pt.quantization.dequantizor import ModelDequantizor +from nvflare.app_opt.pt.quantization.quantizor import ModelQuantizor +from nvflare.job_config.script_runner import ScriptRunner + + +def main(): + args = define_parser() + train_script = "src/hf_sft_peft_fl.py" + client_ids = args.client_ids + num_clients = len(client_ids) + + if args.threads: + num_threads = args.threads + else: + num_threads = num_clients + + if num_threads < num_clients: + print("The number of threads smaller than the number of clients, runner clean-up will be performed.") + clean_up = 1 + else: + clean_up = 0 + + num_rounds = args.num_rounds + workspace_dir = args.workspace_dir + job_dir = args.job_dir + model_name_or_path = args.model_name_or_path + train_mode = args.train_mode + message_mode = args.message_mode + + # Create the FedJob + if train_mode.lower() == "sft": + job = FedJob(name="llm_hf_sft", min_clients=num_clients) + output_path = "sft" + elif train_mode.lower() == "peft": + job = FedJob(name="llm_hf_peft", min_clients=num_clients) + output_path = "peft" + else: + raise ValueError(f"Invalid train_mode: {train_mode}, only SFT and PEFT are supported.") + + # Define the FedAvg controller workflow and send to server + controller = FedAvg( + num_clients=num_clients, + num_rounds=num_rounds, + ) + job.to(controller, "server") + + if args.quantize_mode: + # If using quantization, add quantize filters. + quantizor = ModelQuantizor(quantization_type=args.quantize_mode) + dequantizor = ModelDequantizor() + job.to(quantizor, "server", tasks=["train"], filter_type=FilterType.TASK_DATA) + job.to(dequantizor, "server", tasks=["train"], filter_type=FilterType.TASK_RESULT) + + # Define the model persistor and send to server + # First send the model to the server + job.to("src/hf_peft_model.py", "server") + # Then send the model persistor to the server + model_args = {"path": "src.hf_peft_model.CausalLMPEFTModel", "args": {"model_name_or_path": model_name_or_path}} + job.to(PTFileModelPersistor(model=model_args), "server", id="persistor") + + # Add model selection widget and send to server + job.to(IntimeModelSelector(key_metric="eval_loss", negate_key_metric=True), "server", id="model_selector") + + # Send ScriptRunner to all clients + for i in range(num_clients): + client_id = client_ids[i] + site_name = f"site-{client_id}" + data_path_train = os.path.join(args.data_path, client_id, "training.jsonl") + data_path_valid = os.path.join(args.data_path, client_id, "validation.jsonl") + + script_args = f"--model_name_or_path {model_name_or_path} --data_path_train {data_path_train} --data_path_valid {data_path_valid} --output_path {output_path} --train_mode {train_mode} --message_mode {message_mode} --clean_up {clean_up}" + if message_mode == "tensor": + params_exchange_format = "pytorch" + elif message_mode == "numpy": + params_exchange_format = "numpy" + else: + raise ValueError(f"Invalid message_mode: {message_mode}, only numpy and tensor are supported.") + + runner = ScriptRunner( + script=train_script, + script_args=script_args, + params_exchange_format=params_exchange_format, + launch_external_process=False, + ) + job.to(runner, site_name, tasks=["train"]) + + if args.quantize_mode: + job.to(quantizor, site_name, 
tasks=["train"], filter_type=FilterType.TASK_RESULT) + job.to(dequantizor, site_name, tasks=["train"], filter_type=FilterType.TASK_DATA) + + # Export the job + print("job_dir=", job_dir) + job.export_job(job_dir) + + # Run the job + print("workspace_dir=", workspace_dir) + print("num_threads=", num_threads) + job.simulator_run(workspace_dir, threads=num_threads, gpu=args.gpu) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--client_ids", + nargs="+", + type=str, + default="", + help="Clinet IDs, used to get the data path for each client", + ) + parser.add_argument( + "--num_rounds", + type=int, + default=3, + help="Number of rounds, default to 3", + ) + parser.add_argument( + "--workspace_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/workdir", + help="work directory, default to '/tmp/nvflare/jobs/llm_hf/workdir'", + ) + parser.add_argument( + "--job_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/jobdir", + help="directory for job export, default to '/tmp/nvflare/jobs/llm_hf/jobdir'", + ) + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + help="model name or path", + ) + parser.add_argument( + "--data_path", + type=str, + default="", + help="root directory for training and validation data", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--quantize_mode", + type=str, + default=None, + help="quantization mode, default to None (no quantization)", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + parser.add_argument( + "--threads", + type=int, + help="number of threads to use for FL simulation, default to the number of clients", + ) + parser.add_argument( + "--gpu", + type=str, + default="0", + help="gpu assignments for simulating clients, comma separated, default to single gpu", + ) + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/requirements.txt b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/requirements.txt new file mode 100644 index 0000000000..4a58cb756e --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/requirements.txt @@ -0,0 +1,7 @@ +torch==2.5.1 +datasets +tensorboard +transformers==4.48.0 +peft==0.14.0 +trl==0.13.0 +bitsandbytes diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_peft_model.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_peft_model.py new file mode 100755 index 0000000000..b2545864c7 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_peft_model.py @@ -0,0 +1,38 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +from peft import LoraConfig, get_peft_model +from transformers import AutoModelForCausalLM + + +class CausalLMPEFTModel(torch.nn.Module): + def __init__(self, model_name_or_path): + super(CausalLMPEFTModel, self).__init__() + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + full_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + ) + self.model = get_peft_model(full_model, peft_config) + + def forward(self, input_id): + output = self.model(input_ids=input_id, return_dict=False) + return output diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_sft_peft_fl.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_sft_peft_fl.py new file mode 100755 index 0000000000..96667151bc --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/src/hf_sft_peft_fl.py @@ -0,0 +1,250 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import os + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, set_peft_model_state_dict, utils +from transformers import AutoModelForCausalLM, trainer_utils +from trl import SFTConfig, SFTTrainer + +import nvflare.client as flare + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. 
### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="./dataset/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="./dataset/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_federated/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + parser.add_argument("--local_epoch", type=int, default=1) + parser.add_argument("--clean_up", type=int, default=0) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=args.local_epoch, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + save_total_limit=2, + # safetensors has some issues in saving lm_head.weight, disable it for now + save_safetensors=False, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # initializes NVFlare client API + flare.init() + + # Train federated rounds + # start with global model at the beginning of each round + while flare.is_running(): + # receives FLModel from NVFlare + input_model = flare.receive() + curr_round = input_model.current_round + print(f"current_round={curr_round}") + + # Update the key name received from 
global model if using model def file + global_model = copy.deepcopy(input_model.params) + for key in list(global_model.keys()): + global_model[key.replace("model.", "", 1)] = global_model.pop(key) + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights, mode): + # Special load func for PEFT + if train_mode: + set_peft_model_state_dict(trainer.model, input_weights) + else: + trainer.model.load_state_dict(input_weights) + metric_score = trainer.evaluate() + print(f"Evaluation metric score: {metric_score}") + return metric_score + + # evaluate on received global model + eval_loss = evaluate(global_model, train_mode) + eval_loss = float(eval_loss["eval_loss"]) + + # Load global model and previous training states + # Since we perform iterative training by using "resume" functionality + # we need to replace the resume weights with global weights every round + if curr_round == 0: + # First round, start from pretrained model + trainer.train() + else: + # replace local resume weights with global weights + resume_from_checkpoint_folder = trainer_utils.get_last_checkpoint(trainer.args.output_dir) + if train_mode: + # PEFT model small, directly save via torch.save + resume_model_file_path = os.path.join(resume_from_checkpoint_folder, utils.WEIGHTS_NAME) + torch.save(global_model, resume_model_file_path) + else: + # SFT model can be large, save via HF API + # Disable safetensor for now + trainer.model.save_pretrained(resume_from_checkpoint_folder, safe_serialization=False) + # increment num_train_epochs so that the trainer will continue training + if args.clean_up: + # runner got cleaned up, set num_train_epochs with curr_round + trainer.args.num_train_epochs = (curr_round + 1) * args.local_epoch + else: + # runner still alive, increment num_train_epochs with local_epoch + trainer.args.num_train_epochs += args.local_epoch + print(f"Increment num_train_epochs to {trainer.args.num_train_epochs}") + # continue training + trainer.train(resume_from_checkpoint=True) + + # compose output model to send back to server + if train_mode: + # PEFT, load PEFT part from trainer model + out_param = get_peft_model_state_dict(trainer.model) + else: + # SFT, load whole model state_dict + out_param = trainer.model.state_dict() + + # update the key name sent to global model + if not train_mode: + for key in list(out_param.keys()): + out_param["model." + key] = out_param.pop(key).cpu() + + if args.message_mode.lower() == "numpy": + # cast out_param to float32 preparing for communication with numpy + # otherwise do nothing + out_param = {k: v.to(torch.float32) for k, v in out_param.items()} + + # construct trained FL model + output_model = flare.FLModel( + params=out_param, + metrics={"eval_loss": eval_loss}, + meta={"NUM_STEPS_CURRENT_ROUND": trainer.train_dataset.num_rows}, + ) + # send model back to NVFlare + flare.send(output_model) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/hf_sft_peft.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/hf_sft_peft.py new file mode 100755 index 0000000000..1fc7d53826 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/hf_sft_peft.py @@ -0,0 +1,154 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. 
All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model +from transformers import AutoModelForCausalLM +from trl import SFTConfig, SFTTrainer + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. ### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="/tmp/nvflare/dataset/llm/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_centralized/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=3, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + 
logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # Evaluate + trainer.evaluate() + + # Train + trainer.train() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_dolly.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_dolly.py new file mode 100755 index 0000000000..5c02ba9506 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_dolly.py @@ -0,0 +1,93 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +import pandas as pd + + +def data_args(): + parser = argparse.ArgumentParser(description="Preprocess data to train and validation files in jsonl format") + parser.add_argument("--training_file", type=str, required=True, help="Path to training set") + parser.add_argument("--validation_file", type=str, help="Path to validation set, if given, append to training data") + parser.add_argument("--validation_ratio", type=float, default=0.1, help="Ratio of validation set, defult to 10%") + parser.add_argument("--testing_ratio", type=float, default=0.1, help="Ratio of testing set, defult to 10%") + parser.add_argument("--output_dir", type=str, required=True, help="Path to output folder") + args = parser.parse_args() + return args + + +def split_to_jsonl(data, output_dir, validation_ratio, testing_ratio): + print("Preprocessing data to NeMo_SFT jsonl format...") + output_path_tra = os.path.join(output_dir, "training.jsonl") + output_path_val = os.path.join(output_dir, "validation.jsonl") + output_path_tst = os.path.join(output_dir, "testing.jsonl") + + data_ct = len(data) + val_threshold = int(data_ct * validation_ratio) + test_threshold = int(data_ct * testing_ratio) + + with open(output_path_val, "w") as g, open(output_path_tst, "w") as h, open(output_path_tra, "w") as i: + for index, item in data.iterrows(): + context = item["context"].strip() + if context != "": + # Randomize context and instruction order. 
+ context_first = np.random.randint(0, 2) == 0 + if context_first: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{context}\n\n{instruction}" + output = item["response"] + else: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{instruction}\n\n{context}" + output = item["response"] + else: + input = item["instruction"] + output = item["response"] + # write to jsonl file according to index + if index < val_threshold: + h.write(json.dumps({"input": input, "output": output}) + "\n") + elif index < val_threshold + test_threshold: + g.write(json.dumps({"input": input, "output": output}) + "\n") + else: + i.write(json.dumps({"input": input, "output": output}) + "\n") + print(f"{index + 1} out of {data_ct} Data was successfully preprocessed and saved.") + + +def main(): + args = data_args() + # load training data + path_to_train = args.training_file + train = pd.read_json(path_to_train, lines=True) + # load validation data if provided and append to training data + if args.validation_file: + path_to_val = args.validation_file + val = pd.read_json(path_to_val, lines=True) + train = pd.concat([train, val]) + # randomize the order of the data + data_full = train.sample(frac=1, random_state=0).reset_index(drop=True) + # split data into training, validation and testing + val_ratio = args.validation_ratio + test_ratio = args.testing_ratio + output_dir = args.output_dir + split_to_jsonl(data_full, output_dir, val_ratio, test_ratio) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_oasst1.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_oasst1.py new file mode 100755 index 0000000000..de4de63040 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_peft/utils/preprocess_oasst1.py @@ -0,0 +1,101 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
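+# Preprocess the OASST1 parquet release for SFT: keep the top-ranked (rank 0)
+# English assistant replies, pair each with its parent prompter message as the
+# instruction, then shuffle and split into training/validation/testing jsonl files.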
+ +import argparse +import json +import os + +import pandas as pd +import pyarrow.parquet as pq + + +def data_args(): + parser = argparse.ArgumentParser(description="Preprocess data to train and validation files in jsonl format") + parser.add_argument("--training_file", type=str, required=True, help="Path to training set") + parser.add_argument("--validation_file", type=str, help="Path to validation set, if given, append to training data") + parser.add_argument("--validation_ratio", type=float, default=0.1, help="Ratio of validation set, defult to 10%") + parser.add_argument("--testing_ratio", type=float, default=0.1, help="Ratio of testing set, defult to 10%") + parser.add_argument("--output_dir", type=str, required=True, help="Path to output folder") + args = parser.parse_args() + return args + + +def get_data_for_sft(data): + data_assistant = data[(data.role == "assistant") & (data["rank"] == 0.0)].copy() + data_prompter = data[(data.role == "prompter")].copy() + data_prompter = data_prompter.set_index("message_id") + data_assistant["output"] = data_assistant["text"].values + + inputs = [] + parent_ids = [] + for index, item in data_assistant.iterrows(): + input = data_prompter.loc[item.parent_id] + inputs.append(input.text) + parent_ids.append(input.parent_id) + data_assistant["instruction"] = inputs + data_assistant["parent_id"] = parent_ids + data_assistant = data_assistant[data_assistant.lang == "en"] + data_assistant = data_assistant[["instruction", "output"]] + return data_assistant + + +def split_to_jsonl(data, output_dir, validation_ratio, testing_ratio): + print("Preprocessing data to NeMo_SFT jsonl format...") + output_path_tra = os.path.join(output_dir, "training.jsonl") + output_path_val = os.path.join(output_dir, "validation.jsonl") + output_path_tst = os.path.join(output_dir, "testing.jsonl") + + data_ct = len(data) + val_threshold = int(data_ct * validation_ratio) + test_threshold = int(data_ct * testing_ratio) + + with open(output_path_val, "w") as g, open(output_path_tst, "w") as h, open(output_path_tra, "w") as i: + for index, item in data.iterrows(): + input = item["instruction"] + output = item["output"] + # write to jsonl file according to index + if index < val_threshold: + h.write(json.dumps({"input": input, "output": output}) + "\n") + elif index < val_threshold + test_threshold: + g.write(json.dumps({"input": input, "output": output}) + "\n") + else: + i.write(json.dumps({"input": input, "output": output}) + "\n") + print(f"{index + 1} out of {data_ct} Data was successfully preprocessed and saved.") + + +def main(): + args = data_args() + # load training data + path_to_train = args.training_file + ds = pq.read_table(path_to_train) + data = ds.to_pandas() + train = get_data_for_sft(data) + # load validation data if provided and append to training data + if args.validation_file: + path_to_val = args.validation_file + ds = pq.read_table(path_to_val) + data = ds.to_pandas() + val = get_data_for_sft(data) + train = pd.concat([train, val]) + # randomize the order of the data + data_full = train.sample(frac=1, random_state=0).reset_index(drop=True) + # split data into training, validation and testing + val_ratio = args.validation_ratio + test_ratio = args.testing_ratio + output_dir = args.output_dir + split_to_jsonl(data_full, output_dir, val_ratio, test_ratio) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_sft/LLM_SFT.ipynb 
b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_sft/LLM_SFT.ipynb deleted file mode 100644 index 64e27e07dc..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.3_llm_sft/LLM_SFT.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "2ba95511-a7b7-41be-8ae1-1655c7906ec6", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_fed_nlp/federated_nlp_with_bert.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_fed_nlp/federated_nlp_with_bert.ipynb deleted file mode 100644 index aa1d5fe055..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_fed_nlp/federated_nlp_with_bert.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "e621496d-a92c-46b3-ab7a-ba693b737a2b", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/LLM_quantization.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/LLM_quantization.ipynb new file mode 100644 index 0000000000..820743aac5 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/LLM_quantization.ipynb @@ -0,0 +1,164 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dac4d921-f6ec-4d6e-b509-5d49edea9fd5", + "metadata": {}, + "source": [ + "# Model Quantization for Communication\n", + "In the previous examples, we used numpy in float32 for communication. To reduce the message size, we can use model precision conversion and quantization \n", + "from float32 to 16-bit, 8-bit, and 4-bit for communication. Quantization is enabled by NVFlare's [filter mechanism](https://nvflare.readthedocs.io/en/main/programming_guide/filters.html). 
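To make the filter wiring concrete, here is a condensed sketch of how the accompanying `sft_job.py` in this section attaches the quantization filters; the site name and quantization mode are simply the values used by the example commands below, not the only valid choices:

```python
from nvflare import FedJob, FilterType
from nvflare.app_opt.pt.quantization.dequantizor import ModelDequantizor
from nvflare.app_opt.pt.quantization.quantizor import ModelQuantizor

job = FedJob(name="llm_hf_sft", min_clients=1)

# Quantize the global model before it leaves the server, and dequantize
# the returning client updates when they arrive.
quantizor = ModelQuantizor(quantization_type="float16")
dequantizor = ModelDequantizor()
job.to(quantizor, "server", tasks=["train"], filter_type=FilterType.TASK_DATA)
job.to(dequantizor, "server", tasks=["train"], filter_type=FilterType.TASK_RESULT)

# Mirror the filters on the client so its result is quantized before sending
# and the received global model is dequantized before local training.
job.to(quantizor, "site-dolly", tasks=["train"], filter_type=FilterType.TASK_RESULT)
job.to(dequantizor, "site-dolly", tasks=["train"], filter_type=FilterType.TASK_DATA)
```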
We can then run federated training with model quantization using the commands shown below.\n", + "16-bit is a direct precision conversion, while 8-bit and 4-bit quantization are performed by [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes/tree/main).\n", + "Note that the 4-bit quantization types (`fp4` or `nf4`) need device support." + ] + }, + { + "cell_type": "markdown", + "id": "92762739-4f73-4d52-9fb6-a8f4a3989eb1", + "metadata": {}, + "source": [ + "## Data Preparation\n", + "Again, we use a single dataset to illustrate SFT. We download and preprocess [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69a9f90e-e7ee-4720-9813-6c94df303083", + "metadata": {}, + "outputs": [], + "source": [ + "! git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k /tmp/nvflare/dataset/llm/dolly\n", + "! python utils/preprocess_dolly.py --training_file /tmp/nvflare/dataset/llm/dolly/databricks-dolly-15k.jsonl --output_dir /tmp/nvflare/dataset/llm/dolly" + ] + }, + { + "cell_type": "markdown", + "id": "e1d7cfd0-dd29-4cfd-960e-45315a6a09c7", + "metadata": {}, + "source": [ + "We run the same SFT pipeline with different quantization configurations:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50c04163-b660-4f64-a41d-a8b480dfdf0a", + "metadata": {}, + "outputs": [], + "source": [ + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_16 --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_16 --train_mode SFT --quantize_mode float16\n", + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_8 --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_8 --train_mode SFT --quantize_mode blockwise8\n", + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_fp4 --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_fp4 --train_mode SFT --quantize_mode float4\n", + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_nf4 --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_nf4 --train_mode SFT --quantize_mode normfloat4" + ] + }, + { + "cell_type": "markdown", + "id": "559bdca9-b9c7-45dd-93b8-34da452fd3e2", + "metadata": {}, + "source": [ + "By reducing the message precision from float32 to 16-/8-/4-bit, the message sizes (in MB) of the Llama-3.2-1B model become: \n", + "\n", + "| Quantization | Raw Model Size | Quantized Model Size | Quantization Meta Size |\n", + "|-------------------|----------------|----------------------|------------------------|\n", + "| float16 | 5716.26 | 2858.13 | 0.00 |\n", + "| blockwise8 | 5716.26 | 1429.06 | 1.54 |\n", + "| float4 | 5716.26 | 714.53 | 89.33 |\n", + "| normalized float4 | 5716.26 | 714.53 | 89.33 |\n", + "\n", + "Note that quantization generates additional metadata, which can be significant for the 4-bit cases.\n", + "\n", + "## Model Communication with Tensor\n", + "In addition, since the model is trained in bf16, instead of first converting to numpy in float32 we can communicate tensors directly in bf16, avoiding the message size inflation caused by the conversion. 
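As a rough sanity check on the numbers above (and on the bf16 figure), the sizes can be estimated directly from the model's state dict. The sketch below is only an illustration: it assumes access to the gated `meta-llama/llama-3.2-1b` checkpoint, ignores the per-block quantization metadata, and the exact total depends on whether the `lm_head` weight is materialized in the state dict:

```python
import torch
from transformers import AutoModelForCausalLM

# Count the parameters that would be packed into one federated message.
model = AutoModelForCausalLM.from_pretrained("meta-llama/llama-3.2-1b", torch_dtype=torch.float32)
n_params = sum(t.numel() for t in model.state_dict().values())

mb = 1024**2
print(f"float32 (numpy) message: {n_params * 4 / mb:9.2f} MB")
print(f"float16 / bf16 tensor:   {n_params * 2 / mb:9.2f} MB")
print(f"blockwise8 quantized:    {n_params * 1 / mb:9.2f} MB (+ metadata)")
print(f"float4 / normfloat4:     {n_params * 0.5 / mb:9.2f} MB (+ metadata)")
```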
\n", + "We can use the following command to run the federated training with direct tensor communication, without and with quantization:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7f12509-bc5f-49b8-969b-185bb3310e3c", + "metadata": {}, + "outputs": [], + "source": [ + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_tensor --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_tensor --train_mode SFT --message_mode tensor\n", + "! python sft_job.py --client_ids dolly --data_path /tmp/nvflare/dataset/llm/ --workspace_dir /tmp/nvflare/workspace/llm/dolly_fl_sft_tensor_fp4 --job_dir /tmp/nvflare/workspace/jobs/llm_hf_sft_tensor_fp4 --train_mode SFT --message_mode tensor --quantize_mode float4" + ] + }, + { + "cell_type": "markdown", + "id": "bdfd4997-493a-40f5-82dc-86ec7a224513", + "metadata": {}, + "source": [ + "In this case, since the tensor is in bf16, and the quantization reduces it to float4, the message size change is thus:\n", + "```\n", + "Before quantization: 2858.13 MB. After quantization: 714.53 MB with meta: 89.33 MB.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "b63d733d-1a4f-4f41-b7c1-62f204991a46", + "metadata": {}, + "source": [ + "## Training Curves" + ] + }, + { + "cell_type": "markdown", + "id": "acfe7fba-5ed9-425a-9b6a-678619a2a759", + "metadata": {}, + "source": [ + "The SFT curves are shown below, we can see it achieves decent alignments. These results show that for the example training schemes and data, model precision conversion / quantization does not significantly impact the training while reducing the message size to 1/2, 1/4, and even 1/8, which can significantly reduce the message size, making it crucial for transmitting LLM updates." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f0632ec-1df1-4894-a1a9-557f568fd468", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext tensorboard\n", + "%tensorboard --logdir /tmp/nvflare/workspace/llm" + ] + }, + { + "cell_type": "markdown", + "id": "4637b943-76da-4b3e-8e57-353d4ec0d17e", + "metadata": {}, + "source": [ + "Quantization significantly reduced the communication burden by reducinng the message size sent over the network, however at local level, memory usage is still demanding to prepare the messages - large memory needs to be allocated to hold the LLM weights. 
Therefore, let's move on to the next section addressing this challenge - [LLM Streaming](../08.5_llm_streaming/LLM_streaming.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "834dd765-3321-4644-b64d-e3a796579a05", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/sft_job.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/sft_job.py new file mode 100644 index 0000000000..c13cbb11d1 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/sft_job.py @@ -0,0 +1,194 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
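+# Job construction script for federated SFT/PEFT with optional quantization:
+# it builds a FedAvg workflow, attaches ModelQuantizor / ModelDequantizor as
+# task-data / task-result filters on the server and each client when
+# --quantize_mode is given, registers the server-side model persistor and
+# model selector, and finally exports and runs the job in the NVFlare simulator.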
+ +import argparse +import os + +from nvflare import FedJob, FilterType +from nvflare.app_common.widgets.intime_model_selector import IntimeModelSelector +from nvflare.app_common.workflows.fedavg import FedAvg +from nvflare.app_opt.pt.file_model_persistor import PTFileModelPersistor +from nvflare.app_opt.pt.quantization.dequantizor import ModelDequantizor +from nvflare.app_opt.pt.quantization.quantizor import ModelQuantizor +from nvflare.job_config.script_runner import ScriptRunner + + +def main(): + args = define_parser() + train_script = "src/hf_sft_peft_fl.py" + client_ids = args.client_ids + num_clients = len(client_ids) + + if args.threads: + num_threads = args.threads + else: + num_threads = num_clients + + if num_threads < num_clients: + print("The number of threads smaller than the number of clients, runner clean-up will be performed.") + clean_up = 1 + else: + clean_up = 0 + + num_rounds = args.num_rounds + workspace_dir = args.workspace_dir + job_dir = args.job_dir + model_name_or_path = args.model_name_or_path + train_mode = args.train_mode + message_mode = args.message_mode + + # Create the FedJob + if train_mode.lower() == "sft": + job = FedJob(name="llm_hf_sft", min_clients=num_clients) + output_path = "sft" + elif train_mode.lower() == "peft": + job = FedJob(name="llm_hf_peft", min_clients=num_clients) + output_path = "peft" + else: + raise ValueError(f"Invalid train_mode: {train_mode}, only SFT and PEFT are supported.") + + # Define the FedAvg controller workflow and send to server + controller = FedAvg( + num_clients=num_clients, + num_rounds=num_rounds, + ) + job.to(controller, "server") + + if args.quantize_mode: + # If using quantization, add quantize filters. + quantizor = ModelQuantizor(quantization_type=args.quantize_mode) + dequantizor = ModelDequantizor() + job.to(quantizor, "server", tasks=["train"], filter_type=FilterType.TASK_DATA) + job.to(dequantizor, "server", tasks=["train"], filter_type=FilterType.TASK_RESULT) + + # Define the model persistor and send to server + # First send the model to the server + job.to("src/hf_sft_model.py", "server") + # Then send the model persistor to the server + model_args = {"path": "src.hf_sft_model.CausalLMModel", "args": {"model_name_or_path": model_name_or_path}} + job.to(PTFileModelPersistor(model=model_args), "server", id="persistor") + + # Add model selection widget and send to server + job.to(IntimeModelSelector(key_metric="eval_loss", negate_key_metric=True), "server", id="model_selector") + + # Send ScriptRunner to all clients + for i in range(num_clients): + client_id = client_ids[i] + site_name = f"site-{client_id}" + data_path_train = os.path.join(args.data_path, client_id, "training.jsonl") + data_path_valid = os.path.join(args.data_path, client_id, "validation.jsonl") + + script_args = f"--model_name_or_path {model_name_or_path} --data_path_train {data_path_train} --data_path_valid {data_path_valid} --output_path {output_path} --train_mode {train_mode} --message_mode {message_mode} --clean_up {clean_up}" + if message_mode == "tensor": + params_exchange_format = "pytorch" + elif message_mode == "numpy": + params_exchange_format = "numpy" + else: + raise ValueError(f"Invalid message_mode: {message_mode}, only numpy and tensor are supported.") + + runner = ScriptRunner( + script=train_script, + script_args=script_args, + params_exchange_format=params_exchange_format, + launch_external_process=False, + ) + job.to(runner, site_name, tasks=["train"]) + + if args.quantize_mode: + job.to(quantizor, site_name, 
tasks=["train"], filter_type=FilterType.TASK_RESULT) + job.to(dequantizor, site_name, tasks=["train"], filter_type=FilterType.TASK_DATA) + + # Export the job + print("job_dir=", job_dir) + job.export_job(job_dir) + + # Run the job + print("workspace_dir=", workspace_dir) + print("num_threads=", num_threads) + job.simulator_run(workspace_dir, threads=num_threads, gpu=args.gpu) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--client_ids", + nargs="+", + type=str, + default="", + help="Clinet IDs, used to get the data path for each client", + ) + parser.add_argument( + "--num_rounds", + type=int, + default=3, + help="Number of rounds, default to 5", + ) + parser.add_argument( + "--workspace_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/workdir", + help="work directory, default to '/tmp/nvflare/jobs/llm_hf/workdir'", + ) + parser.add_argument( + "--job_dir", + type=str, + default="/tmp/nvflare/jobs/llm_hf/jobdir", + help="directory for job export, default to '/tmp/nvflare/jobs/llm_hf/jobdir'", + ) + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + help="model name or path", + ) + parser.add_argument( + "--data_path", + type=str, + default="", + help="root directory for training and validation data", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--quantize_mode", + type=str, + default=None, + help="quantization mode, default to None (no quantization)", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + parser.add_argument( + "--threads", + type=int, + help="number of threads to use for FL simulation, default to the number of clients", + ) + parser.add_argument( + "--gpu", + type=str, + default="0", + help="gpu assignments for simulating clients, comma separated, default to single gpu", + ) + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_model.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_model.py new file mode 100755 index 0000000000..fd84c3f06a --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_model.py @@ -0,0 +1,28 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch +from transformers import AutoModelForCausalLM + + +class CausalLMModel(torch.nn.Module): + def __init__(self, model_name_or_path): + super(CausalLMModel, self).__init__() + self.model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + ) + + def forward(self, input_id): + output = self.model(input_ids=input_id, return_dict=False) + return output diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_peft_fl.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_peft_fl.py new file mode 100755 index 0000000000..96667151bc --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/src/hf_sft_peft_fl.py @@ -0,0 +1,250 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import os + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model, get_peft_model_state_dict, set_peft_model_state_dict, utils +from transformers import AutoModelForCausalLM, trainer_utils +from trl import SFTConfig, SFTTrainer + +import nvflare.client as flare + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. 
### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="./dataset/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="./dataset/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_federated/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + parser.add_argument( + "--message_mode", + type=str, + default="numpy", + help="message mode, numpy or tensor, default to numpy", + ) + parser.add_argument("--local_epoch", type=int, default=1) + parser.add_argument("--clean_up", type=int, default=0) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=args.local_epoch, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + save_total_limit=2, + # safetensors has some issues in saving lm_head.weight, disable it for now + save_safetensors=False, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # initializes NVFlare client API + flare.init() + + # Train federated rounds + # start with global model at the beginning of each round + while flare.is_running(): + # receives FLModel from NVFlare + input_model = flare.receive() + curr_round = input_model.current_round + print(f"current_round={curr_round}") + + # Update the key name received from 
global model if using model def file + global_model = copy.deepcopy(input_model.params) + for key in list(global_model.keys()): + global_model[key.replace("model.", "", 1)] = global_model.pop(key) + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights, mode): + # Special load func for PEFT + if train_mode: + set_peft_model_state_dict(trainer.model, input_weights) + else: + trainer.model.load_state_dict(input_weights) + metric_score = trainer.evaluate() + print(f"Evaluation metric score: {metric_score}") + return metric_score + + # evaluate on received global model + eval_loss = evaluate(global_model, train_mode) + eval_loss = float(eval_loss["eval_loss"]) + + # Load global model and previous training states + # Since we perform iterative training by using "resume" functionality + # we need to replace the resume weights with global weights every round + if curr_round == 0: + # First round, start from pretrained model + trainer.train() + else: + # replace local resume weights with global weights + resume_from_checkpoint_folder = trainer_utils.get_last_checkpoint(trainer.args.output_dir) + if train_mode: + # PEFT model small, directly save via torch.save + resume_model_file_path = os.path.join(resume_from_checkpoint_folder, utils.WEIGHTS_NAME) + torch.save(global_model, resume_model_file_path) + else: + # SFT model can be large, save via HF API + # Disable safetensor for now + trainer.model.save_pretrained(resume_from_checkpoint_folder, safe_serialization=False) + # increment num_train_epochs so that the trainer will continue training + if args.clean_up: + # runner got cleaned up, set num_train_epochs with curr_round + trainer.args.num_train_epochs = (curr_round + 1) * args.local_epoch + else: + # runner still alive, increment num_train_epochs with local_epoch + trainer.args.num_train_epochs += args.local_epoch + print(f"Increment num_train_epochs to {trainer.args.num_train_epochs}") + # continue training + trainer.train(resume_from_checkpoint=True) + + # compose output model to send back to server + if train_mode: + # PEFT, load PEFT part from trainer model + out_param = get_peft_model_state_dict(trainer.model) + else: + # SFT, load whole model state_dict + out_param = trainer.model.state_dict() + + # update the key name sent to global model + if not train_mode: + for key in list(out_param.keys()): + out_param["model." 
+ key] = out_param.pop(key).cpu() + + if args.message_mode.lower() == "numpy": + # cast out_param to float32 preparing for communication with numpy + # otherwise do nothing + out_param = {k: v.to(torch.float32) for k, v in out_param.items()} + + # construct trained FL model + output_model = flare.FLModel( + params=out_param, + metrics={"eval_loss": eval_loss}, + meta={"NUM_STEPS_CURRENT_ROUND": trainer.train_dataset.num_rows}, + ) + # send model back to NVFlare + flare.send(output_model) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/hf_sft_peft.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/hf_sft_peft.py new file mode 100755 index 0000000000..862148368c --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/hf_sft_peft.py @@ -0,0 +1,154 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +# Add deterministic seed for reproducibility illustration +import random + +import datasets +import numpy as np +import torch +from peft import LoraConfig, get_peft_model +from transformers import AutoModelForCausalLM +from trl import SFTConfig, SFTTrainer + +torch.manual_seed(0) +random.seed(0) +np.random.seed(0) + + +def format_instruction(example): + output_texts = [] + for i in range(len(example["input"])): + text = f"### Instruction: Generate Output according to the information and question given by Input. 
### Input:{example['input'][i]} ### Response: {example['output'][i]}" + output_texts.append(text) + return output_texts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + type=str, + default="meta-llama/llama-3.2-1b", + ) + parser.add_argument( + "--data_path_train", + type=str, + default="./dataset/dolly/training.jsonl", + ) + parser.add_argument( + "--data_path_valid", + type=str, + default="./dataset/dolly/validation.jsonl", + ) + parser.add_argument( + "--output_path", + type=str, + default="./workspace_centralized/llama-3.2-1b-dolly-sft", + ) + parser.add_argument( + "--train_mode", + type=str, + default="SFT", + help="training mode, SFT or PEFT, default to SFT", + ) + args = parser.parse_args() + + # Dataset + dataset_train = datasets.load_dataset("json", data_files=args.data_path_train, split="train") + dataset_valid = datasets.load_dataset("json", data_files=args.data_path_valid, split="train") + # Print dataset info + print(f"Dataset size: training {len(dataset_train)}, validation {len(dataset_valid)}") + # record every 5% of the dataset + batch_size = 4 + gra_accu_steps = 10 + logging_steps = int(len(dataset_train) / (20 * batch_size * gra_accu_steps)) + print(f"logging_steps: {logging_steps}") + + # Model configs + model_name_or_path = args.model_name_or_path + peft_config = None + + # Load model + default_dtype = torch.get_default_dtype() + torch.set_default_dtype(torch.bfloat16) + model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + device_map="auto", + use_cache=False, + torch_dtype=torch.bfloat16, + ) + torch.set_default_dtype(default_dtype) + + # Train mode + if args.train_mode.lower() == "sft": + train_mode = 0 + elif args.train_mode.lower() == "peft": + train_mode = 1 + else: + raise ValueError(f"Invalid train_mode: {args.train_mode}, only SFT and PEFT are supported.") + + # PEFT specific + if train_mode: + # PEFT configs + peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", + ) + model = get_peft_model(model, peft_config) + model.config.pretraining_tp = 1 + + # Training arguments + train_args = SFTConfig( + output_dir=args.output_path, + num_train_epochs=3, + per_device_train_batch_size=batch_size, + gradient_accumulation_steps=gra_accu_steps, + gradient_checkpointing=False, + optim="paged_adamw_32bit", + logging_steps=logging_steps, + save_strategy="epoch", + learning_rate=5e-4, + bf16=True, + max_grad_norm=0.3, + warmup_ratio=0.03, + lr_scheduler_type="constant", + disable_tqdm=True, + max_seq_length=1024, + ) + + # Trainer + trainer = SFTTrainer( + model=model, + train_dataset=dataset_train, + eval_dataset=dataset_valid, + peft_config=peft_config, + formatting_func=format_instruction, + args=train_args, + ) + + # Evaluate + trainer.evaluate() + + # Train + trainer.train() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/preprocess_dolly.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/preprocess_dolly.py new file mode 100755 index 0000000000..5c02ba9506 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.4_llm_quantization/utils/preprocess_dolly.py @@ -0,0 +1,93 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +import pandas as pd + + +def data_args(): + parser = argparse.ArgumentParser(description="Preprocess data to train and validation files in jsonl format") + parser.add_argument("--training_file", type=str, required=True, help="Path to training set") + parser.add_argument("--validation_file", type=str, help="Path to validation set, if given, append to training data") + parser.add_argument("--validation_ratio", type=float, default=0.1, help="Ratio of validation set, defult to 10%") + parser.add_argument("--testing_ratio", type=float, default=0.1, help="Ratio of testing set, defult to 10%") + parser.add_argument("--output_dir", type=str, required=True, help="Path to output folder") + args = parser.parse_args() + return args + + +def split_to_jsonl(data, output_dir, validation_ratio, testing_ratio): + print("Preprocessing data to NeMo_SFT jsonl format...") + output_path_tra = os.path.join(output_dir, "training.jsonl") + output_path_val = os.path.join(output_dir, "validation.jsonl") + output_path_tst = os.path.join(output_dir, "testing.jsonl") + + data_ct = len(data) + val_threshold = int(data_ct * validation_ratio) + test_threshold = int(data_ct * testing_ratio) + + with open(output_path_val, "w") as g, open(output_path_tst, "w") as h, open(output_path_tra, "w") as i: + for index, item in data.iterrows(): + context = item["context"].strip() + if context != "": + # Randomize context and instruction order. 
+ context_first = np.random.randint(0, 2) == 0 + if context_first: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{context}\n\n{instruction}" + output = item["response"] + else: + instruction = item["instruction"].strip() + assert instruction != "" + input = f"{instruction}\n\n{context}" + output = item["response"] + else: + input = item["instruction"] + output = item["response"] + # write to jsonl file according to index + if index < val_threshold: + h.write(json.dumps({"input": input, "output": output}) + "\n") + elif index < val_threshold + test_threshold: + g.write(json.dumps({"input": input, "output": output}) + "\n") + else: + i.write(json.dumps({"input": input, "output": output}) + "\n") + print(f"{index + 1} out of {data_ct} Data was successfully preprocessed and saved.") + + +def main(): + args = data_args() + # load training data + path_to_train = args.training_file + train = pd.read_json(path_to_train, lines=True) + # load validation data if provided and append to training data + if args.validation_file: + path_to_val = args.validation_file + val = pd.read_json(path_to_val, lines=True) + train = pd.concat([train, val]) + # randomize the order of the data + data_full = train.sample(frac=1, random_state=0).reset_index(drop=True) + # split data into training, validation and testing + val_ratio = args.validation_ratio + test_ratio = args.testing_ratio + output_dir = args.output_dir + split_to_jsonl(data_full, output_dir, val_ratio, test_ratio) + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/LLM_streaming.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/LLM_streaming.ipynb new file mode 100644 index 0000000000..ec9fe8b31e --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/LLM_streaming.ipynb @@ -0,0 +1,138 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6ad30489-5e3d-4994-966d-666bd40a13e0", + "metadata": {}, + "source": [ + "# Object Streaming\n", + "\n", + "## Overview\n", + "The examples here demonstrate how to use object streamers to send large objects in a memory-efficient manner.\n", + "\n", + "Current default setting is to send and receive large objects in full, so extra memory will be needed and allocated to hold the received message. \n", + "This works fine when the message is small, but can become a limit when model size is large, e.g. for large language models.\n", + "\n", + "To save on memory usage, we can stream the message send / receive: when sending large objects (e.g. a dict),\n", + "streamer sends containers entry by entry (e.g. one dict item each time); further, if we save the object to a file, \n", + "streamer can send the file by chunks (default chunk size is 1MB).\n", + "\n", + "Thus, the memory demand can be reduced to the size of the largest entry for container streaming; while nearly no extra memory is needed for file\n", + "streaming. For example, if sending a dict with 10 1GB entries, without streaming, it will take 10GB extra space to send the dict. 
\n", + "With container streaming, it only requires extra 1GB; and if saved to a file before sending, it only requires 1MB extra space to send the file.\n", + "\n", + "All examples are run with NVFlare Simulator via [JobAPI](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html).\n", + "## Concepts\n", + "\n", + "### Object Streamer\n", + "ObjectStreamer is the base class to stream an object piece by piece. The `StreamableEngine` built in the NVFlare can\n", + "stream any implementations of ObjectSteamer\n", + "\n", + "The following implementations are included in NVFlare,\n", + "\n", + "* `ContainerStreamer`: This class is used to stream a container entry by entry. Currently, dict, list and set are supported\n", + "* `FileStreamer`: This class is used to stream a file\n", + "\n", + "Note that the container streamer split the stream by the top level entries. All the sub entries of a top entry are expected to be\n", + "sent as a whole, therefore the memory is determined by the largest entry at top level.\n", + "\n", + "### Object Retriever\n", + "Building upon the streamers, `ObjectRetriever` is designed for easier integration with existing code: to request an object to be streamed from a remote site. It automatically sets up the streaming\n", + "on both ends and handles the coordination.\n", + "\n", + "Similarly, the following implementations are available,\n", + "\n", + "* `ContainerRetriever`: This class is used to retrieve a container from remote site using `ContainerStreamer`.\n", + "* `FileRetriever`: This class is used to retrieve a file from remote site using `FileStreamer`.\n", + "\n", + "Note that to use ContainerRetriever, the container must be given a name and added on the sending site,\n", + "```\n", + "ContainerRetriever.add_container(\"model\", model_dict)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "736c3e06-c9e2-48ee-b787-3d00efa8d37d", + "metadata": {}, + "source": [ + "## Full-scale Examples and Comparisons\n", + "In the following, we will demonstrate how to use the streamer with Retriever in a workflow with real large language model object, \n", + "and compare the memory usage with and without streaming. To track the memory usage, we use a simple script `utils/log_memory.sh`. \n", + "Note that the tracked usage is not fully accurate, but it is sufficient to give us a rough idea.\n", + "\n", + "With a simple [controller](src/streaming_controller.py) and [executor](src/streaming_executor.py), we simulate a single communication between server and client: server load a `llama-3.2-1b` model, and send to client via three transmission modes: regular, container, and file. This process (clients receiving global model) is often the first stage of a federated learning round, thus the communication burden is realistically reflected. \n", + "\n", + "All three settings: regular, container streaming, and file streaming, are integrated in the same script to avoid extra variabilities.\n", + "To run the examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a152f129-5b3d-4a17-8355-b4627f9d8e72", + "metadata": {}, + "outputs": [], + "source": [ + "! bash regular_transmission.sh\n", + "! bash container_stream.sh\n", + "! bash file_stream.sh" + ] + }, + { + "cell_type": "markdown", + "id": "c8ffbeec-fa29-46ca-b103-02e81ac12cce", + "metadata": {}, + "source": [ + "We then examine the memory usage by comparing the peak memory usage of the three settings. 
The results are shown below,\n", + "note that the numbers here are the results of one experiment on one machine, and can be highly variable depending on the system and the environment.\n", + "\n", + "| Setting | Peak Memory Usage (MB) | Job Finishing Time (s) |\n", + "|-----------------------|------------------------|------------------------|\n", + "| Regular Transmission | 42,427 | 47 |\n", + "| Container Streaming | 23,265 | 50 |\n", + "| File Streaming | 19,176 | 170 |\n", + "\n", + "As shown, the memory usage is significantly reduced by using streaming, especially for file streaming, \n", + "while file streaming takes much longer time to finish the job.\n" + ] + }, + { + "cell_type": "markdown", + "id": "d3805e71-d929-4d65-9f5d-bc50f799d194", + "metadata": {}, + "source": [ + "Now that we covered LLM-related features, let's have a [recap](../08.6_recap/recap.ipynb) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "248cb159-59a1-45b2-8dd9-88f653b22511", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/container_stream.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/container_stream.sh new file mode 100644 index 0000000000..f16893b00a --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/container_stream.sh @@ -0,0 +1,2 @@ +bash utils/log_memory.sh >>/tmp/nvflare/logs/container.txt & +python streaming_job.py --retriever_mode container diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/file_stream.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/file_stream.sh new file mode 100644 index 0000000000..6566124c9f --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/file_stream.sh @@ -0,0 +1,2 @@ +bash utils/log_memory.sh >>/tmp/nvflare/logs/file.txt & +python streaming_job.py --retriever_mode file diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/regular_transmission.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/regular_transmission.sh new file mode 100644 index 0000000000..dd4ef9c091 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/regular_transmission.sh @@ -0,0 +1,3 @@ +mkdir /tmp/nvflare/logs/ +bash utils/log_memory.sh >>/tmp/nvflare/logs/regular.txt & +python streaming_job.py diff --git 
a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_controller.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_controller.py new file mode 100644 index 0000000000..c3f64c1f4d --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_controller.py @@ -0,0 +1,126 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import torch +from transformers import AutoModelForCausalLM + +from nvflare.apis.controller_spec import Client, ClientTask, Task +from nvflare.apis.event_type import EventType +from nvflare.apis.fl_context import FLContext +from nvflare.apis.impl.controller import Controller +from nvflare.apis.shareable import Shareable +from nvflare.apis.signal import Signal +from nvflare.app_common.streamers.container_retriever import ContainerRetriever +from nvflare.app_common.streamers.file_retriever import FileRetriever + + +class StreamingController(Controller): + def __init__(self, retriever_mode=None, retriever_id=None, task_timeout=200, task_check_period: float = 0.5): + Controller.__init__(self, task_check_period=task_check_period) + self.retriever_mode = retriever_mode + self.retriever_id = retriever_id + self.retriever = None + self.task_timeout = task_timeout + + def start_controller(self, fl_ctx: FLContext): + self.file_name, self.model = self._get_test_model() + if self.retriever_mode == "container": + self.retriever.add_container("model", self.model) + + def stop_controller(self, fl_ctx: FLContext): + pass + + def handle_event(self, event_type: str, fl_ctx: FLContext): + # perform initialization and checks + if event_type == EventType.START_RUN: + engine = fl_ctx.get_engine() + if self.retriever_mode: + c = engine.get_component(self.retriever_id) + if self.retriever_mode == "container": + if not isinstance(c, ContainerRetriever): + self.system_panic( + f"invalid container_retriever {self.retriever_id}, wrong type: {type(c)}", + fl_ctx, + ) + return + self.retriever = c + elif self.retriever_mode == "file": + if not isinstance(c, FileRetriever): + self.system_panic( + f"invalid file_retriever {self.retriever_id}, wrong type: {type(c)}", + fl_ctx, + ) + return + self.retriever = c + else: + self.system_panic( + f"invalid retriever_mode {self.retriever_mode}", + fl_ctx, + ) + return + + def control_flow(self, abort_signal: Signal, fl_ctx: FLContext): + s = Shareable() + # set shareable payload + if self.retriever_mode == "container": + s["model"] = "model" + elif self.retriever_mode == "file": + s["model"] = self.file_name + else: + s["model"] = self.model + task = Task(name="retrieve_model", data=s, timeout=self.task_timeout) + self.broadcast_and_wait( + task=task, + fl_ctx=fl_ctx, + min_responses=1, + 
abort_signal=abort_signal, + ) + client_resps = {} + for ct in task.client_tasks: + assert isinstance(ct, ClientTask) + resp = ct.result + if resp is None: + resp = "no answer" + else: + assert isinstance(resp, Shareable) + self.log_info(fl_ctx, f"got resp {resp} from client {ct.client.name}") + resp = resp.get_return_code() + client_resps[ct.client.name] = resp + return {"status": "OK", "data": client_resps} + + def process_result_of_unknown_task( + self, client: Client, task_name: str, client_task_id: str, result: Shareable, fl_ctx: FLContext + ): + pass + + @staticmethod + def _get_test_model(): + model_name = "meta-llama/llama-3.2-1b" + # load model to dict + model = AutoModelForCausalLM.from_pretrained( + model_name, + torch_dtype=torch.float32, + device_map="auto", + use_cache=False, + ) + params = model.state_dict() + for key in params: + params[key] = params[key].cpu().numpy() + + # save params dict to a npz file + file_name = "model.npz" + np.savez(file_name, **params) + + return file_name, params diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_executor.py b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_executor.py new file mode 100644 index 0000000000..228db2c226 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/src/streaming_executor.py @@ -0,0 +1,110 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
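+# Client-side executor for the streaming comparison: for the "retrieve_model"
+# task it either receives the model dict directly (regular transmission),
+# pulls it via ContainerRetriever, or pulls the saved .npz file via
+# FileRetriever and reloads it with numpy.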
+ +import os + +import numpy as np + +from nvflare.apis.event_type import EventType +from nvflare.apis.executor import Executor +from nvflare.apis.fl_constant import ReturnCode +from nvflare.apis.fl_context import FLContext +from nvflare.apis.shareable import Shareable, make_reply +from nvflare.apis.signal import Signal +from nvflare.app_common.streamers.container_retriever import ContainerRetriever +from nvflare.app_common.streamers.file_retriever import FileRetriever + + +class StreamingExecutor(Executor): + def __init__(self, retriever_mode=None, retriever_id=None, task_timeout=200): + Executor.__init__(self) + self.retriever_mode = retriever_mode + self.retriever_id = retriever_id + self.retriever = None + self.task_timeout = task_timeout + + def handle_event(self, event_type: str, fl_ctx: FLContext): + # perform initialization and checks + if event_type == EventType.START_RUN: + engine = fl_ctx.get_engine() + if self.retriever_mode: + c = engine.get_component(self.retriever_id) + if self.retriever_mode == "container": + if not isinstance(c, ContainerRetriever): + self.system_panic( + f"invalid container_retriever {self.retriever_id}, wrong type: {type(c)}", + fl_ctx, + ) + return + self.retriever = c + elif self.retriever_mode == "file": + if not isinstance(c, FileRetriever): + self.system_panic( + f"invalid file_retriever {self.retriever_id}, wrong type: {type(c)}", + fl_ctx, + ) + return + self.retriever = c + else: + self.system_panic( + f"invalid retriever_mode {self.retriever_mode}", + fl_ctx, + ) + return + + def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable: + self.log_info(fl_ctx, f"got task {task_name}") + if task_name == "retrieve_model": + model = shareable.get("model") + if not model: + self.log_error(fl_ctx, "missing model info in request") + return make_reply(ReturnCode.BAD_TASK_DATA) + + if self.retriever_mode is None: + self.log_info(fl_ctx, f"received container type: {type(model)} size: {len(model)}") + return make_reply(ReturnCode.OK) + elif self.retriever_mode == "container": + rc, result = self.retriever.retrieve_container( + from_site="server", + fl_ctx=fl_ctx, + timeout=self.task_timeout, + name=model, + ) + if rc != ReturnCode.OK: + self.log_error(fl_ctx, f"failed to retrieve {model}: {rc}") + return make_reply(rc) + self.log_info(fl_ctx, f"received container type: {type(result)} size: {len(result)}") + return make_reply(ReturnCode.OK) + elif self.retriever_mode == "file": + rc, result = self.retriever.retrieve_file( + from_site="server", + fl_ctx=fl_ctx, + timeout=self.task_timeout, + file_name=model, + ) + if rc != ReturnCode.OK: + self.log_error(fl_ctx, f"failed to retrieve file {model}: {rc}") + return make_reply(rc) + # rename the received file to its original name + rename_path = os.path.join(os.path.dirname(result), model) + os.rename(result, rename_path) + self.log_info(fl_ctx, f"received file: {result}, renamed to: {rename_path}") + # Load local model + result = dict(np.load(rename_path)) + self.log_info(fl_ctx, f"loaded file content type: {type(result)} size: {len(result)}") + + return make_reply(ReturnCode.OK) + else: + self.log_error(fl_ctx, f"got unknown task {task_name}") + return make_reply(ReturnCode.TASK_UNKNOWN) diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/streaming_job.py 
b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/streaming_job.py new file mode 100644 index 0000000000..9569462157 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/streaming_job.py @@ -0,0 +1,80 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from src.streaming_controller import StreamingController +from src.streaming_executor import StreamingExecutor + +from nvflare import FedJob +from nvflare.app_common.streamers.container_retriever import ContainerRetriever +from nvflare.app_common.streamers.file_retriever import FileRetriever + + +def main(): + args = define_parser() + retriever_mode = args.retriever_mode + + # Create the FedJob + job = FedJob(name="streaming", min_clients=1) + + if retriever_mode: + if retriever_mode == "file": + retriever = FileRetriever(source_dir="./", dest_dir="./") + job_dir = "/tmp/nvflare/workspace/jobs/file_streaming" + work_dir = "/tmp/nvflare/workspace/works/file_streaming" + elif retriever_mode == "container": + retriever = ContainerRetriever() + job_dir = "/tmp/nvflare/workspace/jobs/container_streaming" + work_dir = "/tmp/nvflare/workspace/works/container_streaming" + else: + raise ValueError(f"invalid retriever_mode {retriever_mode}") + job.to_server(retriever, id="retriever") + job.to_clients(retriever, id="retriever") + + controller = StreamingController(retriever_mode=retriever_mode, retriever_id="retriever") + job.to_server(controller) + + executor = StreamingExecutor(retriever_mode=retriever_mode, retriever_id="retriever") + job.to_clients(executor, tasks=["*"]) + else: + job_dir = "/tmp/nvflare/workspace/jobs/regular_streaming" + work_dir = "/tmp/nvflare/workspace/works/regular_streaming" + controller = StreamingController() + job.to_server(controller) + executor = StreamingExecutor() + job.to_clients(executor, tasks=["*"]) + + # Export the job + print("job_dir=", job_dir) + job.export_job(job_dir) + + # Run the job + print("workspace_dir=", work_dir) + job.simulator_run(work_dir, n_clients=1, threads=1) + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--retriever_mode", + type=str, + default=None, + help="Retriever mode, default is None, can be 'container' or 'file'", + ) + return parser.parse_args() + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/utils/log_memory.sh b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/utils/log_memory.sh new file mode 100644 index 0000000000..33f5af5944 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_llm_streaming/utils/log_memory.sh @@ -0,0 +1,9 @@ +#!/bin/bash -e + +echo " date time $(free -m | grep total | sed -E 's/^ (.*)/\1/g')" +counter=1 +while [ $counter -le 400 ]; do + echo "$(date '+%Y-%m-%d %H:%M:%S') $(free -m | grep Mem: | sed 's/Mem://g')" + sleep 0.5 + ((counter++)) +done diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_retiever_model_training/federated_retriever_model_training.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_retiever_model_training/federated_retriever_model_training.ipynb deleted file mode 100644 index 0e85096e61..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.5_retiever_model_training/federated_retriever_model_training.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "0aeea5cc-b56e-4c19-9e2e-7100451fceea", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_llm_quantization/LLM_quantization.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_llm_quantization/LLM_quantization.ipynb deleted file mode 100644 index 513cd88e27..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_llm_quantization/LLM_quantization.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "0b6bf4f1-bb20-40f8-a397-47f677ac3c59", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_recap/recap.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_recap/recap.ipynb new file mode 100644 index 0000000000..c7950c1afd --- /dev/null +++ b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.6_recap/recap.ipynb @@ -0,0 +1,77 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0b9e326e-1d97-45c5-ac54-6bf581e4223f", + "metadata": {}, + "source": [ + "# Summary of Chapter 8" + ] + }, + { + "cell_type": "markdown", + "id": "bd1c7385-dd22-4d5c-8d2e-7a73f0b3ac2d", + 
"metadata": {}, + "source": [ + "In this chapter, we visited NVFlare's offerings in enabling efficient and robust federated training of language models, especially in the era of LLMs.\n", + "\n", + "Specifically, the following items have been covered:\n", + "1. **[Federated NLP with BERT Model](../08.1_fed_bert/federated_nlp_with_bert.ipynb)**: task-specific model training with BERT in a \n", + "2. **[Federated LLM Tuning with SFT](../08.2_llm_sft/LLM_SFT.ipynb)**: supervised Fine-Tuning and its role in adapting LLMs in federated learning\n", + "3. **[Federated LLM Tuning with PEFT](../08.3_llm_peft/LLM_PEFT.ipynb)**: PEFT in adapting LLMs for specific tasks, which can be achieve in a federated setting\n", + "4. **[Model Quantization for Transmission](../08.4_llm_quantization/LLM_quantization.ipynb)**: reduce the message size with quantization methods so as to address the significant communication burden when performing federated LLM learning with SFT. \n", + "5. **[Message Streaming for Model Transmission](../08.5_llm_streaming/LLM_streaming.ipynb)**: with quantization reducing communication cost, system memory requirement is still high for prepareing the message on either side. Therefore, we enabled streaming capabilities for more efficient and robust model communication." + ] + }, + { + "cell_type": "markdown", + "id": "4f3f3bf8", + "metadata": {}, + "source": [ + "Key takeaways of this section are:\n", + "1. NVFlare enables federated training of language models, from BERT to most recent LLMs, under popular training schemes of both SFT and PEFT.\n", + "2. NVFlare enables efficient and robust communications, accounting for both message transmission and local memory requirements, such that the resource can be best utilized in real-life applications." 
+ ] + }, + { + "cell_type": "markdown", + "id": "8a32effe-3b9d-4cd8-b53c-f91907de9d95", + "metadata": {}, + "source": [ + "With NVFlare, popular training schemes widely used in the LLM domain can be easily adapted to the federated learning paradigm, unleashing more possibilities.\n", + "\n", + "Now let's move on to [Chapter 9](../../chapter-9_flare_low_level_apis/09.0_introduction/introduction.ipynb).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4321c7c-d56c-49b9-89a2-c503290b8232", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.7_llm_streaming/LLM_streaming.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.7_llm_streaming/LLM_streaming.ipynb deleted file mode 100644 index deef21b826..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.7_llm_streaming/LLM_streaming.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "b9c89570-59af-415d-bcb7-a59b304eb49e", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.8_recap/recap.ipynb b/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.8_recap/recap.ipynb deleted file mode 100644 index 63414e9f32..0000000000 --- a/examples/tutorials/self-paced-training/part-4_advanced_federated_learning/chapter-8_federated_LLM_training/08.8_recap/recap.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "ea637265-6cbb-4e74-9ab3-8eb884991d20", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}
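For reference, the controller's _get_test_model() in this diff serializes the model weights with np.savez, and the executor's file branch reloads them with dict(np.load(...)). The following is a minimal standalone sketch of that save/load round trip, separate from the example code above; it assumes only numpy and torch are installed and uses a tiny torch.nn.Linear stand-in instead of the Llama-3.2-1B checkpoint so it runs without downloading any weights.

import numpy as np
import torch

# Stand-in for the LLM checkpoint used in _get_test_model(); a tiny module keeps the sketch runnable.
model = torch.nn.Linear(4, 2)

# Mirror the controller side: move tensors to CPU numpy arrays and save them as a single npz file.
params = {key: value.cpu().numpy() for key, value in model.state_dict().items()}
np.savez("model.npz", **params)

# Mirror the executor's file branch: reload the npz into a plain dict of numpy arrays.
restored = dict(np.load("model.npz"))
for key, array in restored.items():
    # Convert back to torch tensors if the weights are to be reused locally.
    assert torch.from_numpy(array).shape == model.state_dict()[key].shape
print(f"restored {len(restored)} arrays from model.npz")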