Commit
update main readme + moe readme
Vicba committed Sep 3, 2024
1 parent 5354f74 commit cfc69e3
Showing 2 changed files with 18 additions and 3 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -36,11 +36,12 @@ To get started with the code in this repository, follow these steps:

Here's a list of the white papers and their corresponding implementations available in this repository:

- **["Attention is all you need" paper](https://arxiv.org/abs/1706.03762)**: Introduces the Transformer model, which uses self-attention to focus on important words in a sentence, making it faster and better at understanding long sentences compared to older models.
- **["An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"](https://arxiv.org/abs/2010.11929)**: Visual transformer (ViT) is a type of neural network that uses self-attention to process images for classification tasks.

- **["Attention is All You Need" paper](https://arxiv.org/abs/1706.03762)**: Introduces the Transformer model, which uses self-attention to focus on important words in a sentence, making it faster and better at understanding long sentences compared to older models.
- **["An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"](https://arxiv.org/abs/2010.11929)**: Presents the Visual Transformer (ViT), a neural network that applies self-attention to image classification tasks.
- **Mixture of Experts (various papers)**: Describes a neural network architecture that uses multiple specialized models (experts) and a gating mechanism to improve performance and adaptability by activating only the most relevant experts for a given task.
Each implementation is located in its own directory, which includes:

-Each one contains:
- **`README.md`**: Detailed instructions on how to use the code for that specific paper.
<!-- - **`main.py`**: The main script to run the implementation. -->
- **`requirements.txt`**: Python dependencies required for that implementation.
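
For reference, the scaled dot-product self-attention operation mentioned in the first two entries of the paper list above reduces to a few lines. The sketch below (plain NumPy, with made-up variable names; illustrative only, not code from this repository) shows a single attention head:

```python
# Illustrative sketch of single-head scaled dot-product self-attention (not this repo's code).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how strongly each token attends to every other
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # attention-weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 16)
```
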
14 changes: 14 additions & 0 deletions moe/README.md
@@ -0,0 +1,14 @@
# Mixture Of Experts (MoE)

![MoE](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe/00_switch_transformer.png)

A Mixture of Experts model improves performance by combining a set of specialized sub-models, or "experts," with a gating mechanism that routes each input to them. Only a subset of the experts is activated for any given input, which saves computation and improves efficiency.

> Note: my code is a simplified version; it does not add or handle noise, randomness, etc.

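For illustration, a bare-bones top-1 gated MoE layer along the lines described above could look like the sketch below (PyTorch; the class name `SimpleMoE` and its parameters are hypothetical and this is not the implementation in this directory):

```python
# Illustrative sketch only: a minimal top-1 gated MoE layer, not this repo's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):  # hypothetical name, not from this repo
    def __init__(self, dim, num_experts=4, hidden=64):
        super().__init__()
        # Each expert is a small independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        # The gating network scores every expert for each input.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                          # x: (batch, dim)
        scores = F.softmax(self.gate(x), dim=-1)   # (batch, num_experts)
        weight, expert_idx = scores.max(dim=-1)    # route each input to its best-scoring expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                 # inputs routed to expert i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

print(SimpleMoE(dim=32)(torch.randn(8, 32)).shape)  # torch.Size([8, 32])
```
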
Resources:
- https://huggingface.co/blog/moe
- https://cameronrwolfe.substack.com/p/conditional-computation-the-birth
- https://github.com/lucidrains/mixture-of-experts
- https://www.youtube.com/watch?v=0U_65fLoTq0
