docs/mxnet-neuron/tutorial-compile-infer.md (+38, -33)
@@ -5,34 +5,27 @@ Neuron supports both Python module and Symbol APIs and the C predict API. The fo
## Steps Overview:

1. Launch an EC2 instance for compilation and/or inference
- 2. Install Neuron for Compiler and Runtime execution
- 3. Run Example
-    1. Compile
-    2. Execute inference on Inf1
+ 2. Install Neuron for compilation and runtime execution
+ 3. Compile on compilation server
+ 4. Execute inference on Inf1
## Step 1: Launch EC2 Instances
- A typical workflow with the Neuron SDK will be for a trained ML model to be compiled on a compilation server and then the artifacts distributed to the (fleet of) Inf1 instances for execution. Neuron enables MXNet to be used for all of these steps.
+ A typical workflow with the Neuron SDK will be to compile trained ML models on a compilation server and then distribute the artifacts to a fleet of Inf1 instances for execution. Neuron enables MXNet to be used for all of these steps.
+ 1. Select an AMI of your choice, which may be Ubuntu 16.x, Ubuntu 18.x, or Amazon Linux 2 based. To use a pre-built Deep Learning AMI, which includes all of the needed packages, see [Launching and Configuring a DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/launch-config.html).
+ 2. Select and launch an EC2 instance of your choice to compile. Launch an instance by following the [EC2 instructions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html#ec2-launch-instance).
+    * It is recommended to use c5.4xlarge or larger. For this example we will use a c5.4xlarge.
+    * If you would like to compile and infer on the same machine, please select inf1.6xlarge.
+ 3. Select and launch an Inf1 instance of your choice if not compiling and inferencing on the same instance. Launch an instance by following the [EC2 instructions](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html#ec2-launch-instance).
- 1. Select an AMI of your choice, which may be Ubuntu 16.x, Ubuntu 18.x, Amazon Linux 2 based. To use a pre-built Deep Learning AMI, which includes all of the needed packages, see these instructions: https://docs.aws.amazon.com/dlami/latest/devguide/launch-config.html
- 2. Select and start an EC2 instance of your choice to compile
-    1. It is recommended to use C5.4xlarge or larger. For this example we will use a C5.4xlarge
-    2. If you would like to compile and infer on the same machine, please select Inf1.6xlarge
- 3. Select and start an Inf1 instance of your choice to run the compiled model you created in step 2.2.

+ ## Step 2: Install Neuron Compiler and MXNet-Neuron On Compilation Instance
- ## Step 2: Install Neuron
+ If using DLAMI, activate the aws_neuron_mxnet_p36 environment and skip this step.
- If using DLAMI and aws_neuron_mxnet_p36 environment, you can skip to Step 3.
+ On the instance you are going to use for compilation, install both the Neuron Compiler and MXNet-Neuron.
- ### Compiler Instance: Install Neuron Compiler and MXnet-Neuron
- On the instance you are going to use for compilation, you must have both the Neuron Compiler and the MXNet-Neuron installed. (The inference instance must have the MXNet-Neuron and the Neuron Runtime installed.)
- Steps Overview:
- #### Using Virtualenv:
- 1. Install virtualenv if needed:
+ 2.1. Install virtualenv if needed:
```bash
# Ubuntu
sudo apt-get update
@@ -44,32 +37,32 @@ sudo yum update
sudo yum install -y python3
pip3 install --user virtualenv
```
- 2. Setup a new Python 3.6 environment:
+ 2.2. Setup a new Python 3.6 environment:
```bash
virtualenv --python=python3.6 test_env_p36
source test_env_p36/bin/activate
```
- 3. Modify Pip repository configurations to point to the Neuron repository.
+ 2.3. Modify Pip repository configurations to point to the Neuron repository.
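The Pip configuration itself is not shown in this hunk; a minimal sketch of what it might look like follows. The repository URL and package names are the standard Neuron ones used elsewhere in these docs, but treat the exact config-file location as an assumption.

```bash
# Point pip inside the virtualenv at the Neuron package repository.
tee $VIRTUAL_ENV/pip.conf > /dev/null <<EOF
[global]
extra-index-url = https://pip.repos.neuron.amazonaws.com
EOF

# Then install the Neuron compiler and the Neuron-optimized MXNet.
pip install neuron-cc mxnet-neuron
```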
- ### Inference Instance: Install MXNet-Neuron and Neuron-Runtime
- 1. Same as above to install MXNet-Neuron
- 2. To install Runtime, see [Getting started: Installing and Configuring Neuron-RTD](./../neuron-runtime/nrt_start.md).
+ ## Step 3: Compile on Compilation Server
- ## Step 3: Run Example
+ Model must be compiled to Inferentia target before it can run on Inferentia.
- 1. Create a file `compile_resnet50.py` with the content below and run it using `python compile_resnet50.py`. Compilation will take a few minutes on c5.4xlarge. At the end of compilation, the files `resnet-50_compiled-0000.params` and `resnet-50_compiled-symbol.json` will be created in local directory.
+ 3.1. Create a file `compile_resnet50.py` with the content below and run it using `python compile_resnet50.py`. Compilation will take a few minutes on c5.4xlarge. At the end of compilation, the files `resnet-50_compiled-0000.params` and `resnet-50_compiled-symbol.json` will be created in the local directory.
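The file content is elided from this hunk. A minimal sketch of what `compile_resnet50.py` might look like, assuming the `mx.contrib.neuron.compile` API provided by mxnet-neuron and a pretrained ResNet-50 checkpoint from the MXNet model zoo (the download URL is an assumption):

```python
import mxnet as mx

# Fetch a pretrained ResNet-50 checkpoint (assumed source: MXNet model zoo).
path = 'http://data.mxnet.io/models/imagenet/'
mx.test_utils.download(path + 'resnet/50-layers/resnet-50-0000.params')
mx.test_utils.download(path + 'resnet/50-layers/resnet-50-symbol.json')
sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)

# Compile for Inferentia; mx.contrib.neuron.compile is assumed to be provided
# by the mxnet-neuron package installed in Step 2.
inputs = {'data': mx.nd.ones([1, 3, 224, 224], name='data', dtype='float32')}
sym, args, aux = mx.contrib.neuron.compile(sym, args, aux, inputs)

# Writes resnet-50_compiled-0000.params and resnet-50_compiled-symbol.json
# to the local directory.
mx.model.save_checkpoint('resnet-50_compiled', 0, sym, args, aux)
```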
docs/mxnet-neuron/tutorial-model-serving.md (+6, -4)
@@ -2,9 +2,9 @@
This Neuron MXNet Model Serving (MMS) example is adapted from the MXNet vision service example which uses pretrained squeezenet to perform image classification: https://github.com/awslabs/mxnet-model-server/tree/master/examples/mxnet_vision.
- Before starting this example, please ensure that Neuron-optimized MXNet version mxnet-neuron is installed (see [MXNet Tutorial](./tutorial-compile-infer.md)) and Neuron RTD is running with default settings (see [Neuron Runtime getting started](./../neuron-runtime/nrt_start.md)).
+ Before starting this example, please ensure that Neuron-optimized MXNet version mxnet-neuron is installed along with Neuron Compiler (see [MXNet Tutorial](./tutorial-compile-infer.md)) and Neuron RTD is running with default settings (see [Neuron Runtime getting started](./../neuron-runtime/nrt_start.md)).
- If using DLAMI and aws_neuron_mxnet_p36 environment, you can skip the installation part in the first step below.
+ If using DLAMI, you can activate the environment aws_neuron_mxnet_p36 and skip the installation part in the first step below.

1. First, install Java runtime and mxnet-model-server:
@@ -91,14 +91,14 @@ Also, comment out unnecessary data copy for model_input in `mxnet_model_service.
#model_input = [item.as_in_context(self.mxnet_ctx) for item in model_input]
- 7. Start MXNet Model Server (MMS) and load model using RESTful API. The number of workers should be less than or equal number of NeuronCores divided by the number of NeuronCores required by model (<linktoAPI>). Please ensure that Neuron RTD is running with default settings (see Getting Started guide):
+ 7. Start MXNet Model Server (MMS) and load model using RESTful API. Please ensure that Neuron RTD is running with default settings (see [Neuron Runtime getting started](./../neuron-runtime/nrt_start.md)):
```bash
cd ~/mxnet-model-server/
@@ -108,6 +108,8 @@ curl -v -X POST "http://localhost:8081/models?initial_workers=1&max_workers=1&sy
sleep 10 # allow sufficient time to load model
```
+ Each worker requires a NeuronCore Group that can accommodate the compiled model. Additional workers can be added by increasing the max_workers configuration as long as there are enough NeuronCores available. Use `neuron-cli list-ncg` to see the NeuronCore Groups being created.
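As a rough illustration of scaling workers, a hedged sketch follows; the model-archive name is a placeholder and the registration parameters are carried over from the call shown earlier, not something this diff specifies.

```bash
# Register the model with two workers instead of one; this only succeeds if
# two NeuronCore Groups of the required size are available on the Inferentia.
# (Archive name is a placeholder for the one built earlier in this example.)
curl -v -X POST "http://localhost:8081/models?initial_workers=2&max_workers=2&synchronous=true&url=squeezenet_v1.1_compiled.mar"

# Inspect the NeuronCore Groups that Neuron-RTD created for the workers.
neuron-cli list-ncg
```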
docs/mxnet-neuron/tutorial-neuroncore-groups.md (+14, -2)
@@ -1,6 +1,8 @@
# Tutorial: MXNet Configurations for NeuronCore Groups
- To further subdivide the pool of NeuronCores controlled by a Neuron-RTD, specify the NeuronCore Groups within that pool using the environment variable `NEURONCORE_GROUP_SIZES` set to a list of group sizes. The consecutive NeuronCore groups will be created by Neuron-RTD and be available for use to map the models.
+ A NeuronCore Group is a set of NeuronCores that are used to load and run compiled models. At any time, one model will be running in a NeuronCore Group. By changing to a different sized NeuronCore Group and then creating several of these NeuronCore Groups, a user may create independent and parallel models running in the Inferentia. Additionally, within a NeuronCore Group, loaded models can be dynamically started and stopped, allowing for dynamic context switching from one model to another.
+ To explicitly specify the NeuronCore Groups, set the environment variable `NEURONCORE_GROUP_SIZES` to a list of group sizes. The consecutive NeuronCore Groups will be created by Neuron-RTD and be available for the user to map models onto.
Note that to map a model to a group, the model must be compiled to fit within the group size. To limit the number of NeuronCores during compilation, use the compiler_args dictionary with the field `--num-neuroncores` set to the group size:
+ Before starting this example, please ensure that Neuron-optimized MXNet version mxnet-neuron is installed along with Neuron Compiler (see [MXNet Tutorial](./tutorial-compile-infer.md)) and Neuron RTD is running with default settings (see [Neuron Runtime getting started](./../neuron-runtime/nrt_start.md)).
+ ## Compile Model
+ Model must be compiled to Inferentia target before it can run on Inferentia.

Create compile_resnet50.py with `--num-neuroncores` set to 2 and run it. The files `resnet-50_compiled-0000.params` and `resnet-50_compiled-symbol.json` will be created in the local directory:
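The listing itself is elided from this hunk. A hedged sketch of the relevant compile call, reusing the compile_resnet50.py sketch from the compile tutorial above; how the flags are handed to `mx.contrib.neuron.compile` is an assumption.

```python
import mxnet as mx

# Load the checkpoint downloaded in the earlier compile sketch and define the input.
sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)
inputs = {'data': mx.nd.ones([1, 3, 224, 224], name='data', dtype='float32')}

# Restrict the compiled model to a group of 2 NeuronCores. Passing the flags
# as **compile_args is an assumption; check the mxnet-neuron docs for the exact form.
compile_args = {'--num-neuroncores': 2}
sym, args, aux = mx.contrib.neuron.compile(sym, args, aux, inputs, **compile_args)
mx.model.save_checkpoint('resnet-50_compiled', 0, sym, args, aux)
```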
During inference, to subdivide the pool of one Inferentia into groups of 1, 2, and 1 NeuronCores, specify `NEURONCORE_GROUP_SIZES` as follows:

```bash
NEURONCORE_GROUP_SIZES='[1,2,1]' <launch process>
```

- Within the framework, the model can be mapped to group using `ctx=mx.neuron(N)` context where N is the group index within the `NEURONCORE_GROUP_SIZES` list. Create infer_resnet50.py with the following content:
+ Within the framework, the model can be mapped to a group using the `ctx=mx.neuron(N)` context, where N is the group index within the `NEURONCORE_GROUP_SIZES` list.
+ Create infer_resnet50.py with the following content:
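The listing is elided from this hunk. A minimal sketch of what infer_resnet50.py might contain, using the MXNet Module API and a placeholder input; the tutorial's actual file may differ.

```python
import mxnet as mx

# Load the model compiled with --num-neuroncores set to 2.
sym, args, aux = mx.model.load_checkpoint('resnet-50_compiled', 0)

# Map the model onto NeuronCore Group index 1, i.e. the 2-core group in
# NEURONCORE_GROUP_SIZES='[1,2,1]'.
mod = mx.mod.Module(symbol=sym, context=mx.neuron(1), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(args, aux, allow_missing=True)

# Placeholder input; a real image would be resized and normalized to this shape.
img = mx.nd.ones([1, 3, 224, 224], dtype='float32')
mod.forward(mx.io.DataBatch([img]), is_train=False)
prob = mod.get_outputs()[0].asnumpy()
print(prob.argmax())
```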
docs/tensorflow-neuron/tutorial-NeuronCore-Group.md (+4, -9)
@@ -1,10 +1,10 @@
# Tutorial: Configuring NeuronCore Groups
- A NeuronCore Group is a set of NeuronCores that are used to load and run compiled models. At any time, one model will be running in a NeuronCoreGroup. By changing to a different sized NeuonCoreGroup and then creating several of these NeuronCoreGroups, a user may create independent and parallel models running in the Inferentia. Additonally: within a NeuronCoreGroup, loaded models can be dynamically started and stopped, allowing for dynamic context switching from one model to another. By default, a single NeuronCoreGroup is created by Neuron Runtime that contains all 4 NeuronCores in an Inferentia. In this default case, when models are loaded to that default NeuronCoreGroup, only 1 will be running at any time. By configuring multiple NeuronCoreGroups as shown in this tutorial, multiple models may be made to run simultaenously.
+ A NeuronCore Group is a set of NeuronCores that are used to load and run compiled models. At any time, one model will be running in a NeuronCore Group. By changing to a different sized NeuronCore Group and then creating several of these NeuronCore Groups, a user may create independent and parallel models running in the Inferentia. Additionally, within a NeuronCore Group, loaded models can be dynamically started and stopped, allowing for dynamic context switching from one model to another. By default, a single NeuronCore Group is created by Neuron Runtime that contains all four NeuronCores in an Inferentia. In this default case, when models are loaded to that default NeuronCore Group, only one will be running at any time. By configuring multiple NeuronCore Groups as shown in this tutorial, multiple models may be made to run simultaneously.
- The NEURONCORE_GROUP_SIZES environment variable provides user control over this in Neuron-integrated TensorFlow. By default, TensorFlow-Neuron will choose the optimal utilization mode based on model metadata, but in some cases manually setting NEURONCORE_GROUP_SIZES can provide additional performance benefits.
+ The NEURONCORE_GROUP_SIZES environment variable provides user control over the grouping of NeuronCores in Neuron-integrated TensorFlow. By default, TensorFlow-Neuron will choose the optimal utilization mode based on model metadata, but in some cases manually setting NEURONCORE_GROUP_SIZES can provide additional performance benefits.
- In this tutorial you will learn how to enable a NeuronCore group running TensorFlow Resnet-50 model
+ In this tutorial you will learn how to enable a NeuronCore Group running the TensorFlow Resnet-50 model.

## Steps Overview:
@@ -61,7 +61,7 @@ python infer_resnet50.py
Scenario 1: allow tensorflow-neuron to utilize more than one Inferentia on inf1.6xlarge and inf1.24xlarge instance sizes.

- By default, one Python process with tensorflow-neuron or one tensorflow_model_server_neuron process tries to allocate all NeuronCores in an Inferentia from the Neuron Runtime Daemon. To utilize multiple Inferentias, the recommended parallelization mode is process-level parallelization, as it bypasses the overhead of Python and tensorflow_model_server_neuron resource handling as well as Python’s global interpreter lock (GIL). Note that TensorFlow’s session.run function actually does not hold the GIL.
+ By default, one Python process with tensorflow-neuron or one tensorflow_model_server_neuron process tries to allocate all NeuronCores in an Inferentia from the Neuron Runtime Daemon. To utilize multiple Inferentias, the recommended parallelization mode is process-level parallelization, as it bypasses the overhead of Python and tensorflow_model_server_neuron resource handling as well as Python’s global interpreter lock (GIL). Note that TensorFlow’s session.run function actually does not hold the GIL.

When there is a need to allocate more Inferentia compute into a single process, the following example shows the usage:
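The example itself is elided here (only its last line, `result_list = [predictor(feed) for feed in model_feed_dict_list]`, appears in the hunk header below). A hedged sketch of the idea; the NEURONCORE_GROUP_SIZES value format, the SavedModel path, and the input tensor name are all assumptions.

```python
import os

# Claim several NeuronCore Groups (and hence more than one Inferentia) from a
# single process; assumed to take effect only if set before tensorflow-neuron
# initializes the Neuron Runtime.
os.environ['NEURONCORE_GROUP_SIZES'] = '[4,4,4,4]'

import numpy as np
import tensorflow as tf

# Path to a compiled ResNet-50 SavedModel (placeholder name).
predictor = tf.contrib.predictor.from_saved_model('./resnet50_neuron')

# Placeholder feeds; a real application would supply preprocessed images.
model_feed_dict_list = [{'input': np.zeros([1, 224, 224, 3], dtype=np.float32)}
                        for _ in range(8)]
result_list = [predictor(feed) for feed in model_feed_dict_list]
```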
@@ -133,8 +133,3 @@ result_list = [predictor(feed) for feed in model_feed_dict_list]