
Commit b88646f

Authored Oct 6, 2021
Remove Boston housing dataset (aws#2959)
* Template - remove Boston, fix typos
* SM script mode - remove Boston mention
* KMS - switch Boston to California
* KMS - cite California
* Pipe BYO - remove Boston, use current region
* Sklearn end2end - remove Boston, update features
* 011 Ingest Data - remove Boston
* 02 Ingest Data - remove Boston
* 02 03 Ingest data - remove Boston housing, fix Redshift/Athena
* Code formatting
* Ingest Data - update index rst file
* Ingest data Redshift - use conda_python3 kernel
* Fix docker auth
* Handling KMS - Rename kms_key
1 parent cada716 commit b88646f


12 files changed: 868 additions, 282 deletions

 

‎advanced_functionality/handling_kms_encrypted_data/handling_kms_encrypted_data.ipynb

+26 −19
@@ -67,14 +67,15 @@
  "import numpy as np\n",
  "import re\n",
  "from sagemaker import get_execution_role\n",
+ "import sagemaker\n",
  "\n",
  "region = boto3.Session().region_name\n",
  "\n",
  "role = get_execution_role()\n",
  "\n",
- "kms_key_arn = \"<your-kms-key-arn>\"\n",
+ "kms_key = \"<your-kms-key-arn>\"\n",
  "\n",
- "bucket = \"<s3-bucket>\" # put your s3 bucket name here, and create s3 bucket\n",
+ "bucket = sagemaker.Session().default_bucket()\n",
  "prefix = \"sagemaker/DEMO-kms\"\n",
  "# customize to your bucket where you have stored the data\n",
  "bucket_path = \"s3://{}\".format(bucket)"
@@ -90,7 +91,11 @@
  "\n",
  "### Data ingestion\n",
  "\n",
- "We, first, read the dataset from an existing repository into memory. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as the one used below, reading into memory isn't onerous, though it would be for larger datasets."
+ "We, first, read the dataset from an existing repository into memory. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as the one used below, reading into memory isn't onerous, though it would be for larger datasets.\n",
+ "\n",
+ "This example uses the California Housing dataset, initially published in:\n",
+ "\n",
+ "> Pace, R. Kelley, and Ronald Barry. \"Sparse spatial autoregressions.\" Statistics & Probability Letters 33.3 (1997): 291-297."
  ]
  },
  {
@@ -99,16 +104,16 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from sklearn.datasets import load_boston\n",
+ "from sklearn.datasets import fetch_california_housing\n",
  "\n",
- "boston = load_boston()\n",
- "X = boston[\"data\"]\n",
- "y = boston[\"target\"]\n",
- "feature_names = boston[\"feature_names\"]\n",
+ "california = fetch_california_housing()\n",
+ "X = california[\"data\"]\n",
+ "y = california[\"target\"]\n",
+ "feature_names = california[\"feature_names\"]\n",
  "data = pd.DataFrame(X, columns=feature_names)\n",
  "target = pd.DataFrame(y, columns={\"MEDV\"})\n",
  "data[\"MEDV\"] = y\n",
- "local_file_name = \"boston.csv\"\n",
+ "local_file_name = \"california_housing.csv\"\n",
  "data.to_csv(local_file_name, header=False, index=False)"
  ]
  },
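For orientation on the dataset swap, a quick sanity check on what `fetch_california_housing` returns (shapes and feature names per scikit-learn; illustrative only, not a cell from the notebook):

```python
# Verify the California Housing dataset loaded above
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()
print(california["data"].shape)          # (20640, 8)
print(list(california["feature_names"]))
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
#  'Population', 'AveOccup', 'Latitude', 'Longitude']
print(california["target"].shape)        # (20640,) -- median house value, in units of $100,000
```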
@@ -140,7 +145,7 @@
  "outputs": [],
  "source": [
  "def write_file(X, y, fname, include_labels=True):\n",
- "    feature_names = boston[\"feature_names\"]\n",
+ "    feature_names = california[\"feature_names\"]\n",
  "    data = pd.DataFrame(X, columns=feature_names)\n",
  "    if include_labels:\n",
  "        data.insert(0, \"MEDV\", y)\n",
@@ -180,7 +185,7 @@
  "\n",
  "data_train = open(train_file, \"rb\")\n",
  "key_train = \"{}/train/{}\".format(prefix, train_file)\n",
- "kms_key_id = kms_key_arn.split(\":key/\")[1]\n",
+ "kms_key_id = kms_key.split(\":key/\")[1]\n",
  "\n",
  "print(\"Put object...\")\n",
  "s3.put_object(\n",
@@ -227,7 +232,7 @@
  "source": [
  "## Training the SageMaker XGBoost model\n",
  "\n",
- "Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker XGboost algorithm as an example to demonstrate model training. Note that nothing needs to be changed in the way you'd call the training algorithm. The only requirement for training to succeed is that the IAM role (`role`) used for S3 access has permissions to encrypt and decrypt data with the KMS key (`kms_key_arn`). You can set these permissions using the instructions [here](http://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users). If the permissions aren't set, you'll get the `Data download failed` error. Specify a `VolumeKmsKeyId` in the training job parameters to have the volume attached to the ML compute instance encrypted using key provided."
+ "Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker XGboost algorithm as an example to demonstrate model training. Note that nothing needs to be changed in the way you'd call the training algorithm. The only requirement for training to succeed is that the IAM role (`role`) used for S3 access has permissions to encrypt and decrypt data with the KMS key (`kms_key`). You can set these permissions using the instructions [here](http://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users). If the permissions aren't set, you'll get the `Data download failed` error. Specify a `VolumeKmsKeyId` in the training job parameters to have the volume attached to the ML compute instance encrypted using key provided."
  ]
  },
  {
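One way to grant the role those encrypt/decrypt permissions is a key-policy statement like the one sketched below (not part of the notebook; modifying a key policy requires `kms:GetKeyPolicy`/`kms:PutKeyPolicy` rights, and you should adapt the statement to your own security requirements):

```python
# Sketch: add a key-users statement for the execution role to the key's default policy
import json
import boto3

kms = boto3.client("kms")
key_id = kms_key.split(":key/")[1]

statement = {
    "Sid": "AllowSageMakerRoleToUseKey",
    "Effect": "Allow",
    "Principal": {"AWS": role},   # execution role ARN from the setup cell
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey",
    ],
    "Resource": "*",              # in a key policy, "*" means this key
}

policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
policy["Statement"].append(statement)
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))
```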
@@ -236,9 +241,11 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
+ "from sagemaker import image_uris\n",
  "\n",
- "container = get_image_uri(boto3.Session().region_name, \"xgboost\")"
+ "container = image_uris.retrieve(\n",
+ "    region=boto3.Session().region_name, framework=\"xgboost\", version=\"latest\"\n",
+ ")"
  ]
  },
  {
@@ -262,7 +269,7 @@
  " \"InstanceCount\": 1,\n",
  " \"InstanceType\": \"ml.m4.4xlarge\",\n",
  " \"VolumeSizeInGB\": 5,\n",
- " \"VolumeKmsKeyId\": kms_key_arn,\n",
+ " \"VolumeKmsKeyId\": kms_key,\n",
  " },\n",
  " \"TrainingJobName\": job_name,\n",
  " \"HyperParameters\": {\n",
@@ -379,7 +386,7 @@
  "print(endpoint_config_name)\n",
  "create_endpoint_config_response = client.create_endpoint_config(\n",
  " EndpointConfigName=endpoint_config_name,\n",
- " KmsKeyId=kms_key_arn,\n",
+ " KmsKeyId=kms_key,\n",
  " ProductionVariants=[\n",
  " {\n",
  " \"InstanceType\": \"ml.m4.xlarge\",\n",
@@ -509,7 +516,7 @@
  "metadata": {},
  "source": [
  "## Run batch prediction using batch transform\n",
- "Create a transform job to do batch prediction using the trained model. Similar to the training section above, the execution role assumed by this notebook must have permissions to encrypt and decrypt data with the KMS key (`kms_key_arn`) used for S3 server-side encryption. Similar to training, specify a `VolumeKmsKeyId` so that the volume attached to the transform instance is encrypted using the key provided."
+ "Create a transform job to do batch prediction using the trained model. Similar to the training section above, the execution role assumed by this notebook must have permissions to encrypt and decrypt data with the KMS key (`kms_key`) used for S3 server-side encryption. Similar to training, specify a `VolumeKmsKeyId` so that the volume attached to the transform instance is encrypted using the key provided."
  ]
  },
  {
@@ -542,7 +549,7 @@
  " \"TransformResources\": {\n",
  " \"InstanceCount\": 1,\n",
  " \"InstanceType\": \"ml.c4.xlarge\",\n",
- " \"VolumeKmsKeyId\": kms_key_arn,\n",
+ " \"VolumeKmsKeyId\": kms_key,\n",
  " },\n",
  "}\n",
  "\n",
@@ -605,7 +612,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.2"
+ "version": "3.6.13"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
  },

‎advanced_functionality/pipe_bring_your_own/Dockerfile

+3 −1
@@ -1,5 +1,7 @@
+ ARG region
+
  # SageMaker PyTorch image
- FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04
+ FROM 763104351884.dkr.ecr.${region}.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04

  ENV PATH="/opt/ml/code:${PATH}"

‎advanced_functionality/pipe_bring_your_own/pipe_bring_your_own.ipynb

+13 −8
@@ -46,10 +46,14 @@
  "\n",
  "Let's start by specifying:\n",
  "\n",
- "- S3 URIs `s3_training_input` and `s3_model_output` that you want to use for training input and model data respectively. These should be within the same region as the Notebook Instance, training, and hosting. Since the \"algorithm\" you're building here doesn't really have any specific data-format, feel free to point `s3_training_input` to any s3 dataset you have, the bigger the dataset the better to test the raw IO throughput performance. For this example, the Boston Housing dataset will be copied over to your s3 bucket.\n",
+ "- S3 URIs `s3_training_input` and `s3_model_output` that you want to use for training input and model data respectively. These should be within the same region as the Notebook Instance, training, and hosting. Since the \"algorithm\" you're building here doesn't really have any specific data-format, feel free to point `s3_training_input` to any s3 dataset you have, the bigger the dataset the better to test the raw IO throughput performance. For this example, the California Housing dataset will be copied over to your s3 bucket.\n",
  "- The `training_instance_type` to use for training. More powerful instance types have more CPU and bandwidth which would result in higher throughput.\n",
  "- The IAM role arn used to give training access to your data.\n",
  "\n",
+ "The California Housing dataset was originally published in:\n",
+ "\n",
+ "> Pace, R. Kelley, and Ronald Barry. \\\"Sparse spatial autoregressions.\\\" Statistics & Probability Letters 33.3 (1997): 291-297.\n",
+ "\n",
  "### Permissions\n",
  "\n",
  "Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions. This is because you'll be creating a new repository in Amazon ECR. The easiest way to add these permissions is simply to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this, the new permissions will be available immediately."
@@ -67,8 +71,7 @@
  "import pandas as pd\n",
  "import sagemaker\n",
  "\n",
- "# to load the boston housing dataset\n",
- "from sklearn.datasets import *\n",
+ "from sklearn.datasets import fetch_california_housing\n",
  "\n",
  "# Get SageMaker session & default S3 bucket\n",
  "role = sagemaker.get_execution_role()\n",
@@ -110,9 +113,9 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "filename = \"boston_house.csv\"\n",
+ "filename = \"california_housing.csv\"\n",
  "# Download files from sklearns.datasets\n",
- "tabular_data = load_boston()\n",
+ "tabular_data = fetch_california_housing()\n",
  "tabular_data_full = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)\n",
  "tabular_data_full[\"target\"] = pd.DataFrame(tabular_data.target)\n",
  "tabular_data_full.to_csv(filename, index=False)"
@@ -198,7 +201,9 @@
  "outputs": [],
  "source": [
  "%%sh\n",
- "aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com"
+ "REGION=$(aws configure get region)\n",
+ "account=$(aws sts get-caller-identity --query Account --output text)\n",
+ "aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${REGION}.amazonaws.com"
  ]
  },
  {
@@ -215,7 +220,7 @@
  "outputs": [],
  "source": [
  "%%sh\n",
- "docker build -t pipe_bring_your_own ."
+ "docker build -t pipe_bring_your_own . --build-arg region=$(aws configure get region)"
  ]
  },
  {
@@ -319,7 +324,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.10"
+ "version": "3.6.13"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
  },
