AutoGluon Marketplace: changed the data set (aws#2935)

yoheigon · web-flow · commit d58b3fe45a28 · 2021-10-01T16:15:28.000-05:00
* changed the data set

* changed the download method
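
Because this change renames the label column from `y` to `class`, a quick header check after the download can catch a mismatch before training starts. The following is a minimal sketch and not part of the notebook: `has_label_column` is a hypothetical helper, and it relies only on the standard-library `csv` module rather than on anything the notebook imports.

```python
import csv


def has_label_column(lines, label="class"):
    """Return True if the CSV header (first row) contains `label`.

    `lines` is any iterable of text lines, such as an open file handle.
    Hypothetical helper for illustration only; not part of the notebook.
    """
    # Read only the first row; an empty input yields an empty header.
    header = next(csv.reader(lines), [])
    return label in header
```

One way to use it is to open `train.csv` and `test.csv` with `open(path, newline="")` after the download step and pass each file handle to `has_label_column`, failing fast if either returns `False`.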
diff --git a/aws_marketplace/using_algorithms/autogluon/autogluon_tabular_marketplace.ipynb b/aws_marketplace/using_algorithms/autogluon/autogluon_tabular_marketplace.ipynb
@@ -105,26 +105,16 @@
     "algorithm_arn = AlgorithmArnProvider.get_algorithm_arn(region)"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import subprocess\n",
-    "\n",
-    "subprocess.run(\"apt-get update -y\", shell=True)\n",
-    "subprocess.run(\"apt install unzip\", shell=True)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Step 3: Get the data\n",
     "\n",
-    "In this example we'll use the direct-marketing dataset to build a binary classification model that predicts whether customers will accept or decline a marketing offer.  \n",
-    "First we'll download the data and split it into train and test sets. AutoGluon does not require a separate validation set (it uses bagged k-fold cross-validation)."
+    "In this example we'll use the [UCI Machine Learning Repository: Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult) [1] to build a binary classification model that predicts whether a person's annual income exceeds 50K USD.  \n",
+    "First we'll download the data, which is already split into train and test sets. AutoGluon does not require a separate validation set (it uses bagged k-fold cross-validation).\n",
+    "\n",
+    "[1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science."
    ]
   },
   {
@@ -133,23 +123,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Download and unzip the data\n",
-    "subprocess.run(\n",
-    "    f\"aws s3 cp --region {region} s3://sagemaker-sample-data-{region}/autopilot/direct_marketing/bank-additional.zip .\",\n",
-    "    shell=True,\n",
-    ")\n",
-    "subprocess.run(\"unzip -qq -o bank-additional.zip\", shell=True)\n",
-    "subprocess.run(\"rm bank-additional.zip\", shell=True)\n",
-    "\n",
-    "local_data_path = \"./bank-additional/bank-additional-full.csv\"\n",
-    "data = pd.read_csv(local_data_path)\n",
+    "# Download the data\n",
+    "s3 = boto3.client(\"s3\")\n",
+    "s3.download_file(\"autogluon\", \"datasets/Inc/train.csv\", \"train.csv\")\n",
+    "s3.download_file(\"autogluon\", \"datasets/Inc/test.csv\", \"test.csv\")\n",
     "\n",
-    "# Split train/test data\n",
+    "# Load the pre-split train/test data\n",
-    "train = data.sample(frac=0.7, random_state=42)\n",
-    "test = data.drop(train.index)\n",
+    "train = pd.read_csv(\"train.csv\")\n",
+    "test = pd.read_csv(\"test.csv\")\n",
     "\n",
     "# Split test X/y\n",
-    "label = \"y\"\n",
+    "label = \"class\"\n",
     "y_test = test[label]\n",
     "X_test = test.drop(columns=[label])"
    ]
@@ -220,7 +204,7 @@
    "outputs": [],
    "source": [
     "# Define required label and optional additional parameters\n",
-    "init_args = {\"label\": \"y\"}\n",
+    "init_args = {\"label\": \"class\"}\n",
     "\n",
     "# Define additional parameters\n",
     "fit_args = {\n",
@@ -434,7 +418,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.10"
+   "version": "3.6.13"
   }
  },
  "nbformat": 4,