Add documentation

jotok · jotok · commit 12d725cee5ba · 2017-09-06T17:21:38.000-07:00
diff --git a/README.md b/README.md
@@ -0,0 +1 @@
+This repository contains a simple time-space clustering model implemented as a [Jupyter](http://jupyter.org/) notebook. See the notebook for more information.
diff --git a/simple-time-space-clustering.ipynb b/simple-time-space-clustering.ipynb
@@ -1,5 +1,24 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this notebook we propose a method to efficiently cluster time-space data using the existing DBSCAN implementation in [scikit-learn](http://scikit-learn.org/). Before taking this approach, we considered two others. The first was to use an [existing implementation](https://github.com/eubr-bigsea/py-st-dbscan) of [ST-DBSCAN](http://www.sciencedirect.com/science/article/pii/S0169023X06000218). This implementation uses [COMPSs](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/) for distributed computing, which may be nontrivial to deploy quickly. The second was to implement a custom time-space metric in Python and pass it to the scikit-learn DBSCAN implementation. However, we were concerned that a Python-level metric would not scale to large data sets as well as the metrics predefined (and optimized) in scikit-learn.\n",
+    "\n",
+    "Our approach is to precompute a sparse spatial distance matrix where a given pairwise distance is only included if the time-distance is smaller than a given threshold. That is, for pairs of data points, we do the following:\n",
+    "\n",
+    "1. If the time distance is smaller than a given threshold value, and\n",
+    "2. If the space distance is smaller than a second given threshold, then\n",
+    "3. Include the space distance in the sparse pairwise distance matrix\n",
+    "\n",
+    "One way to think of this approach is that we are defining a time-space metric in which the time component only takes two values: smaller-than-epsilon and infinity.\n",
+    "\n",
+    "If the input data set is sorted by timestamp, then computing this distance matrix can be very efficient because each point only needs to be compared with points that are near it in the sorted data, i.e., those inside the current time window. In fact, it may be possible to modify this approach to apply and record a custom time-space metric (i.e., to try the second approach, which we rejected because we expected it to scale badly when supplied directly to the DBSCAN implementation).\n",
+    "\n",
+    "Compared to ST-DBSCAN, this approach incorporates the time-distance in a very crude way and we expect the results will reflect that."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -30,7 +49,7 @@
    },
    "outputs": [],
    "source": [
-    "# In this cell we define the function used to compute the sparse distance matrix\n",
+    "# Define the function used to compute the sparse distance matrix\n",
     "\n",
     "\"\"\"Return the index of the first element x in a collection such that pred(x) is True.\"\"\"\n",
     "def find(coll, pred):\n",
@@ -69,11 +88,12 @@
     "        space_window = collections.deque() # [lat, long]\n",
     "        \n",
     "        for row in rr:\n",
-    "            lat, long, current_ts = row['lat'], row['lon'], row['date']\n",
+    "            lat, long, current_ts = row['lat'], row['lon'], row['date'] \n",
+    "            current_coords = [lat, long]\n",
     "            current_ts = datetime.strptime(current_ts, date_format)\n",
     "            \n",
     "            try:\n",
-    "                number_to_drop = find(time_window, lambda ts: current_ts - ts < time_threshold)\n",
+    "                number_to_drop = find(time_window, lambda ts: current_ts - ts <= time_threshold)\n",
     "                left_index += number_to_drop\n",
     "                dropn(time_window, number_to_drop)\n",
     "                dropn(space_window, number_to_drop)\n",
@@ -82,12 +102,10 @@
     "                time_window.clear()\n",
     "                space_window.clear()\n",
     "            \n",
-    "            current_coords = [lat, long]\n",
-    "            \n",
     "            if len(space_window) > 0:\n",
     "                distances = dist.pairwise(space_window, [current_coords])\n",
     "                for i, d in enumerate(np.nditer(distances)):\n",
-    "                    if d < space_threshold:\n",
+    "                    if d <= space_threshold:\n",
     "                        wr.writerow({\n",
     "                                \"x\": left_index + i,\n",
     "                                \"y\": right_index,\n",
@@ -120,7 +138,7 @@
    "source": [
     "# Set the input and output locations\n",
     "\n",
-    "working_directory = \"/Users/joshtok/Code/geo-clustering\"\n",
+    "working_directory = \"/path/to/data\"\n",
     "infile = os.path.join(working_directory, \"summer-travel-gps-full.csv\")\n",
     "outfile = os.path.join(working_directory, \"sparse_distance_matrix.csv\")\n",
     "\n",
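
The windowing idea described in the notebook's introduction can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the notebook's actual code: `haversine_km`, `sparse_pairs`, the kilometre unit, and the sample points are all assumptions made for the example, and the notebook itself computes distances with a `dist.pairwise` call and writes the results to CSV.

```python
import collections
import math
from datetime import datetime, timedelta


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def sparse_pairs(points, time_threshold, space_threshold_km):
    """Yield (i, j, distance) for i < j when both thresholds are satisfied.

    Hypothetical helper: `points` must be sorted by timestamp, and each
    item is a (timestamp, lat, lon) tuple.
    """
    # Points still within the time threshold: (index, timestamp, (lat, lon)).
    window = collections.deque()
    for j, (ts, lat, lon) in enumerate(points):
        # Drop points that are too old to pair with the current one.
        while window and ts - window[0][1] > time_threshold:
            window.popleft()
        # Only the points left in the window need a distance computation.
        for i, _, coords in window:
            d = haversine_km(coords, (lat, lon))
            if d <= space_threshold_km:
                yield i, j, d
        window.append((j, ts, (lat, lon)))


# Made-up sample data: two nearby points, one far away, one much later.
t0 = datetime(2017, 9, 6, 12, 0)
points = [
    (t0, 48.8566, 2.3522),
    (t0 + timedelta(minutes=5), 48.8570, 2.3530),
    (t0 + timedelta(minutes=10), 51.5074, -0.1278),
    (t0 + timedelta(hours=3), 48.8566, 2.3522),
]
pairs = list(sparse_pairs(points, timedelta(minutes=30), 1.0))
```

Only the first two points end up paired: the third is inside the time window but too far away, and the fourth is at the same location as the first but outside the time window. The resulting triples could then be loaded into a sparse matrix and handed to scikit-learn's DBSCAN with `metric="precomputed"`.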
