
home | copyright ©2019, tjmenzie@ncsu.edu

syllabus | src | submit | chat

Clustering

(What to do when you have no labels for classification/regression.)

https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html:

BTW, once you get these clusters, you can use them to guide your search for labels

  • e.g. just ask for a few labels (maybe even just one) per cluster

And you are half way to an optimizer:

  • If you are in clusterX, and you don't like it
  • And you want to be in clusterY
  • Then just run a decision tree algorithm with two classes: this cluster and that one (a sketch follows this list).
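For example, here is a minimal sketch of that "contrast" idea, assuming scikit-learn and that clusterX and clusterY are lists of numeric rows (all the names here are placeholders, not code from these notes):

```python
# A minimal sketch: to learn how to move from clusterX to clusterY, label the
# rows of the two clusters 0 and 1 and fit a small decision tree; the branches
# that lead to class 1 read as "advice" on what to change.
from sklearn import tree

def contrast(clusterX, clusterY):
    "Learn a shallow decision tree that separates the two clusters."
    X = clusterX + clusterY
    y = [0] * len(clusterX) + [1] * len(clusterY)
    model = tree.DecisionTreeClassifier(max_depth=3).fit(X, y)
    return tree.export_text(model)   # human-readable branch tests
```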

Distance functions.

Many. e.g. the p-norm:

  • D = (∑ δ(x,y)^p)^(1/p), summed over all attributes
  • Euclidean: p=2

But what is δ ?

  • Symbols:
    • return 0 if x == y else 1
  • Numbers:
    • x - y
    • to make numbers fair with symbols, normalize x,y to 0..1 using (x-min)/(max-min)

What about missing values? (Both the δ and the missing-value rules are sketched in code after the list below.)

  • assume worst case
  • if both unknown, assume δ = 1
  • if one symbol missing, assume δ = 1
  • if one number missing:
    • let x,y be unknown, known
    • y = normalize(y)
    • x = 0 if y > 0.5 else 1
    • δ = (x-y)
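To make those rules concrete, here is a minimal sketch in Python (my own, not the course's reference code; the "?" marker for missing values and the per-column lo, hi, numeric metadata are assumptions):

```python
# δ for symbols, numbers, and missing values ("?"), plus the p-norm over rows.
def norm(x, lo, hi):
    "Map a number onto 0..1 using (x - min) / (max - min)."
    return (x - lo) / (hi - lo + 1e-32)

def delta(x, y, lo, hi, numeric):
    "Distance (0..1) between two attribute values."
    if x == "?" and y == "?":       # both unknown: assume the worst case
        return 1
    if not numeric:                 # symbols: 0 if equal, else 1
        return 1 if (x == "?" or y == "?") else (0 if x == y else 1)
    if x == "?":                    # one number missing: assume the unknown
        y = norm(y, lo, hi)         # value is as far away as possible
        x = 0 if y > 0.5 else 1
    elif y == "?":
        x = norm(x, lo, hi)
        y = 0 if x > 0.5 else 1
    else:
        x, y = norm(x, lo, hi), norm(y, lo, hi)
    return abs(x - y)

def dist(row1, row2, lo, hi, numeric, p=2):
    "p-norm over all attributes (p=2 gives the Euclidean distance)."
    total = sum(delta(x, y, lo[i], hi[i], numeric[i]) ** p
                for i, (x, y) in enumerate(zip(row1, row2)))
    return total ** (1 / p)
```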

Curse of Dimensionality

Distance gets weird for high dimensions

  • for a large-dimensional orange, most of the mass is in the skin
  • the volume of the space increases so fast that the available data becomes sparse
  • the amount of data needed to support a result grows exponentially with the number of dimensions

KD-trees

Simplicity itself (much faster than KD trees)

  • Find wildest dimension
    • e.g. Most variance,
    • e.g. Greatest max - min
  • Divide in two on that dimension
    • e.g. At median point
  • Recurse on both halves
  • Stop when you've found, say, sqrt(N) of the data (a sketch follows this list)
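A minimal sketch of that recursive splitter, assuming purely numeric rows (the names wildest and split are mine, not the course's):

```python
# Split on the column with the greatest max - min, at the median,
# and stop when a leaf holds about sqrt(N) rows.
import math

def wildest(rows):
    "Index of the column with the greatest max - min."
    cols = range(len(rows[0]))
    return max(cols, key=lambda c: max(r[c] for r in rows) - min(r[c] for r in rows))

def split(rows, stop=None):
    "Return the leaves: lists of roughly sqrt(N) similar rows."
    stop = stop or math.sqrt(len(rows))
    if len(rows) <= stop:
        return [rows]
    c = wildest(rows)
    rows = sorted(rows, key=lambda r: r[c])
    mid = len(rows) // 2
    return split(rows[:mid], stop) + split(rows[mid:], stop)
```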

Problem: curse of dimensionality

  • For large dimensional problems
  • After a few cuts, each cut is now too small to separate further
    • and the data in each cut is still wildly variable, since we've missed important dimensions

Random Projections

  • a technique used to reduce the dimensionality of a set of points
  • known for its power, simplicity, and low error rate when compared to other methods
  • if n randomly selected dimensions say you are similar to something else
    • then you are probably similar

  • Method one: Gaussian random projections
    • Matrix = rows * cols
    • Matrix A,B
    • A = m × n
    • B = n × p
    • C = A*B = m x p
    • So we can reduce n=1000 cols in matrix A to p=30 cols in C via a matrix B that is
      • 1000 rows by 30 cols
    • Initialize B by filling each cell with a random number pulled from a normal bell curve (see the first sketch after this list)
    • Only works for numbers
    • Requires all the data in RAM (bad for big data sets)
  • Method two: LSH: locality-sensitive hashing
    • Find a way to get an approximate position for a row, in a reduced space
      • e.g. repeat d times
        • Find two distant "pivots" (p,o)
          • Pick, say, 30 rows at random, then find, within them, the two most distant rows
        • For the i-th dimension:
          • this row's d.i = 0 if (row closer to p than o) else 1
      • repeat with d = log(sqrt(N)) + 1 pivot pairs to get d random projections
    • If you want not 0,1 on each dimension but a continuous number then:
      • given pivots (p,o) separated by distance "c"
      • a = dist(row,p)
      • b = dist(row,o)
      • this row's d.i = (a^2 + c^2 - b^2) / (2c)
        • Cosine rule
    • Can be done via random sampling over some of the data.
      • Better for bigger data
      • But less exact
      • still, darn useful (sketches of both methods follow this list)
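A minimal sketch of Method one, assuming numpy (the function name and the 1000→30 sizes are just the example from above):

```python
# Gaussian random projection: multiply the data A (m x n) by a random
# matrix B (n x p) whose entries come from a normal bell curve.
import numpy as np

def gaussian_projection(A, p=30, seed=1):
    "Reduce the n columns of A down to p columns."
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(A.shape[1], p))   # n x p random matrix
    return A @ B                           # m x p projected data

# e.g. C = gaussian_projection(np.random.rand(500, 1000), p=30)
```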
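And a minimal sketch of Method two, where dist is any two-argument distance function such as the p-norm sketched earlier (the names and the sample size of 30 follow the notes above, but are still my own choices):

```python
# Pivot-based projection: pick two distant pivots from a small random sample,
# then give each row a continuous coordinate via the cosine rule.
import math, random

def pivots(rows, dist, samples=30):
    "From a small random sample, return the two most distant rows."
    some = random.sample(rows, min(samples, len(rows)))
    return max(((p, o) for p in some for o in some),
               key=lambda po: dist(po[0], po[1]))

def projection(row, p, o, dist):
    "Coordinate of row on the line from p to o (cosine rule)."
    a, b, c = dist(row, p), dist(row, o), dist(p, o)
    return (a**2 + c**2 - b**2) / (2 * c + 1e-32)

def project(rows, dist, d=None):
    "Give each row d coordinates; d defaults to log(sqrt(N)) + 1."
    d = d or int(math.log(math.sqrt(len(rows))) + 1)
    pairs = [pivots(rows, dist) for _ in range(d)]
    return [[projection(r, p, o, dist) for p, o in pairs] for r in rows]
```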

BTW, now we can solve the KD tree dimensionality problem.

  • Using random projections:
    • To find the wildest dimension
    • At each level, do LSH 30 times and select the points that are most distant (*)

(*) beware outliers :

  • A safe thing might be to sort the pivots by their distance and take a pair at about 90% of the max distance (a sketch of this hybrid follows)
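A minimal sketch of that hybrid (the trim=0.9 outlier guard and all names here are my own illustration, not the course's code; dist is again any two-argument distance):

```python
# At each level: find two far-apart pivots (backing off to ~90% of the max
# separation to dodge outliers), project every row onto the pivot-to-pivot
# line, split at the median projection, then recurse.
import math, random

def proj(row, p, o, dist):
    "Cosine-rule coordinate of row on the line from pivot p to pivot o."
    a, b, c = dist(row, p), dist(row, o), dist(p, o)
    return (a*a + c*c - b*b) / (2*c + 1e-32)

def robust_pivots(rows, dist, samples=30, trim=0.9):
    "Two far-apart rows, at about 90% of the max separation seen in a sample."
    some = random.sample(rows, min(samples, len(rows)))
    pairs = sorted(((p, o) for p in some for o in some if p is not o),
                   key=lambda po: dist(po[0], po[1]))
    return pairs[int(trim * (len(pairs) - 1))]

def rp_split(rows, dist, stop=None):
    "Recursively halve the rows along the pivot-to-pivot projection."
    stop = stop or math.sqrt(len(rows))
    if len(rows) <= stop:
        return [rows]
    p, o = robust_pivots(rows, dist)
    rows = sorted(rows, key=lambda r: proj(r, p, o, dist))
    mid = len(rows) // 2
    return rp_split(rows[:mid], dist, stop) + rp_split(rows[mid:], dist, stop)
```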

Kmeans:

done

Mini-batch k-means

Done

  • For scaling to very large data sets, we make a grand and glorious hack to the grand and glorious K-means hack.
  • Mini-batch K-means reads the data in eras (called "batches") of size "b".
    • As with K-means, we label each arriving example with the id of its nearest centroid (and initially, we pick those centroids at random). By the way, this is analogous to the "E" step of EM
    • For the "M" step, we assume that the more examples a centroid has seen, the more we believe in it and the less we want to move it
    • "n" stores how often this centroid has been picked by new data
    • So in the "M" step, each item "pulls" its centroid's attributes "c" towards its own attributes "x" by an amount weighted by "n": c = (1-1/n)*c + x/n (see the sketch after this list).
    • Note that when "n" is large, "c" barely moves at all.
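A minimal sketch of that update rule (my own, not sklearn's MiniBatchKMeans; dist is any two-argument distance, and the batch count and size defaults are arbitrary):

```python
# Assign each item in a batch to its nearest centroid, then pull that centroid
# towards the item by 1/n, where n counts how often the centroid was picked.
import random

def nearest(x, centroids, dist):
    "Index of the centroid closest to x."
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))

def mini_batch_kmeans(data, k, dist, batches=100, b=256):
    centroids = [list(x) for x in random.sample(data, k)]   # random init
    n = [0] * k                                             # pick counts
    for _ in range(batches):
        batch = random.sample(data, min(b, len(data)))
        for x in batch:
            i = nearest(x, centroids, dist)                 # "E"-like step
            n[i] += 1                                       # "M"-like step:
            eta = 1 / n[i]                                  # big n => small move
            centroids[i] = [(1 - eta) * c + eta * xj
                            for c, xj in zip(centroids[i], x)]
    return centroids
```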

Exercise

A method for initialising the K-means clustering algorithm using kd-trees

  • What is the Forgy method? (section 1.1)
  • What is Bradley and Fayyad's method? (section 1.2)
  • What is your favorite other method?
  • What is the method of this paper?