
home | copyright ©2019, tjmenzie@ncsu.edu

syllabus | src | submit | chat

Clustering

(What to do when you have no labels for classification/regression.)

https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html:

BTW, once you get these clusters, you can use them to guide your search for labels

  • e.g. just ask for a few labels (maybe even just one) per cluster

And you are half way to an optimizer:

  • If you are in clusterX, and you don't like it
  • And you want to be in clusterY
  • Then just run a decision tree algorithm with two classes: this cluster and that one (a sketch follows this list).
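For example, here is a minimal sketch of that "contrast" idea, assuming scikit-learn and that clusterX and clusterY are lists of numeric rows (all the names here are placeholders, not code from these notes):

```python
# A minimal sketch: to learn how to move from clusterX to clusterY, label the
# rows of the two clusters 0 and 1 and fit a small decision tree; the branches
# that lead to class 1 read as "advice" on what to change.
from sklearn import tree

def contrast(clusterX, clusterY):
    "Learn a shallow decision tree that separates the two clusters."
    X = clusterX + clusterY
    y = [0] * len(clusterX) + [1] * len(clusterY)
    model = tree.DecisionTreeClassifier(max_depth=3).fit(X, y)
    return tree.export_text(model)   # human-readable branch tests
```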

Distance functions.

Many. e.g. the p-norm:

  • D = (∑ δ(x,y)^p)^(1/p), summed over all attributes
  • Euclidean: p=2

But what is δ ?

  • Symbols:
    • return 0 if x == y else 1
  • Numbers:
    • x - y
    • to make numbers fair with symbols, normalize x,y to 0..1 using (x-min)/(max-min)

What about missing values? (Both the δ and the missing-value rules are sketched in code after the list below.)

  • assume worst case
  • if both unknown, assume δ = 1
  • if one symbol missing, assume δ = 1
  • if one number missing:
    • let x,y be unknown, known
    • y = normalize(y)
    • x = 0 if y > 0.5 else 1
    • δ = (x-y)
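To make those rules concrete, here is a minimal sketch in Python (my own, not the course's reference code; the "?" marker for missing values and the per-column lo, hi, numeric metadata are assumptions):

```python
# δ for symbols, numbers, and missing values ("?"), plus the p-norm over rows.
def norm(x, lo, hi):
    "Map a number onto 0..1 using (x - min) / (max - min)."
    return (x - lo) / (hi - lo + 1e-32)

def delta(x, y, lo, hi, numeric):
    "Distance (0..1) between two attribute values."
    if x == "?" and y == "?":       # both unknown: assume the worst case
        return 1
    if not numeric:                 # symbols: 0 if equal, else 1
        return 1 if (x == "?" or y == "?") else (0 if x == y else 1)
    if x == "?":                    # one number missing: assume the unknown
        y = norm(y, lo, hi)         # value is as far away as possible
        x = 0 if y > 0.5 else 1
    elif y == "?":
        x = norm(x, lo, hi)
        y = 0 if x > 0.5 else 1
    else:
        x, y = norm(x, lo, hi), norm(y, lo, hi)
    return abs(x - y)

def dist(row1, row2, lo, hi, numeric, p=2):
    "p-norm over all attributes (p=2 gives the Euclidean distance)."
    total = sum(delta(x, y, lo[i], hi[i], numeric[i]) ** p
                for i, (x, y) in enumerate(zip(row1, row2)))
    return total ** (1 / p)
```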

Curse of Dimensionality

Distance gets weird for high dimensions

  • for a large-dimensional orange, most of the mass is in the skin
  • the volume of the space increases so fast that the available data becomes sparse
  • the amount of data needed to support a result grows exponentially with the number of dimensions

KD-trees

Simplicity itself (much faster than KD trees)

  • Find wildest dimension
    • e.g. Most variance,
    • e.g. Greatest max - min
  • Divide in two on that dimension
    • e.g. At median point
  • Recurse on both halves
  • Stop when you've found, say, sqrt(N) of the data (a sketch follows this list)
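A minimal sketch of that recursive splitter, assuming purely numeric rows (the names wildest and split are mine, not the course's):

```python
# Split on the column with the greatest max - min, at the median,
# and stop when a leaf holds about sqrt(N) rows.
import math

def wildest(rows):
    "Index of the column with the greatest max - min."
    cols = range(len(rows[0]))
    return max(cols, key=lambda c: max(r[c] for r in rows) - min(r[c] for r in rows))

def split(rows, stop=None):
    "Return the leaves: lists of roughly sqrt(N) similar rows."
    stop = stop or math.sqrt(len(rows))
    if len(rows) <= stop:
        return [rows]
    c = wildest(rows)
    rows = sorted(rows, key=lambda r: r[c])
    mid = len(rows) // 2
    return split(rows[:mid], stop) + split(rows[mid:], stop)
```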

Problem: curse of dimensionality

  • For large dimensional problems
  • After a few cuts, each cut is now too small to separate further
    • and the data in each cut is still wildly variable, since we've missed important dimensions

Random Projections

  • a technique used to reduce the dimensionality of a set of points
  • known for its power, simplicity, and low error rate when compared to other methods
  • if n randomly selected dimensions say you are similar to something else
    • then you are probably similar

  • Method one: Gaussian random projections
    • Matrix = rows * cols
    • Matrix A,B
    • A = m × n
    • B = n × p
    • C = A*B = m x p
    • So we can reduce n=1000 cols in matrix A to p=30 cols in C via a matrix B that is
      • 1000 rows by 30 cols
    • Initialize B by filling each cell with a random number pulled from a normal bell curve (see the first sketch after this list)
    • Only works for numbers
    • Requires all the data in RAM (bad for big data sets)
  • Method two: LSH: locality-sensitive hashing
    • Find a way to get an approximate position for a row, in a reduced space
      • e.g. repeat d times
        • Find two distant "pivots" (p,o)
          • Pick, say, 30 rows at random, then find, within them, the two most distant rows
        • For the i-th dimension:
          • this row's d.i = 0 if (row closer to p than o) else 1
      • repeat with d = log(sqrt(N)) + 1 pivot pairs to get d random projections
    • If you want not 0,1 on each dimension but a continuous number then:
      • given pivots (p,o) separated by distance "c"
      • a = dist(row,p)
      • b = dist(row,o)
      • this row's d.i = (a^2 + c^2 - b^2) / (2c)
        • Cosine rule
    • Can be done via random sampling over some of the data.
      • Better for bigger data
      • But less exact
      • still, darn useful (sketches of both methods follow this list)
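A minimal sketch of Method one, assuming numpy (the function name and the 1000→30 sizes are just the example from above):

```python
# Gaussian random projection: multiply the data A (m x n) by a random
# matrix B (n x p) whose entries come from a normal bell curve.
import numpy as np

def gaussian_projection(A, p=30, seed=1):
    "Reduce the n columns of A down to p columns."
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(A.shape[1], p))   # n x p random matrix
    return A @ B                           # m x p projected data

# e.g. C = gaussian_projection(np.random.rand(500, 1000), p=30)
```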
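And a minimal sketch of Method two, where dist is any two-argument distance function such as the p-norm sketched earlier (the names and the sample size of 30 follow the notes above, but are still my own choices):

```python
# Pivot-based projection: pick two distant pivots from a small random sample,
# then give each row a continuous coordinate via the cosine rule.
import math, random

def pivots(rows, dist, samples=30):
    "From a small random sample, return the two most distant rows."
    some = random.sample(rows, min(samples, len(rows)))
    return max(((p, o) for p in some for o in some),
               key=lambda po: dist(po[0], po[1]))

def projection(row, p, o, dist):
    "Coordinate of row on the line from p to o (cosine rule)."
    a, b, c = dist(row, p), dist(row, o), dist(p, o)
    return (a**2 + c**2 - b**2) / (2 * c + 1e-32)

def project(rows, dist, d=None):
    "Give each row d coordinates; d defaults to log(sqrt(N)) + 1."
    d = d or int(math.log(math.sqrt(len(rows))) + 1)
    pairs = [pivots(rows, dist) for _ in range(d)]
    return [[projection(r, p, o, dist) for p, o in pairs] for r in rows]
```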

BTW, now we can solve the KD tree dimensionality problem.

  • Using random projections:
    • To find the wildest dimension
    • At each level, do LSH 30 times and select the points that are most distant (*)

(*) beware outliers :

  • A safe thing might be to sort the pivots by their distance and take a pair at about 90% of the max distance (a sketch of this hybrid follows)
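A minimal sketch of that hybrid (the trim=0.9 outlier guard and all names here are my own illustration, not the course's code; dist is again any two-argument distance):

```python
# At each level: find two far-apart pivots (backing off to ~90% of the max
# separation to dodge outliers), project every row onto the pivot-to-pivot
# line, split at the median projection, then recurse.
import math, random

def proj(row, p, o, dist):
    "Cosine-rule coordinate of row on the line from pivot p to pivot o."
    a, b, c = dist(row, p), dist(row, o), dist(p, o)
    return (a*a + c*c - b*b) / (2*c + 1e-32)

def robust_pivots(rows, dist, samples=30, trim=0.9):
    "Two far-apart rows, at about 90% of the max separation seen in a sample."
    some = random.sample(rows, min(samples, len(rows)))
    pairs = sorted(((p, o) for p in some for o in some if p is not o),
                   key=lambda po: dist(po[0], po[1]))
    return pairs[int(trim * (len(pairs) - 1))]

def rp_split(rows, dist, stop=None):
    "Recursively halve the rows along the pivot-to-pivot projection."
    stop = stop or math.sqrt(len(rows))
    if len(rows) <= stop:
        return [rows]
    p, o = robust_pivots(rows, dist)
    rows = sorted(rows, key=lambda r: proj(r, p, o, dist))
    mid = len(rows) // 2
    return rp_split(rows[:mid], dist, stop) + rp_split(rows[mid:], dist, stop)
```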

Kmeans:

done

Mini-batch k-means

Done

  • For scaling to very large data sets, we make a grand and glorious hack to the grand and glorious K-means hack.
  • Mini-batch K-means reads the data in eras (called "batches") of size "b".
    • As with K-means, we label each arriving example with the id of its nearest centroid (and initially, we pick those centroids at random). By the way, this is analogous to the "E" step of EM
    • For the "M" step, we assume that the more examples a centroid has seen, the more we believe in it and the less we want to move it
    • "n" stores how often this centroid has been picked by new data
    • So in the "M" step, each item "pulls" its centroid's attributes "c" towards its own attributes "x" by an amount weighted by "n": c = (1-1/n)*c + x/n (see the sketch after this list).
    • Note that when "n" is large, "c" barely moves at all.
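A minimal sketch of that update rule (my own, not sklearn's MiniBatchKMeans; dist is any two-argument distance, and the batch count and size defaults are arbitrary):

```python
# Assign each item in a batch to its nearest centroid, then pull that centroid
# towards the item by 1/n, where n counts how often the centroid was picked.
import random

def nearest(x, centroids, dist):
    "Index of the centroid closest to x."
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))

def mini_batch_kmeans(data, k, dist, batches=100, b=256):
    centroids = [list(x) for x in random.sample(data, k)]   # random init
    n = [0] * k                                             # pick counts
    for _ in range(batches):
        batch = random.sample(data, min(b, len(data)))
        for x in batch:
            i = nearest(x, centroids, dist)                 # "E"-like step
            n[i] += 1                                       # "M"-like step:
            eta = 1 / n[i]                                  # big n => small move
            centroids[i] = [(1 - eta) * c + eta * xj
                            for c, xj in zip(centroids[i], x)]
    return centroids
```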

Exercise

A method for initialising the K-means clustering algorithm using kd-trees

  • What is the Forgy method? (section 1.1)
  • What is Bradley and Fayyad's method? (section 1.2)
  • What is your favorite other method?
  • What is the method of this paper?