(What to do when you have no labels for classification/regression.)
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html:
BTW, once you get these clusters, you can use them to guide your search for labels
- e.g. just ask about a few (maybe even just one) labels per cluster
And you are half way to an optimizer:
- If you are in clusterX, and you don't like it
- And you want to be in clusterY
- Then just run a decision tree algorithm with two classes, this cluster and the other one (see the sketch after this list)
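Here is a minimal sketch of that last step, assuming scikit-learn and two already-found clusters held as numpy arrays (the data below is made up purely for illustration):

```python
# Minimal sketch: learn what separates two clusters; the tree's tests then
# suggest what to change to move from clusterX (where we are) to clusterY.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

clusterX = np.random.rand(50, 3)         # hypothetical: the cluster we dislike
clusterY = np.random.rand(50, 3) + 1.0   # hypothetical: the cluster we want

X = np.vstack([clusterX, clusterY])
y = [0] * len(clusterX) + [1] * len(clusterY)   # two classes: "here" vs "there"

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree))   # branches leading to class 1 = "how to get there"
```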
Many. e.g. p-norm
- D = (∑ δ(x,y)^p)^(1/p), summed over all the columns
- euclidean : p=2
But what is δ ?
- Symbols:
- return 0 if x == y else 1
- Numbers:
- abs(x - y)
- to make numbers fair with symbols, first normalize x,y to 0..1 using (x - min)/(max - min)
What about missing values:
- assume worst case
- if both unknown, assume δ = 1
- if one symbol missing, assume δ = 1
- if one number missing:
- let x,y be the unknown, known values
- y = normalize(y)
- x = 0 if y > 0.5 else 1
- δ = abs(x - y) (see the sketch after this list)
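Putting the last few bullets together, here is a minimal sketch of that distance function (my own illustrative code, assuming "?" marks a missing value and that each numeric column comes with its observed min and max):

```python
# Minimal sketch of the p-norm distance with the δ rules above.
# Assumptions (mine, not the notes'): "?" marks a missing value; numeric
# columns carry their min (lo) and max (hi); other columns hold symbols.

def norm(v, lo, hi):
    "Map a number onto 0..1."
    return (v - lo) / (hi - lo + 1e-32)

def delta(x, y, lo=None, hi=None):
    "Distance between two cell values, always in 0..1."
    if x == "?" and y == "?":
        return 1                                 # both unknown: worst case
    if lo is None:                               # symbolic column
        if x == "?" or y == "?":
            return 1                             # one symbol missing: worst case
        return 0 if x == y else 1
    if x == "?" or y == "?":                     # one number missing
        known = norm(y if x == "?" else x, lo, hi)
        unknown = 0 if known > 0.5 else 1        # assume the worst case
        return abs(unknown - known)
    return abs(norm(x, lo, hi) - norm(y, lo, hi))

def dist(row1, row2, cols, p=2):
    "D = (sum of delta^p)^(1/p); cols = [(index, lo, hi)], lo=hi=None for symbols."
    return sum(delta(row1[i], row2[i], lo, hi) ** p
               for i, lo, hi in cols) ** (1 / p)
```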
Distance gets weird for high dimensions
- for a large-dimensional orange, most of the mass is in the skin (see the quick check after this list)
- the volume of the space increases so fast that the available data becomes sparse
- the amount of data needed to support a result grows exponentially with the number of dimensions
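As a quick check on the orange claim (my own arithmetic, not from the notes): if the skin is the outer 5% of the radius, then since volume scales as radius^d, the inner pulp holds only 0.95^d of the volume in d dimensions:

```python
# Fraction of a d-dimensional ball's volume that lies in the outer 5% "skin".
for d in (2, 3, 10, 50, 100):
    print(d, round(1 - 0.95 ** d, 3))
# 2 0.098   3 0.143   10 0.401   50 0.923   100 0.994  -> nearly all skin
```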
Simplicity itself (much faster than KD trees)
- Find wildest dimension
- e.g. Most variance,
- e.g. Greatest max - min
- Divide in two on that dimension
- e.g. At median point
- Recurse on both halves
- Stop when you are down to, say, sqrt(N) of the data (see the sketch after this list)
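A minimal sketch of that splitter (my own illustrative code, assuming rows are purely numeric lists and taking "wildest" to mean highest variance):

```python
# Minimal sketch: recursively split rows at the median of their "wildest"
# (highest-variance) column, stopping at leaves of about sqrt(N) rows.
import math, statistics

def wildest(rows):
    "Index of the column with the greatest variance."
    return max(range(len(rows[0])),
               key=lambda c: statistics.pvariance([r[c] for r in rows]))

def split(rows, stop=None):
    "Yield the leaf clusters of the recursive median split."
    stop = stop or math.sqrt(len(rows))
    if len(rows) <= stop:
        yield rows                            # small enough: one cluster
    else:
        c = wildest(rows)
        rows = sorted(rows, key=lambda r: r[c])
        mid = len(rows) // 2
        yield from split(rows[:mid], stop)    # recurse on both halves
        yield from split(rows[mid:], stop)
```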
Problem: curse of dimensionality
- For large dimensional problems
- After a few cuts, each cut holds too little data to separate any further
- and the data in each cut is still wildly variable, since we've missed the important dimensions
- A technique used to reduce the dimensionality of a set of points
- Known for its power, simplicity, and low error rates when compared to other methods
- The intuition: if n randomly selected dimensions say you are similar to something else
- then you are probably similar
- Method one: Gaussian random projections
- Matrix = rows * cols
- Matrix A,B
- A = m × n
- B = n × p
- C = A*B = m x p
- So we can reduce the n=1000 cols in matrix A to p=30 cols in C via a matrix B of 1000 rows by 30 cols
- Initialize B by filling each column with random numbers pulled from a normal bell curve (see the sketch after this list)
- Only works for numbers
- Requires all the data in RAM (bad for big data sets)
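A minimal sketch of method one (my own illustrative code, using numpy; the 1000-to-30 shapes are just the example sizes from above):

```python
# Minimal sketch of a Gaussian random projection: A (m x n) times B (n x p)
# gives the reduced data C (m x p), where B is filled with normal random numbers.
import numpy as np

m, n, p = 500, 1000, 30       # example sizes: squash 1000 columns down to 30
A = np.random.rand(m, n)      # the original, numbers-only data
B = np.random.randn(n, p)     # each column drawn from a normal bell curve
C = A @ B                     # the reduced data
print(C.shape)                # (500, 30)
```

(scikit-learn packages this idea as sklearn.random_projection.GaussianRandomProjection.)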
- Method two: LSH: locality sensitive hashing
- Find a way to get an approximate position for a row in a reduced space
- Repeat d = log(sqrt(N)) + 1 times to get d random projections
- For the i-th projection, find two distant "pivots" = (p,o)
- e.g. pick, say, 30 rows at random, then find within them the two most distant rows
- this row's d.i = 0 if (row closer to p than o) else 1
- If you want not 0,1 on each dimension but a continuous number, then:
- given pivots (p,o) separated by distance "c"
- a = dist(row,p)
- b = dist(row,o)
- this row's d.i = (a^2 + c^2 - b^2) / (2c)
- Cosine rule (see the sketch after this list)
- Can be done via random sampling over some of the data
- Better for bigger data
- But less exact
- still, darn useful
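A minimal sketch of method two (my own illustrative code, assuming purely numeric rows and plain Euclidean distance for brevity):

```python
# Minimal sketch of the pivot-based projection: pick two distant pivots,
# then place each row on the line between them using the cosine rule.
import math, random

def euclid(r1, r2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

def pivots(rows, samples=30):
    "From a few randomly picked rows, return the most distant pair."
    some = random.sample(rows, min(samples, len(rows)))
    return max(((p, o) for p in some for o in some),
               key=lambda po: euclid(*po))

def project(rows, d):
    "Give each row d new coordinates, one per random pair of pivots."
    out = [[] for _ in rows]
    for _ in range(d):
        p, o = pivots(rows)
        c = euclid(p, o) + 1e-32
        for row, new in zip(rows, out):
            a, b = euclid(row, p), euclid(row, o)
            new.append((a*a + c*c - b*b) / (2*c))   # cosine rule
    return out
```

For the 0,1 version, replace the cosine-rule line with `new.append(0 if a < b else 1)`.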
BTW, now we can solve the KD tree dimensionality problem.
- Using random projection:
- To find the wildest dimension:
- At each level, do LSH, say, 30 times and select the pair of points that are most distant (*)
(*) beware outliers:
- A safer move is to sort the candidate pivot pairs by their separation and take a pair at about 90% of the maximum distance (see the sketch below)
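A minimal sketch of that outlier-shy pivot choice (my own illustrative code, numeric rows and Euclidean distance again):

```python
# Minimal sketch: collect several candidate pivot pairs, then return the pair
# whose separation is closest to 90% of the maximum, dodging lonely outliers.
import math, random

def euclid(r1, r2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

def safe_pivots(rows, repeats=30, samples=30, cut=0.9):
    pairs = []
    for _ in range(repeats):
        some = random.sample(rows, min(samples, len(rows)))
        pairs.append(max(((p, o) for p in some for o in some),
                         key=lambda po: euclid(*po)))
    dmax = max(euclid(*po) for po in pairs)
    return min(pairs, key=lambda po: abs(euclid(*po) - cut * dmax))
```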
- For scaling to very large data sets, we make a grand and glorious hack to the grand and glorious K-means hack.
- Mini-batch K-means reads the data in eras (called "batches") of size "b".
- As with K-means, we label each arriving example with the id of its nearest centroid (and initially, we pick those centroids at random). By the way, this is analogous to the "E" step
- For the "M" step, we assume that the more examples a centroid has seen, the more we believe in it and the less we want to move it
- "n" stores how often this centroid has been picked by new data
- So in the "M" step, each item "pulls" its centroid attribute "c" towards its own attribute "x" by an amount weighted by 1/n:
c = (1-1/n)*c + x/n
- Note that when "n" is large, "c" barely moves at all (see the sketch after this list).
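A minimal sketch of that streaming update (my own illustrative code; numeric rows only, with the initial centroids picked at random from the first batch):

```python
# Minimal sketch of mini-batch K-means: label each example with its nearest
# centroid ("E"), then pull that centroid towards the example by 1/n ("M").
import math, random

def nearest(centroids, x):
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], x))

def mini_batch_kmeans(batches, k):
    "batches: an iterable of lists of numeric rows; returns k centroid rows."
    batches = iter(batches)
    first = next(batches)
    centroids = [list(row) for row in random.sample(first, k)]
    counts = [1] * k                              # "n" for each centroid
    for batch in [first, *batches]:
        for x in batch:
            i = nearest(centroids, x)             # "E": id of nearest centroid
            counts[i] += 1
            n = counts[i]
            centroids[i] = [(1 - 1/n) * c + xj / n             # "M": when n is
                            for c, xj in zip(centroids[i], x)]  # big, c barely moves
    return centroids
```

(In practice, sklearn.cluster.MiniBatchKMeans implements this kind of update.)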
A method for initialising the K-means clustering algorithm using kd-trees
- What is the Forgy method? (section 1.1)
- What is Bradley and Fayyad's method? (section 1.2)
- What is your favorite other method?
- What is the method of this paper?