home | copyright ©2019, tjmenzie@ncsu.edu

syllabus | src | submit | chat

Homework 8

Last week you divided up decision space (unsupervised recursive partitioning of the independent variables).

This week you will learn deltas in the objective (a.k.a. goal) space.

  • You will combine code from two prior homeworks.
  • Have we avoided using globals in our code? Hmm?

For each leaf cluster, find out who you envy.

  • We envy things that are close by and much better than us.
  • We could also add a minimum goal if we wanted to (e.g. only be envious of the best half of the data).

For each pair of centroids (C1,C2):

  • Let Δ = the delta between the independent variables of C1 and C2,
    • computed using the distance calculation from last week's (unsupervised) tree learning on the raw attributes.
  • Let ε = how much this centroid's goals dominate the other centroid's goals.
  • Engineering tip: express both those numbers as ratios from zero to one.
    • So divide Δ by sqrt(number of independent attributes).
    • So normalize ε(C1,C2) by dividing it by the sum of all ε(C1,Cx), where Cx ranges over all the other centroids in the cluster tree.
  • Envy can now be calculated many ways.
    • We will use envy = (1-Δ) - ε (see the sketch below).
      • The minus sign on ε comes from the domination predicate defined below.
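For concreteness, here is one hedged sketch of that arithmetic in Python. The helper names are assumptions, not requirements: `dist` stands in for last week's distance function, `dominates` for the Step 3 predicate defined below, and `n_indep` for the number of independent attributes. Check the sign handling on ε against your own conventions.

```python
import math

def envy(c1, c2, centroids, goals, n_indep, dist, dominates):
    # Hedged sketch: `dist` is last week's distance over the independent
    # attributes; `dominates` is the Step 3 predicate (defined below).
    delta = dist(c1, c2) / math.sqrt(n_indep)      # Δ, scaled into 0..1
    raw   = dominates(c1, c2, goals)               # ε(C1,C2); negative = C1 better
    total = sum(abs(dominates(c1, cx, goals))      # normalize ε over all ε(C1,Cx)
                for cx in centroids if cx is not c1) or 1
    return (1 - delta) - raw / total               # envy = (1-Δ) - ε
```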

Step 1: Divide decision space (done! see last week)

Step 2: Find centroids

  • Mean/mode of the independent variables (see the sketch below).
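A centroid is just a synthetic row built from column summaries. A minimal sketch, assuming numeric columns take the mean and symbolic columns the mode; `col.pos` and `col.is_numeric` are illustrative field names, not the starter code's:

```python
from statistics import mean, mode

def centroid(rows, cols):
    # rows: the rows in one leaf cluster; cols: the independent columns.
    # `col.pos` and `col.is_numeric` are assumed fields, not prescribed.
    cells = []
    for col in cols:
        values = [row.cells[col.pos] for row in rows]
        cells.append(mean(values) if col.is_numeric else mode(values))
    return cells
```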

Step 3: Build a Domination Predicate

To compute domination:

  • Modify your table class so it knows about the goals we want to minimize and maximize. Any column marked with "<" or ">" gets a weight of -1 or 1 (respectively) and denotes a goal you want to minimize or maximize.
    • Engineering tip:
      • Give each column a default weight of 1 in the __init__ method.
      • Only for the columns marked with < do you need to change that to -1.
  • Add a function that reports whether one set of goals is better than another.
    • Engineering tip: the best way to do this is to reflect on the delta in the value of each goal, raised to a power (see below). WARNING: normalize the goals 0..1 before raising to a power. Otherwise, it will blow up!
    • The official domination predicate is "i dominates j when s1/n < s2/n". So large negative numbers are better! Your code needs to handle that.
```python
def dominates(i, j, goals):  # i and j are rows
    z = 0.00001                       # tiny offset to dodge divide-by-zero edge cases
    s1, s2, n = z, z, z + len(goals)
    for goal in goals:
        a, b = i.cells[goal.pos], j.cells[goal.pos]
        a, b = goal.norm(a), goal.norm(b)    # normalize 0..1 BEFORE exponentiation
        s1 -= 10**(goal.w * (a - b) / n)     # penalty i pays against j
        s2 -= 10**(goal.w * (b - a) / n)     # penalty j pays against i
    return s1/n - s2/n  # i is better if it loses least (i.e. this number is under 0)
```
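The predicate above leans on two things your column class must supply: a weight `w` set from the column header, and a `norm` method mapping raw values into 0..1. A minimal sketch of such a column follows; all the names here are illustrative, not prescribed by the starter code:

```python
class Num:
    # Minimal numeric goal column; all names here are illustrative.
    def __init__(self, pos, name):
        self.pos, self.name = pos, name
        self.w = -1 if name.startswith("<") else 1   # "<" = minimize, else maximize
        self.lo, self.hi = float("inf"), float("-inf")
    def add(self, x):                                # track the column's range
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)
    def norm(self, x):                               # map a raw value into 0..1
        return (x - self.lo) / (self.hi - self.lo + 1e-32)
```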

Step 4: Test your Domination Predicate

For the auto.csv data, score each row by:

  • picking 100 randomly selected other rows;
  • counting how often this row i has dominates(i,j) < 0 against those 100 rows.

Then sort the rows by that count. A sketch follows this list.
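One possible shape for that scoring, as a hedged sketch; it assumes the `dominates` predicate from Step 3 is in scope, and `rank_rows` is an illustrative name:

```python
import random

def rank_rows(rows, goals, samples=100):
    # Score each row by how often it dominates 100 random other rows,
    # then sort best-first. `dominates` is the Step 3 predicate.
    def wins(i):
        others = random.sample(rows, min(samples, len(rows)))
        return sum(1 for j in others if j is not i and dominates(i, j, goals) < 0)
    return sorted(rows, key=wins, reverse=True)
```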

Hand in a file rowsBestRest.md containing the first 4 and the last 4 rows in that sort. You should be seeing something like the following (i.e. the best rows have the lowest weight, the fastest acceleration, and the highest miles per gallon).

|       | cylinder | displacmnt | hpower | <weight | >acceltn | model | origin | >mpg |
|-------|----------|------------|--------|---------|----------|-------|--------|------|
| best  | 4        | >85        | <46    | 1975    | 19.4     | >81   | 3      | 40   |
| best  | 4        | >85        | <65    | 1985    | 21.5     | >78   | 2      | 40   |
| best  | 4        | >85        | <65    | 2085    | 21.7     | >80   | 2      | 40   |
| best  | 4        | >96        | <65    | 2130    | 24.6     | >82   | 2      | 40   |
| ...   | ...      | ...        | ...    | ...     | ...      | ...   | ...    | ...  |
| worst | 8        | >383       | >165   | 4746    | 12       | <71   | 1      | 10   |
| worst | 8        | >383       | >165   | 4951    | 11       | <73   | 1      | 10   |
| worst | 8        | >383       | >165   | 4952    | 11.5     | <73   | 1      | 10   |
| worst | 8        | >383       | >165   | 4955    | 11.5     | <71   | 1      | 10   |

(Recall the goal columns: minimize <weight; maximize >acceltn and >mpg.)

(You do not need to copy the above format exactly. Near enough is good enough.)

Step 5: Look at Envy amongst the centroids

Just compare the centroids of each cluster (mean/mode values)

This is a heuristic to decide which two clusters we will compare (see the sketch below).

Note that there will be one cluster that we will never contrast (the best cluster).
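As a sketch of that pairing, under the same assumptions as before (each cluster carries an assumed `.centroid` field, and we reuse the `envy()` sketch from earlier):

```python
def who_do_i_envy(me, clusters, goals, n_indep, dist, dominates):
    # Sketch only: `me` and `clusters` are leaf clusters, each with an
    # (assumed) .centroid row. Reuses the envy() sketch from above.
    cents  = [c.centroid for c in clusters]
    others = [c for c in clusters if c is not me]
    return max(others,
               key=lambda c: envy(me.centroid, c.centroid, cents,
                                  goals, n_indep, dist, dominates))
```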

Step 6: Learn a Contrast Set

Cluster the auto.csv data. For each cluster, build a two-class data set:

  • class1= this cluster
  • class2= the cluster you most envy

Learn a decision tree that separates the two clusters.

  • Important: learn from all the data in cluster1 and cluster2 (not just the centroids). A sketch follows this list.
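Here is one hedged sketch of that step, using scikit-learn's DecisionTreeClassifier as a stand-in for whatever tree learner you built earlier (the homework does not mandate sklearn). It assumes the columns at `indep_positions` are numeric; symbolic columns would need encoding first, and `.rows` is an assumed field on the cluster:

```python
from sklearn import tree

def contrast(cluster1, cluster2, indep_positions):
    # Build a two-class data set (this cluster vs the envied cluster)
    # from ALL the rows, then learn what separates them.
    rows = cluster1.rows + cluster2.rows             # .rows is an assumed field
    X = [[row.cells[p] for p in indep_positions] for row in rows]
    y = [0] * len(cluster1.rows) + [1] * len(cluster2.rows)
    clf = tree.DecisionTreeClassifier(max_depth=3)   # shallow tree = succinct knowledge
    clf.fit(X, y)
    print(tree.export_text(clf))                     # paste this into trees.md
    return clf
```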

Hand in a file trees.md showing a printout of the trees.

Hand in a file comment.md: good knowledge is succinct knowledge. Are the trees generated in this homework "good"?

Step 7: Comment on Local vs Global Reasoning

You have just built a local reasoner that produces N different models, one for each of N clusters.

The opposite approach would be to build one global model by, say:

  • Rank each row by how much it dominates everyone else.
  • Build a two-class system: one class for the best 20% of rows, another for the remaining 80%.
  • Learn a tree from those classes (see the sketch below).
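For contrast with the local reasoner, the global version might look like this sketch; it reuses the `rank_rows` sketch from Step 4 and the sklearn import from Step 6, and the 20/80 split follows the list above:

```python
def global_tree(rows, goals, indep_positions):
    # One model for all the data: best 20% of rows vs the rest.
    ranked = rank_rows(rows, goals)                  # Step 4's sort, best first
    cut = max(1, len(ranked) // 5)
    best, rest = ranked[:cut], ranked[cut:]
    X = [[r.cells[p] for p in indep_positions] for r in best + rest]
    y = [1] * len(best) + [0] * len(rest)
    clf = tree.DecisionTreeClassifier(max_depth=3)
    clf.fit(X, y)
    return clf
```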

The great philosopher of knowledge, Suvodeep Majumder, has comments on the value of local vs global reasoning in sections 2.1 and 2.2. Read those sections and hand in a file comment2.txt: about a page of ASCII text commenting on the value (or otherwise) of your local reasoner versus a more global approach. Feel free to disagree with Suvodeep. After all, he only did this class a year ago.