Commit 9fabece

Towards AutoAD (#395)

* towards autoAD
* long overdue documentation update
* fixes
* fixes
* fixes, really!
* adding conditional forecast via near neighbors
* fixes
1 parent 5ab5a97 commit 9fabece

65 files changed (+1489, -582 lines changed)

Java/README.md

+49-28
@@ -2,8 +2,10 @@
22

33
This directory contains a Java implementation of the Random Cut Forest data structure and algorithms
44
for anomaly detection, density estimation, imputation, and forecast. The goal of this library
5-
is to be easy to use and to strike a balance between efficiency and extensibility. Please see randomcutforest-examples
6-
for a few detailed examples and extensions.
5+
is to be easy to use and to strike a balance between efficiency and extensibility. Please do not forget
6+
to look into the ParkServices package, which provides many augmented functionalities such as explicit determination
7+
of anomaly grade based on the first hand understanding of the core algorithm. Please also see randomcutforest-examples
8+
for a few detailed examples and extensions. Please do not hesitate to create an issue for any discussion item.
79

810
## Basic operations
911

@@ -13,20 +15,26 @@ To create a RandomCutForest instance with all parameters set to defaults:
1315
int dimensions = 5; // The number of dimensions in the input data, required
1416
RandomCutForest forest = RandomCutForest.defaultForest(dimensions);
1517
```
16-
18+
We recommend using a shingle size that corresponds to contextual analysis of data,
19+
and RCF uses ideas not dissimilar from higher order Markov Chains to improve its
20+
accuracy. An option is provided to have the shingles be constructed internally.
1721
To explicitly set optional parameters like number of trees in the forest or
18-
sample size, RandomCutForest provides a builder:
22+
sample size, RandomCutForest provides a builder (for example with 4 input dimensions for
23+
a 4-way multivariate analysis):
1924

2025
```java
2126
RandomCutForest forest = RandomCutForest.builder()
22-
.numberOfTrees(90)
23-
.sampleSize(200)
24-
.dimensions(2) // still required!
25-
.lambda(0.2)
26-
.randomSeed(123)
27-
.storeSequenceIndexesEnabled(true)
28-
.centerOfMassEnabled(true)
29-
.build();
27+
.numberOfTrees(90)
28+
.sampleSize(200) // use this to cover the phenomenon of interest
29+
// for analysis of 5 minute aggregations, a week has
30+
// about 12 * 24 * 7 starting points of interest
31+
// larger sample sizes will be larger models
32+
.dimensions(inputDimension*4) // still required!
33+
.timeDecay(0.2) // determines half life of data
34+
.randomSeed(123)
35+
.internalShinglingEnabled(true)
36+
.shingleSize(7)
37+
.build();
3038
```
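The sizing comments in the builder above come down to simple arithmetic. The following is a minimal sketch, assuming the relationship stated in the parameter table (`dimensions` is the product of the input dimensions and `shingleSize`). The class and method names are hypothetical and are not part of the library.

```java
// Hypothetical helper, not part of RandomCutForest: it only illustrates the
// arithmetic behind the `dimensions` and `sampleSize` comments in the README.
final class ForestSizing {

    /** Value for builder.dimensions(...): input dimensions times shingle size. */
    static int forestDimensions(int inputDimensions, int shingleSize) {
        if (inputDimensions <= 0 || shingleSize <= 0) {
            throw new IllegalArgumentException("both values must be positive");
        }
        return Math.multiplyExact(inputDimensions, shingleSize);
    }

    /** Number of aggregation windows (starting points) in one week. */
    static int startingPointsPerWeek(int aggregationMinutes) {
        if (aggregationMinutes <= 0 || 60 % aggregationMinutes != 0) {
            throw new IllegalArgumentException("aggregation must divide an hour evenly");
        }
        return (60 / aggregationMinutes) * 24 * 7;
    }

    public static void main(String[] args) {
        // 4 input variables, shingle size 7
        System.out.println(forestDimensions(4, 7));   // prints 28
        // 5 minute aggregations: 12 * 24 * 7
        System.out.println(startingPointsPerWeek(5)); // prints 2016
    }
}
```

A week of 5 minute aggregations therefore has far more starting points than the `sampleSize(200)` used in the snippet, which is the trade-off against model size that the README comment points at.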
3139

3240
Typical usage of a forest is to compute a statistic on an input data point and then update the forest with that point
@@ -53,27 +61,40 @@ while (true) {
5361

5462
The following parameters can be configured in the RandomCutForest builder.
5563

56-
| Parameter Name | Type | Description | Default Value|
57-
| --- | --- | --- | --- |
58-
| centerOfMassEnabled | boolean | If true, then tree nodes in the forest will compute their center of mass as part of tree update operations. | false |
59-
| dimensions | int | The number of dimensions in the input data. | Required, no default value |
60-
| lambda | double | The decay factor used by stream samplers in this forest. See the next section for guidance. | 1 / (10 * sampleSize) |
61-
| numberOfTrees | int | The number of trees in this forest. | 50 |
62-
| outputAfter | int | The number of points required by stream samplers before results are returned. | 0.25 * sampleSize |
63-
| parallelExecutionEnabled | boolean | If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel. For larger shingle sizes, dimensions, and number of trees, parallelization may improve throughput. We recommend users benchmark against their target use case. | false |
64-
| randomSeed | long | A seed value used to initialize the random number generators in this forest. | |
65-
| sampleSize | int | The sample size used by stream samplers in this forest | 256 |
66-
| storeSequenceIndexesEnabled | boolean | If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with poitn values. | false |
67-
| threadPoolSize | int | The number of threads to use in the internal threadpool. | Number of available processors - 1 |
68-
69-
## Choosing a `lambda` value for your application
64+
| Parameter Name | Type | Description | Default Value |
65+
|-----------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|
66+
| dimensions | int | The number of dimensions in the input data. | Required, no default value. Should be the product of input dimensions and shingleSize |
67+
| shingleSize | int | The number of contiguous observations across all the input variables that would be used for analysis | Strongly recommended for contextual anomalies. Required for Forecast/Extrapolate |
68+
| lambda | double | The decay factor used by stream samplers in this forest. See the next section for guidance. | 1 / (10 * sampleSize) |
69+
| numberOfTrees | int | The number of trees in this forest. | 50 |
70+
| outputAfter | int | The number of points required by stream samplers before results are returned. | 0.25 * sampleSize |
71+
| internalShinglingEnabled | boolean | Whether the shingling is performed by RCF itself since it has already seen previous values. | false (for historical reasons). Recommended : true, will result in smaller models. |
72+
| parallelExecutionEnabled | boolean | If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel. For larger shingle sizes, dimensions, and number of trees, parallelization may improve throughput. We recommend users benchmark against their target use case. | false |
73+
| randomSeed | long | A seed value used to initialize the random number generators in this forest. | |
74+
| sampleSize | int | The sample size used by stream samplers in this forest | 256 |
75+
| centerOfMassEnabled | boolean | If true, then tree nodes in the forest will compute their center of mass as part of tree update operations. | false |
76+
| storeSequenceIndexesEnabled | boolean | If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with point values. | false |
77+
| threadPoolSize | int | The number of threads to use in the internal threadpool. | Number of available processors - 1 |
78+
79+
The above parameters are the most common and historical. Please use the issues to request additions/discussions of other parameters of interest.
80+
81+
RandomCutForest primarily provides an estimation (say anomaly score, or extrapolation over a forecast horizon) and using that raw estimation can be challenging. The ParkServices package provides
82+
several capabilities (ThresholdedRandomCutForest, RCFCaster, respectively) for distilling the scores to a determination of
83+
anomaly/otherwise (an assessment of grade) or calibrated conformal forecasts. These have natural parameter choices that are different
84+
from the core RandomCutForest -- for example internalShinglingEnabled defaults to true since that is more natural in those contexts.
85+
The examples package provides a collection of examples and uses of parameters; we draw attention to ThresholdedMultiDimensionalExample
86+
and RCFCasterExample. If one is interested in sequential analysis of a series of consecutive inputs, check out SequentialAnomalyExample.
87+
ParkServices also exposes many other functionalities of RCF which were covert, such as clustering (including multi-centroid representations)
88+
-- see NumericGLADExample for instance.
89+
90+
## Choosing a `timeDecay` value for your application
7091

7192
When we submit a point to the sampler, it is included into the sample with some probability, and
7293
it will remain in the sample for some number of steps before being replaced. Call the number of steps that
7394
a point is included in the sample the "lifetime" of the point (which may be 0). Over a finite time
7495
window, the distribution of the lifetime of a point is approximately exponential with parameter
75-
`lambda`. Thus, `1 / lambda` is approximately the average number of steps that a point will be included
76-
in the sample. By default, we set `lambda` equal to `1 / (10 * sampleSize)`.
96+
`lambda`. Thus, `1 / timeDecay` is approximately the average number of steps that a point will be included
97+
in the sample. By default, we set `timeDecay` equal to `1 / (10 * sampleSize)`.
7798

7899
Alternatively, if you want the probability that a point survives longer than n steps to be 0.05,
79100
you can solve for `lambda` in the equation `exp(-lambda * n) = 0.05`.
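Solving `exp(-lambda * n) = p` gives `lambda = -ln(p) / n`. The sketch below works through that calculation, assuming that `lambda` in the equation is the value passed to the builder as `timeDecay`. The class and method names are hypothetical and are not part of the library.

```java
// Hypothetical helper, not part of RandomCutForest: it solves the survival
// equation from the README for the decay parameter.
final class TimeDecayChoice {

    /** Decay such that a point survives longer than `steps` with the given probability. */
    static double timeDecayForSurvival(int steps, double survivalProbability) {
        if (steps <= 0 || survivalProbability <= 0 || survivalProbability >= 1) {
            throw new IllegalArgumentException("need steps > 0 and 0 < probability < 1");
        }
        // exp(-lambda * n) = p  =>  lambda = -ln(p) / n
        return -Math.log(survivalProbability) / steps;
    }

    /** The default stated in the README: 1 / (10 * sampleSize). */
    static double defaultTimeDecay(int sampleSize) {
        return 1.0 / (10.0 * sampleSize);
    }

    public static void main(String[] args) {
        double decay = timeDecayForSurvival(1000, 0.05);
        System.out.println("timeDecay for 5% survival past 1000 steps: " + decay);
        System.out.println("check, exp(-decay * 1000): " + Math.exp(-decay * 1000));
        System.out.println("default for sampleSize 256: " + defaultTimeDecay(256));
    }
}
```

For `n = 1000` and `p = 0.05` the result is roughly 0.003, noticeably larger than the default for a sample size of 256 (about 0.00039), so points would be forgotten faster than under the default.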

Java/core/src/main/java/com/amazon/randomcutforest/RandomCutForest.java

+6
@@ -1228,6 +1228,7 @@ public float[] extrapolateFromCurrentTime(int horizon) {
12281228
* considered a neighbor.
12291229
* @return a list of Neighbors, ordered from closest to furthest.
12301230
*/
1231+
@Deprecated
12311232
public List<Neighbor> getNearNeighborsInSample(double[] point, double distanceThreshold) {
12321233
return getNearNeighborsInSample(toFloatArray(point), distanceThreshold);
12331234
}
@@ -1258,7 +1259,12 @@ public List<Neighbor> getNearNeighborsInSample(float[] point, double distanceThr
12581259
* @param point A point whose neighbors we want to find.
12591260
* @return a list of Neighbors, ordered from closest to furthest.
12601261
*/
1262+
@Deprecated
12611263
public List<Neighbor> getNearNeighborsInSample(double[] point) {
1264+
return getNearNeighborsInSample(toFloatArray(point));
1265+
}
1266+
1267+
public List<Neighbor> getNearNeighborsInSample(float[] point) {
12621268
return getNearNeighborsInSample(point, Double.POSITIVE_INFINITY);
12631269
}
12641270

Java/core/src/main/java/com/amazon/randomcutforest/anomalydetection/AbstractAttributionVisitor.java

+6-5
@@ -156,7 +156,8 @@ public void accept(INodeView node, int depthOfNode) {
156156
}
157157
}
158158

159-
if ((hitDuplicates || ignoreLeaf) && (pointInsideBox || depthOfNode == 0)) {
159+
boolean capture = (pointInsideBox || depthOfNode == 0);
160+
if ((hitDuplicates || ignoreLeaf) && capture) {
160161
// final rescaling; this ensures agreement with the ScalarScoreVector
161162
// the scoreUnseen/scoreSeen should be the same as scoring; other uses need
162163
// caution.
@@ -170,7 +171,9 @@ public void acceptLeaf(INodeView leafNode, int depthOfNode) {
170171

171172
updateRangesForScoring(leafNode.getBoundingBox(), leafNode.getBoundingBox().getMergedBox(pointToScore));
172173

173-
if (Arrays.equals(leafNode.getLeafPoint(), pointToScore)) {
174+
// newrange == 0 corresponds to equality of points and is faster than
175+
// Arrays.equals
176+
if (sumOfNewRange <= 0) {
174177
hitDuplicates = true;
175178
}
176179

@@ -180,9 +183,7 @@ public void acceptLeaf(INodeView leafNode, int depthOfNode) {
180183
savedScore = scoreUnseen(depthOfNode, leafNode.getMass());
181184
}
182185

183-
if ((hitDuplicates) || ((ignoreLeaf) && (leafNode.getMass() <= ignoreLeafMassThreshold))
184-
|| sumOfNewRange <= 0) {
185-
186+
if ((hitDuplicates) || ((ignoreLeaf) && (leafNode.getMass() <= ignoreLeafMassThreshold))) {
186187
Arrays.fill(directionalAttribution.high, savedScore / (2 * pointToScore.length));
187188
Arrays.fill(directionalAttribution.low, savedScore / (2 * pointToScore.length));
188189
/* in this case do not have a better option than an equal attribution */
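The comment in the hunk above relies on the fact that a leaf's bounding box is a single point, so merging it with the query adds zero range in every dimension exactly when the two points coincide. The following is a minimal, self-contained sketch of that idea. `sumOfNewRange` here is a simplified stand-in written for illustration and is not the library's implementation.

```java
import java.util.Arrays;

// Illustration only: a simplified stand-in for the range computation, assuming
// the leaf box is degenerate (min == max == leaf point in every dimension).
final class LeafRangeCheck {

    /** Total side length added when a single-point box is merged with the query. */
    static double sumOfNewRange(float[] leafPoint, float[] query) {
        if (leafPoint.length != query.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0;
        for (int i = 0; i < leafPoint.length; i++) {
            sum += Math.abs((double) leafPoint[i] - query[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] leaf = { 1.0f, 2.0f };
        System.out.println(sumOfNewRange(leaf, new float[] { 1.0f, 2.0f }) <= 0); // true
        System.out.println(sumOfNewRange(leaf, new float[] { 1.0f, 2.5f }) <= 0); // false

        // Edge case where the two tests disagree: signed zeros.
        float[] plusZero = { 0.0f };
        float[] minusZero = { -0.0f };
        System.out.println(sumOfNewRange(plusZero, minusZero) <= 0); // true
        System.out.println(Arrays.equals(plusZero, minusZero));      // false
    }
}
```

The two tests agree for ordinary values. They differ on signed zeros, as shown, and on NaN, where `Arrays.equals` treats two NaN entries as equal while a NaN sum is never `<= 0`. Whether such inputs can reach the visitor is not visible in this hunk.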

Java/core/src/main/java/com/amazon/randomcutforest/anomalydetection/DynamicAttributionVisitor.java

-17
@@ -66,23 +66,6 @@ public DynamicAttributionVisitor(float[] point, int treeMass, int ignoreLeafMass
6666
this.damp = damp;
6767
}
6868

69-
/**
70-
* Same as above with a default non-dampening
71-
*
72-
* @param point to be scored
73-
* @param treeMass mass of the tree
74-
* @param ignoreLeafMassThreshold mass of the leaves to be ignored
75-
* @param scoreSeen score when point has been seen
76-
* @param scoreUnseen score when point has not been seen
77-
*/
78-
public DynamicAttributionVisitor(float[] point, int treeMass, int ignoreLeafMassThreshold,
79-
BiFunction<Double, Double, Double> scoreSeen, BiFunction<Double, Double, Double> scoreUnseen) {
80-
super(point, treeMass, ignoreLeafMassThreshold);
81-
this.scoreSeen = scoreSeen;
82-
this.scoreUnseen = scoreUnseen;
83-
this.damp = (x, y) -> 1.0;
84-
}
85-
8669
@Override
8770
protected double scoreSeen(int depth, int leafMass) {
8871
return scoreSeen.apply((double) depth, (double) leafMass);
@@ -0,0 +1,64 @@
1+
/*
2+
* Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License").
5+
* You may not use this file except in compliance with the License.
6+
* A copy of the License is located at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* or in the "license" file accompanying this file. This file is distributed
11+
* on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
12+
* express or implied. See the License for the specific language governing
13+
* permissions and limitations under the License.
14+
*/
15+
16+
package com.amazon.randomcutforest.config;
17+
18+
/**
19+
* Options for using RCF, especially with thresholds
20+
*/
21+
public enum CorrectionMode {
22+
23+
/**
24+
* default behavior, no correction
25+
*/
26+
NONE,
27+
28+
/**
29+
* due to transforms, or due to input noise
30+
*/
31+
NOISE,
32+
33+
/**
34+
* elimination due to multi mode operation
35+
*/
36+
37+
MULTI_MODE,
38+
39+
/**
40+
* effect of an anomaly in shingle
41+
*/
42+
43+
ANOMALY_IN_SHINGLE,
44+
45+
/**
46+
* conditional forecast, using conditional fields
47+
*/
48+
49+
CONDITIONAL_FORECAST,
50+
51+
/**
52+
* forecasted value was not very different
53+
*/
54+
55+
FORECAST,
56+
57+
/**
58+
* data drifts and level shifts, will not be corrected unless level shifts are
59+
* turned on
60+
*/
61+
62+
DATA_DRIFT
63+
64+
}

Java/core/src/main/java/com/amazon/randomcutforest/imputation/ConditionalSampleSummarizer.java

+5-7
@@ -137,9 +137,12 @@ public SampleSummary summarize(List<ConditionalTreeSample> alist, boolean addTyp
137137
double currentWeight = 0;
138138
int alwaysInclude = 0;
139139
double remainderWeight = totalWeight;
140-
while (alwaysInclude < newList.size() && newList.get(alwaysInclude).distance == 0) {
140+
while (newList.get(alwaysInclude).distance == 0) {
141141
remainderWeight -= newList.get(alwaysInclude).weight;
142142
++alwaysInclude;
143+
if (alwaysInclude == newList.size()) {
144+
break;
145+
}
143146
}
144147
for (int j = 1; j < newList.size(); j++) {
145148
if ((currentWeight < remainderWeight / 3 && currentWeight + newList.get(j).weight >= remainderWeight / 3)
@@ -161,7 +164,6 @@ public SampleSummary summarize(List<ConditionalTreeSample> alist, boolean addTyp
161164
ArrayList<Weighted<float[]>> typicalPoints = new ArrayList<>();
162165
for (int j = 0; j < num; j++) {
163166
ConditionalTreeSample e = newList.get(j);
164-
165167
float[] values;
166168
if (project) {
167169
values = new float[missingDimensions.length];
@@ -171,11 +173,7 @@ public SampleSummary summarize(List<ConditionalTreeSample> alist, boolean addTyp
171173
} else {
172174
values = Arrays.copyOf(e.leafPoint, dimensions);
173175
}
174-
// weight is changed for clustering,
175-
// based on the distance of the sample from the query point
176-
double weight = (e.distance <= threshold) ? e.weight : e.weight * threshold / e.distance;
177-
typicalPoints.add(new Weighted<>(values, (float) weight));
178-
176+
typicalPoints.add(new Weighted<>(values, (float) e.weight));
179177
}
180178
int maxAllowed = min(queryPoint.length * MAX_NUMBER_OF_TYPICAL_PER_DIMENSION, MAX_NUMBER_OF_TYPICAL_ELEMENTS);
181179
maxAllowed = min(maxAllowed, num);
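The first hunk above changes how the leading zero-distance samples are counted: the bound check moves from the loop condition into the loop body. The sketch below contrasts the two loop shapes on a plain array of distances. The class and method names are hypothetical, and the array stands in for `newList`.

```java
// Illustration only: the two loop shapes from the hunk, applied to a plain
// array of distances instead of the list of ConditionalTreeSample objects.
final class LeadingZeroCount {

    /** Old shape: the bound is checked before each element is read. */
    static int countGuardFirst(double[] distances) {
        int alwaysInclude = 0;
        while (alwaysInclude < distances.length && distances[alwaysInclude] == 0) {
            ++alwaysInclude;
        }
        return alwaysInclude;
    }

    /** New shape: the element is read first, the bound is checked after incrementing. */
    static int countBreakInside(double[] distances) {
        int alwaysInclude = 0;
        while (distances[alwaysInclude] == 0) {
            ++alwaysInclude;
            if (alwaysInclude == distances.length) {
                break;
            }
        }
        return alwaysInclude;
    }

    public static void main(String[] args) {
        double[] mixed = { 0, 0, 1.5 };
        double[] allZero = { 0, 0 };
        System.out.println(countGuardFirst(mixed) + " " + countBreakInside(mixed));     // 2 2
        System.out.println(countGuardFirst(allZero) + " " + countBreakInside(allZero)); // 2 2
        System.out.println(countGuardFirst(new double[0]));                             // 0
        // countBreakInside(new double[0]) reads index 0 and throws
        // ArrayIndexOutOfBoundsException.
    }
}
```

The two shapes return the same count for every non-empty input and differ only on an empty one, where the new shape reads index 0 before any bound check. Whether `newList` can be empty at that point is not visible in this hunk.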
