We recommend using a shingle size that corresponds to the contextual analysis of the data,
19
+
and RCF uses ideas not dissimilar from higher order Markov Chains to improve its
20
+
accuracy. An option is provided to have the shingles be constructed internally.
17
21
To explicitly set optional parameters like number of trees in the forest or
18
-
sample size, RandomCutForest provides a builder:
22
+
sample size, RandomCutForest provides a builder (for example with 4 input dimensions for
23
+
a 4-way multivariate analysis):
19
24
20
25
```java
21
26
RandomCutForest forest = RandomCutForest.builder()
22
-
.numberOfTrees(90)
23
-
.sampleSize(200)
24
-
.dimensions(2) // still required!
25
-
.lambda(0.2)
26
-
.randomSeed(123)
27
-
.storeSequenceIndexesEnabled(true)
28
-
.centerOfMassEnabled(true)
29
-
.build();
27
+
.numberOfTrees(90)
28
+
.sampleSize(200) // use this to cover the phenomenon of interest
29
+
// for analysis of 5 minute aggregations, a week has
30
+
// about 12 * 24 * 7 starting points of interest
31
+
// larger sample sizes will be larger models
32
+
.dimensions(inputDimension * 7) // still required! must equal input dimensions * shingleSize
33
+
.timeDecay(0.2) // determines half life of data
34
+
.randomSeed(123)
35
+
.internalShinglingEnabled(true)
36
+
.shingleSize(7)
37
+
.build();
30
38
```
31
39
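To make the relationship between the input dimensions, `shingleSize`, and `dimensions` concrete, the sketch below builds shingles by hand from a stream of observations. It uses no RCF classes: `ShingleSketch` and its methods are hypothetical names chosen for illustration, and the library's internal shingling may differ in detail (for example in how it handles the first few points).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of the RCF library.
public class ShingleSketch {
    private final int inputDimension;
    private final int shingleSize;
    // Holds the most recent observations, oldest first.
    private final Deque<double[]> window = new ArrayDeque<>();

    public ShingleSketch(int inputDimension, int shingleSize) {
        if (inputDimension <= 0 || shingleSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.inputDimension = inputDimension;
        this.shingleSize = shingleSize;
    }

    // The value a forest would need for dimensions(): input dimensions * shingleSize.
    public int dimensions() {
        return inputDimension * shingleSize;
    }

    // Adds one observation; returns the current shingle (oldest observation first),
    // or null until shingleSize observations have been seen.
    public double[] add(double[] observation) {
        if (observation.length != inputDimension) {
            throw new IllegalArgumentException("expected " + inputDimension + " values");
        }
        window.addLast(observation.clone());
        if (window.size() > shingleSize) {
            window.removeFirst();
        }
        if (window.size() < shingleSize) {
            return null;
        }
        double[] shingle = new double[dimensions()];
        int offset = 0;
        for (double[] past : window) {
            System.arraycopy(past, 0, shingle, offset, inputDimension);
            offset += inputDimension;
        }
        return shingle;
    }

    public static void main(String[] args) {
        ShingleSketch sketch = new ShingleSketch(4, 7);
        System.out.println("dimensions = " + sketch.dimensions());
    }
}
```

With internal shingling enabled, the caller passes only the latest observation and the forest keeps the window itself, which is why the table describes `dimensions` as the product of the input dimensions and `shingleSize`.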
32
40
Typical usage of a forest is to compute a statistic on an input data point and then update the forest with that point
@@ -53,27 +61,40 @@ while (true) {
53
61
54
62
The following parameters can be configured in the RandomCutForest builder.
55
63
56
-
| Parameter Name | Type | Description | Default Value|
57
-
| --- | --- | --- | --- |
58
-
| centerOfMassEnabled | boolean | If true, then tree nodes in the forest will compute their center of mass as part of tree update operations. | false |
59
-
| dimensions | int | The number of dimensions in the input data. | Required, no default value |
60
-
| lambda | double | The decay factor used by stream samplers in this forest. See the next section for guidance. | 1 / (10 * sampleSize) |
61
-
| numberOfTrees | int | The number of trees in this forest. | 50 |
62
-
| outputAfter | int | The number of points required by stream samplers before results are returned. | 0.25 * sampleSize |
63
-
| parallelExecutionEnabled | boolean | If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel. For larger shingle sizes, dimensions, and number of trees, parallelization may improve throughput. We recommend users benchmark against their target use case. | false |
64
-
| randomSeed | long | A seed value used to initialize the random number generators in this forest. ||
65
-
| sampleSize | int | The sample size used by stream samplers in this forest | 256 |
66
-
| storeSequenceIndexesEnabled | boolean | If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with poitn values. | false |
67
-
| threadPoolSize | int | The number of threads to use in the internal threadpool. | Number of available processors - 1 |
68
-
69
-
## Choosing a `lambda` value for your application
64
+
| Parameter Name | Type | Description | Default Value |
65
+
| --- | --- | --- | --- |
66
+
| dimensions | int | The number of dimensions in the input data. | Required, no default value. Should be the product of input dimensions and shingleSize |
67
+
| shingleSize | int | The number of contiguous observations across all the input variables that would be used for analysis | Strongly recommended for contextual anomalies. Required for Forecast/Extrapolate |
68
+
| timeDecay | double | The decay factor used by stream samplers in this forest. See the next section for guidance. | 1 / (10 * sampleSize) |
69
+
| numberOfTrees | int | The number of trees in this forest. | 50 |
70
+
| outputAfter | int | The number of points required by stream samplers before results are returned. | 0.25 * sampleSize |
71
+
| internalShinglingEnabled | boolean | Whether the shingling is performed by RCF itself, since it has already seen the previous values. | false (for historical reasons). Recommended: true, which results in smaller models. |
72
+
| parallelExecutionEnabled | boolean | If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel. For larger shingle sizes, dimensions, and number of trees, parallelization may improve throughput. We recommend users benchmark against their target use case. | false |
73
+
| randomSeed | long | A seed value used to initialize the random number generators in this forest. ||
74
+
| sampleSize | int | The sample size used by stream samplers in this forest | 256 |
75
+
| centerOfMassEnabled | boolean | If true, then tree nodes in the forest will compute their center of mass as part of tree update operations. | false |
76
+
| storeSequenceIndexesEnabled | boolean | If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with point values. | false |
77
+
| threadPoolSize | int | The number of threads to use in the internal threadpool. | Number of available processors - 1 |
78
+
79
+
The above parameters are the most common and historical. Please use the issues to request additions/discussions of other parameters of interest.
80
+
81
+
RandomCutForest primarily provides an estimation (say anomaly score, or extrapolation over a forecast horizon) and using that raw estimation can be challenging. The ParkServices package provides
82
+
several capabilities (ThresholdedRandomCutForest, RCFCaster, respectively) for distilling the scores to a determination of
83
+
anomaly/otherwise (an assessment of grade) or calibrated conformal forecasts. These have natural parameter choices that are different
84
+
from the core RandomCutForest -- for example internalShinglingEnabled defaults to true since that is more natural in those contexts.
85
+
The examples package provides a collection of examples and uses of parameters; we draw attention to ThresholdedMultiDimensionalExample
86
+
and RCFCasterExample. If one is interested in sequential analysis of a series of consecutive inputs, check out SequentialAnomalyExample.
87
+
ParkServices also exposes many other functionalities of RCF that were previously hidden, such as clustering (including multi-centroid representations)
88
+
-- see NumericGLADExample for instance.
89
+
90
+
## Choosing a `timeDecay` value for your application
70
91
71
92
When we submit a point to the sampler, it is included into the sample with some probability, and
72
93
it will remain in the sample for some number of steps before being replaced. Call the number of steps that
73
94
a point is included in the sample the "lifetime" of the point (which may be 0). Over a finite time
74
95
window, the distribution of the lifetime of a point is approximately exponential with parameter
75
-
`lambda`. Thus, `1 / lambda` is approximately the average number of steps that a point will be included
76
-
in the sample. By default, we set `lambda` equal to `1 / (10 * sampleSize)`.
96
+
`timeDecay`. Thus, `1 / timeDecay` is approximately the average number of steps that a point will be included
97
+
in the sample. By default, we set `timeDecay` equal to `1 / (10 * sampleSize)`.
77
98
78
99
Alternatively, if you want the probability that a point survives longer than n steps to be 0.05,
79
100
you can solve for `timeDecay` in the equation `exp(-timeDecay * n) = 0.05`.
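Rearranging that equation gives a decay value of `-ln(0.05) / n`. The sketch below works through the arithmetic for the default value, the approximate mean lifetime, and the survival-probability equation. It is plain Java with no RCF dependency; the class and method names are hypothetical, and the exponential lifetime model is the approximation described above, not an exact property of the sampler.

```java
// Hypothetical helper for choosing a decay value; not part of the RCF library.
public class TimeDecaySketch {

    // Default described above: 1 / (10 * sampleSize).
    public static double defaultTimeDecay(int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        return 1.0 / (10.0 * sampleSize);
    }

    // Approximate average number of steps a point stays in the sample: 1 / timeDecay.
    public static double meanLifetime(double timeDecay) {
        if (timeDecay <= 0) {
            throw new IllegalArgumentException("timeDecay must be positive");
        }
        return 1.0 / timeDecay;
    }

    // Solves exp(-timeDecay * steps) = survivalProbability for timeDecay.
    public static double timeDecayForSurvival(long steps, double survivalProbability) {
        if (steps <= 0 || survivalProbability <= 0 || survivalProbability >= 1) {
            throw new IllegalArgumentException("need steps > 0 and 0 < probability < 1");
        }
        return -Math.log(survivalProbability) / steps;
    }

    public static void main(String[] args) {
        double byDefault = defaultTimeDecay(256);
        System.out.println("default decay for sampleSize 256: " + byDefault);
        System.out.println("approximate mean lifetime: " + meanLifetime(byDefault));
        System.out.println("decay for 5% survival past 1000 steps: "
                + timeDecayForSurvival(1000, 0.05));
    }
}
```

The value returned by `timeDecayForSurvival` could then be passed to the builder's decay setting; a larger value forgets old data faster.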