Commit 6194047

jotok, wnbts, cswiercz, Chris Swierczewski, and jmazanec15 authored
Merge main into 2.0 (#250)
* set version to 1.0 (#58)
* Improve CLI usage instructions (#59)
* Improve CLI usage instructions
  Expand upon the CLI instructions with an example and information on where to find additional CLI applications. Adds an example data file for ease of instruction.
  Closes #43
* [rust] Add fundamental types for random cut trees (#169)
* [rust] Add fundamental types for random cut trees
* [rust] Remove unnecessary Python components in Rust implementation .gitignore
* [rust] Cleanup doctests
* [rust] whitespace cleanup
* [rust] Rename package to random_cut_forest
* [rust] Use iterator style for BoundingBox::contains_point
* [rust] quick change to kebab-case
* Rust CI Workflow (#172)
  Create a workflow specification for Rust pull requests. Builds package and runs tests.
* [rust] Random Cut Trees (#173)
* [rust] random cut trees
* Fix documentation and trim whitespace
* Move PointAdder contents to Tree
* Move PointDeleter contents to Tree
* Move point addition and deletion to separate impl blocks
* Remove use of trait aliases
* [rust] Sampled trees and reservoir samplers (#178)
* [rust] Sampled trees
  Implement sampled trees, which are a combination of a point store, a reservoir sampler, and a random cut tree.
* Fix re-export and docs
* Rename and change time decays and weights to f32s
* [rust] Implement the random cut forest type (#188)
* [rust] Implement the random cut forest type
  This commit only implements `update()` on random cut forests. We also implement a random cut forest builder type.
* Remove num_trees as a field and derive from trees vec
* [rust] Basic visitor-based anomaly score (#192)
* [rust] Basic visitor-based anomaly score
* Trim trailing whitespace
* Update sampler test example (#241)
* [Rust] Create a Visitor trait to abstract out scoring algorithms (#217)
* [Rust] Create a Visitor trait to abstract out scoring algorithms
  Most scoring algorithms make use of a visitor design pattern. The Visitor trait makes it easy to perform node traversals with new algorithms. We also add a Tree::traverse() method that makes use of any type that implements Visitor. This addition makes the implementation of anomaly_score() much cleaner.
* [Rust] Check for Visitor impl at compile-time
* Create a separate Tree::traverse function
  The way node iteration is implemented requires collecting and reversing the nodes before applying a visitor. We create an independent traverse function specifically for visitors. The existing iter() function is left as-is for testing purposes.
  Closes #219
* [Rust] Rename algorithm submodule to visitor
* [Rust] Fix docstrings
* [Rust] Add example CLI script on CSV input (#218)
* [Rust] Add CLI-based anomaly score example
* [Rust] Include an example usage CLI script for streaming AD
* [Rust] Add output_after threshold to RCF (#246)
* [Rust] Add output after threshold to RCF
* Add test case for output after

Co-authored-by: John Mazanec <jmazane@amazon.com>
Co-authored-by: Lai <57818076+wnbts@users.noreply.github.com>
Co-authored-by: Chris Swierczewski <cswiercz@gmail.com>
Co-authored-by: Chris Swierczewski <csw@amazon.com>
Co-authored-by: Jack Mazanec <jmazane1@nd.edu>
Co-authored-by: John Mazanec <jmazane@amazon.com>
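The visitor-based traversal described in the commit message above can be pictured with a small, self-contained sketch. Everything in it is hypothetical: the `Node` enum, the shape of the `Visitor` trait, the free function `traverse`, and `DepthVisitor` are illustrative names, not the crate's actual API. It also assumes that the visitor is applied to the leaf first and then to each ancestor on the way back to the root, which is one reading of the "collecting and reversing" remark.

```rust
/// A toy binary tree: internal nodes cut on one coordinate, leaves hold a point.
/// (Hypothetical type, not the crate's `Tree`.)
pub enum Node {
    Internal {
        cut_dim: usize,
        cut_value: f32,
        left: Box<Node>,
        right: Box<Node>,
    },
    Leaf { point: Vec<f32> },
}

/// Hypothetical visitor interface: one callback for the leaf, one per ancestor.
pub trait Visitor {
    type Output;
    fn accept_leaf(&mut self, point: &[f32], depth: usize);
    fn accept_node(&mut self, cut_dim: usize, cut_value: f32, depth: usize);
    fn result(&self) -> Self::Output;
}

/// Walk from the root to the leaf selected by `query`, then apply the visitor
/// to the leaf first and to each ancestor while unwinding back to the root.
pub fn traverse<V: Visitor>(root: &Node, query: &[f32], visitor: &mut V) -> V::Output {
    fn walk<V: Visitor>(node: &Node, query: &[f32], depth: usize, visitor: &mut V) {
        match node {
            Node::Leaf { point } => visitor.accept_leaf(point, depth),
            Node::Internal { cut_dim, cut_value, left, right } => {
                // Follow the side of the cut that contains the query point.
                let child = if query[*cut_dim] <= *cut_value { left } else { right };
                walk(child, query, depth + 1, visitor);
                visitor.accept_node(*cut_dim, *cut_value, depth);
            }
        }
    }
    walk(root, query, 0, visitor);
    visitor.result()
}

/// Example visitor: records the leaf depth and the number of ancestors visited.
#[derive(Default)]
pub struct DepthVisitor {
    leaf_depth: usize,
    ancestors_seen: usize,
}

impl Visitor for DepthVisitor {
    type Output = (usize, usize);

    fn accept_leaf(&mut self, _point: &[f32], depth: usize) {
        self.leaf_depth = depth;
    }

    fn accept_node(&mut self, _cut_dim: usize, _cut_value: f32, _depth: usize) {
        self.ancestors_seen += 1;
    }

    fn result(&self) -> (usize, usize) {
        (self.leaf_depth, self.ancestors_seen)
    }
}

/// Build a small tree with three leaves for demonstration.
pub fn example_tree() -> Node {
    Node::Internal {
        cut_dim: 0,
        cut_value: 0.0,
        left: Box::new(Node::Leaf { point: vec![-5.0, 0.0] }),
        right: Box::new(Node::Internal {
            cut_dim: 1,
            cut_value: 1.0,
            left: Box::new(Node::Leaf { point: vec![5.0, 0.0] }),
            right: Box::new(Node::Leaf { point: vec![5.0, 2.0] }),
        }),
    }
}
```

Calling `traverse(&example_tree(), &[4.0, 3.0], &mut DepthVisitor::default())` reaches the deepest leaf and passes two ancestors on the way back. A new scoring algorithm would only need a new `Visitor` implementation, leaving the traversal untouched.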
1 parent 43cbe71 commit 6194047

22 files changed: +5589 -53 lines changed

.github/workflows/rust.yml

+25
@@ -0,0 +1,25 @@
+name: Rust CI
+
+on:
+  pull_request:
+    branches: [ main ]
+    paths: [ Rust/** ]
+
+env:
+  CARGO_TERM_COLOR: always
+
+defaults:
+  run:
+    working-directory: Rust/
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Build Rust
+      run: cargo build --verbose
+    - name: Run Rust Tests
+      run: cargo test --verbose

Java/README.md

+57 -53
@@ -102,18 +102,60 @@ mvn package -DexcludedGroups=functional

 ## Build Command-line (CLI) usage

-For some of the algorithms included in this package, there are CLI applications that can
-be used for experiments. These applications use `String::split` to read
-delimited data, and as such are **not intended for production use**. Instead,
-use these applications as example code and as a way to learn about the
-algorithms and their hyperparameters.
+> **Important.** The CLI applications use `String::split` to read delimited data
+> and as such are **not intended for production use**.

-After building the project (described in the previous section), you can invoke an example CLI application by adding the
-core jar file to your classpath. For example:
+For some of the algorithms included in this package there are CLI applications
+that can be used for experimentation as well as a way to learn about these
+algorithms and their hyperparameters. After building the project you can invoke
+an example CLI application by adding the core jar file to your classpath.
+
+In the example below we train and score a Random Cut Forest model on the
+three-dimensional data shown in Figure 3 in the original RCF paper.
+([PDF][rcf-paper]) These example data can be
+found at `../example-data/rcf-paper.csv`:

 ```text
-% java -cp core/target/randomcutforest-core-1.0-alpha.jar com.amazon.randomcutforest.runner.AnomalyScoreRunner --help
-Usage: java -cp randomcutforest-core-1.0-alpha.jar com.amazon.randomcutforest.runner.AnomalyScoreRunner [options] < input_file > output_file
+$ tail data/example.csv
+-5.0074,-0.0038,-0.0237
+-5.0029,0.0170,-0.0057
+-4.9975,-0.0102,-0.0065
+4.9878,0.0136,-0.0087
+5.0118,0.0098,-0.0057
+0.0158,0.0061,0.0091
+5.0167,0.0041,0.0054
+-4.9947,0.0126,-0.0010
+-5.0209,0.0004,-0.0033
+4.9923,-0.0142,0.0030
+```
+
+(Note that there is one data point above that is not like the others.) The
+`AnomalyScoreRunner` application reads in each line of the input data as a
+vector data point, scores the data point, and then updates the model with this
+point. The program output appends a column of anomaly scores to the input:
+
+```text
+$ java -cp core/target/randomcutforest-core-1.0.jar com.amazon.randomcutforest.runner.AnomalyScoreRunner < ../example-data/rcf-paper.csv > example_output.csv
+$ tail example_output.csv
+-5.0029,0.0170,-0.0057,0.8129401629464965
+-4.9975,-0.0102,-0.0065,0.6591046054520615
+4.9878,0.0136,-0.0087,0.8552217070518414
+5.0118,0.0098,-0.0057,0.7224686064066762
+0.0158,0.0061,0.0091,2.8299054033889814
+5.0167,0.0041,0.0054,0.7571453322237215
+-4.9947,0.0126,-0.0010,0.7259960347128676
+-5.0209,0.0004,-0.0033,0.9119498264685114
+4.9923,-0.0142,0.0030,0.7310102658466711
+Done.
+```
+
+(As you can see, the anomalous data point was given a large anomaly score.) You can
+read additional usage instructions, including options for setting model
+hyperparameters, using the `--help` flag:
+
+```text
+$ java -cp core/target/randomcutforest-core-1.0.jar com.amazon.randomcutforest.runner.AnomalyScoreRunner --help
+Usage: java -cp target/random-cut-forest-1.0.jar com.amazon.randomcutforest.runner.AnomalyScoreRunner [options] < input_file > output_file

 Compute scalar anomaly scores from the input rows and append them to the output rows.

@@ -130,6 +172,9 @@ Options:
 --help, -h: Print this help message and exit.
 ```

+Other CLI applications are available in the `com.amazon.randomcutforest.runner`
+package.
+
 ## Testing

 The core library test suite is divided into unit tests and "functional" tests. By "functional", we mean tests that
@@ -158,13 +203,13 @@ Test dependencies will be downloaded automatically when invoking `mvn test` or `

 ## Benchmarks

-The benchmark module defines microbenchmarks using the [JMH](https://openjdk.java.net/projects/code-tools/jmh/)
+The benchmark modules defines microbenchmarks using the [JMH](https://openjdk.java.net/projects/code-tools/jmh/)
 framework. Build an executable jar containing the benchmark code by running

 ```text
 % # (Optional) To benchmark the code in your local repository, build and install to your local Maven repository
 % # Otherwise, benchmark dependencies will be pulled from Maven central
-% mvn package install -DexcludedGroups=functional -Dgpg.skip
+% mvn package install -DexcludedGroups=functional
 %
 % mvn -pl benchmark package assembly:single
 ```
@@ -182,45 +227,4 @@ benchmark methods will be executed.
 % java -jar benchmark/target/randomcutforest-benchmark-1.0-jar-with-dependencies.jar RandomCutForestBenchmark\.updateAndGetAnomalyScore
 ```

-### Custom Profilers
-
-This library defines two custom JMH profilers for use in benchmarks:
-
-| Name | Benchmarks | Description | Command-line Example |
-| ---- | ---------- | ----------- | ------------ |
-| OutputSizeProfiler | StateMapperBenchmark | Measures the length of a String or byte array | `java -jar benchmark/target/randomcutforest-benchmark-1.0-jar-with-dependencies.jar StateMapperBenchmark -prof com.amazon.randomcutforest.profilers.OutputSizeProfiler` |
-| ObjectGraphSizeProfiler | StateMapperBenchmark | Wraps the `MemoryMeter::measureDeep` method in the [JAMM](https://github.com/jbellis/jamm) library to measure the amount of memory allocated in an object graph. When using this profiler, you need to set the `javaagent` flag to point to the location of the JAMM JAR file. | `java -javaagent:$HOME/.m2/repository/com/github/jbellis/jamm/0.3.3/jamm-0.3.3.jar -jar benchmark/target/randomcutforest-benchmark-1.0-jar-with-dependencies.jar StateMapperBenchmark -prof com.amazon.randomcutforest.profilers.ObjectGraphSizeProfiler`
-
-Note that you can enable OutputSizeProfiler and ObjectGraphSizeProfiler at the same time by adding their respective `-prof` flags to the command-line.
-
-## Examples
-
-The examples module provides runnable code examples using the library. Build an executable jar containing the
-examples by running:
-
-```text
-% # (Optional) To run examples using code in your local repository, build and install to your local Maven repository
-% # Otherwise, dependencies will be pulled from Maven central
-% mvn package install -DexcludedGroups=functional -Dgpg.skip
-%
-% mvn -pl examples package assembly:single
-```
-
-To see a list of examples:
-
-```text
-% java -jar examples/target/randomcutforest-examples-1.0-jar-with-dependencies.jar
-Usage: java -cp randomcutforest-examples-1.0.jar [example]
-Examples:
-json - serialize a Random Cut Forest as a JSON string
-protostuff - serialize a Random Cut Forest with the protostuff library
-```
-
-To run an example, provide the example name:
-
-```text
-% java -jar examples/target/randomcutforest-examples-1.0-alpha-jar-with-dependencies.jar json
-dimensions = 4, numberOfTrees = 50, sampleSize = 256, precision = DOUBLE
-JSON size = 550295 bytes
-Looks good!
-```
+[rcf-paper]: http://proceedings.mlr.press/v48/guha16.pdf
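The README text in this diff says the runner scores each row, then updates the model with it, and appends the score as an extra output column. The Rust sketch below mirrors only that control flow. It is not the Java runner and not a Random Cut Forest: `StreamingModel`, `RunningMean`, and `score_rows` are hypothetical names, and the stand-in model scores a point by its distance from the running mean purely so the example is self-contained.

```rust
/// Minimal interface for a model that can score a point and then learn from it.
/// (Hypothetical trait; the real runner uses a Random Cut Forest.)
pub trait StreamingModel {
    fn score(&self, point: &[f64]) -> f64;
    fn update(&mut self, point: &[f64]);
}

/// Stand-in model: scores a point by its Euclidean distance from the running mean.
pub struct RunningMean {
    sum: Vec<f64>,
    count: usize,
}

impl RunningMean {
    pub fn new(dimension: usize) -> Self {
        RunningMean { sum: vec![0.0; dimension], count: 0 }
    }
}

impl StreamingModel for RunningMean {
    fn score(&self, point: &[f64]) -> f64 {
        if self.count == 0 {
            return 0.0; // nothing has been learned yet
        }
        let n = self.count as f64;
        self.sum
            .iter()
            .zip(point)
            .map(|(s, p)| (p - s / n).powi(2))
            .sum::<f64>()
            .sqrt()
    }

    fn update(&mut self, point: &[f64]) {
        for (s, p) in self.sum.iter_mut().zip(point) {
            *s += p;
        }
        self.count += 1;
    }
}

/// Score each comma-delimited row first, then update the model with it, and
/// return every row with its score appended as an extra column.
pub fn score_rows<M: StreamingModel>(
    model: &mut M,
    input: &str,
) -> Result<Vec<String>, std::num::ParseFloatError> {
    let mut output = Vec::new();
    for line in input.lines().filter(|l| !l.trim().is_empty()) {
        let point: Vec<f64> = line
            .split(',')
            .map(|field| field.trim().parse::<f64>())
            .collect::<Result<_, _>>()?;
        let score = model.score(&point); // scored before the model sees the point
        model.update(&point);
        output.push(format!("{},{}", line.trim(), score));
    }
    Ok(output)
}
```

Scoring before updating means a point never influences its own score, which is why the outlier in the example data stands out. Like the Java runner, this sketch splits on commas naively and is not meant for production input.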

Rust/.gitignore

+20
@@ -0,0 +1,20 @@
+################################################################################
+# Additional Ignores
+################################################################################
+*~
+.vscode/
+
+################################################################################
+# GitHub Rust GitIgnore
+################################################################################
+# Generated by Cargo
+# will have compiled files and executables
+debug/
+target/
+
+# Remove Cargo.lock from gitignore if creating an executable, leave it for libraries
+# More information here https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html
+Cargo.lock
+
+# These are backup files generated by rustfmt
+**/*.rs.bk

Rust/Cargo.toml

+16
@@ -0,0 +1,16 @@
+[package]
+name = "random-cut-forest"
+version = "0.1.0"
+authors = ["Chris Swierczewski <csw@amazon.com>"]
+edition = "2018"
+
+[dependencies]
+num-traits = "0.2"
+rand = "0.8.3"
+rand_chacha = "0.3.0"
+rand_distr = "0.4.0"
+slab = "0.4.2"
+
+[dev-dependencies]
+clap = "3.0.0-beta.2"
+csv = "1.1"

Rust/README.md

+64
@@ -0,0 +1,64 @@
+# Random Cut Forest
+
+This directory contains a Rust implementation of the Random Cut Forest (RCF)
+data structure and algorithms for anomaly detection, density estimation,
+imputation, and forecasting. The goal of this package is to provide a
+high-performance implementation of RCF in Rust as well as the backend for the
+Python bindings also contained in this repository.
+
+## Usage
+
+To use this library, add the following to your `Cargo.toml`:
+
+```toml
+[dependencies]
+random-cut-forest = "0.1.0"
+```
+
+The two main types provided by this package are `RandomCutForest` and
+`RandomCutForestBuilder`. The latter creates a `RandomCutForest` using a
+combination of required and optional construction parameters.
+
+Below is an example showing RCF construction, training, and anomaly scoring.
+
+```rust
+use random_cut_forest::{RandomCutForest, RandomCutForestBuilder};
+
+// build a random cut forest. the dimension is the only required parameter
+let mut rcf: RandomCutForest<f32> = RandomCutForestBuilder::new(2)
+    .sample_size(256) // # of samples per tree
+    .num_trees(50) // # of trees in the model
+    .build(); // build forest from configuration
+
+// train the model on a collection of vectors
+for point in data.iter() {
+    rcf.update(point.clone());
+}
+
+// compute anomaly scores using the trained model
+let anomaly_scores: Vec<f32> = data.iter()
+    .map(|p| rcf.anomaly_score(p))
+    .collect();
+```
+
+## Examples and CLI Programs
+
+See the `examples/` directory for example usage of this package. Some examples
+can be run as command-line programs. Try running,
+
+```sh
+$ cargo run --release --example [EXAMPLE_NAME] -- --help
+```
+
+to see example-specific usage instructions. The `--release` build significantly
+improves performance of these example CLI tools, especially if you are running
+these scripts on larger data sets. Note that these example scripts are ***not
+intended for production use***.
+
+## References
+
+* Guha, Sudipto, Nina Mishra, Gourav Roy, and Okke Schrijvers. *"Robust random
+  cut forest based anomaly detection on streams."* In International conference
+  on machine learning, pp. 2712-2721. PMLR, 2016. ([pdf][rcf-paper])
+
+[rcf-paper]: http://proceedings.mlr.press/v48/guha16.pdf
+88
@@ -0,0 +1,88 @@
+//! Streaming anomaly scores command line application.
+//!
+//! This example shows how to read data from an input CSV file and output
+//! streaming anomaly scores. By "streaming", we mean that each observation is
+//! first scored and then the model is updated with the observation.
+//!
+//! In this example, we use the `clap` package for a basic CLI. We use the `csv`
+//! package to parse the input CSV data to be fed into an RCF model.
+//!
+extern crate clap;
+use clap::{AppSettings, Clap};
+
+extern crate csv;
+
+use random_cut_forest::{RandomCutForest, RandomCutForestBuilder};
+
+use std::error::Error;
+use std::io;
+use std::process;
+
+/// Streaming random cut forest anomaly scoring.
+///
+/// Comma-delimited data is accepted via stdin. Anomaly scores are output to
+/// stdout. To read from file use the standard redirects. CSV headers are
+/// automatically ignored. Many data sets contain a timestamp in the first
+/// column. The --ignore-first-column flag is useful in this situation.
+///
+#[derive(Clap)]
+#[clap(setting=AppSettings::ColoredHelp)]
+struct Opts {
+    /// Dimensionality of the input
+    #[clap(short, long)]
+    dimension: usize,
+
+    /// Number of trees used in the model
+    #[clap(short, long, default_value="50")]
+    num_trees: usize,
+
+    /// Number of samples per tree
+    #[clap(short, long, default_value="256")]
+    sample_size: usize,
+
+    /// Parameter for time-decay reservoir sampling
+    #[clap(short, long, default_value="0.000390625")]
+    time_decay: f32,
+
+    /// Ignore the first column of input. (e.g. timestamps)
+    #[clap(long)]
+    ignore_first_column: bool,
+}
+
+fn run(rcf: &mut RandomCutForest<f32>, ignore_first_column: bool) -> Result<(), Box<dyn Error>> {
+    let dimension = rcf.dimension();
+    let start_index: usize = match ignore_first_column {
+        true => 1,
+        false => 0,
+    };
+
+    let mut rdr = csv::Reader::from_reader(io::stdin());
+    for result in rdr.records() {
+        let record = result?;
+
+        let mut point: Vec<f32> = Vec::with_capacity(dimension);
+        for i in start_index..(dimension + start_index) {
+            let value: f32 = record.get(i).unwrap().parse::<f32>().unwrap();
+            point.push(value);
+        }
+
+        let score = rcf.anomaly_score(&point);
+        rcf.update(point);
+        println!("{}", score);
+    }
+    Ok(())
+}
+
+fn main() {
+    let opts = Opts::parse();
+    let mut rcf: RandomCutForest<f32> = RandomCutForestBuilder::new(opts.dimension)
+        .num_trees(opts.num_trees)
+        .sample_size(opts.sample_size)
+        .time_decay(opts.time_decay)
+        .build();
+
+    if let Err(err) = run(&mut rcf, opts.ignore_first_column) {
+        println!("error running example: {}", err);
+        process::exit(1);
+    }
+}
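The `run` function in the example above builds each point with `record.get(i).unwrap().parse::<f32>().unwrap()`, which panics on a short row or a non-numeric field. The sketch below shows one way the same step could report an error instead. It is an illustration using only the standard library, not part of the commit: `parse_point` and `PointError` are hypothetical names, and the fields are passed as plain string slices rather than a `csv` record.

```rust
use std::fmt;

/// Errors that can occur while turning one row of text fields into a point.
/// (Hypothetical type introduced for this sketch.)
#[derive(Debug, PartialEq)]
pub enum PointError {
    /// The row has no field at the given column index.
    MissingField { index: usize },
    /// The field at the given column index is not a valid `f32`.
    BadNumber { index: usize, text: String },
}

impl fmt::Display for PointError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PointError::MissingField { index } => {
                write!(f, "missing field at column {}", index)
            }
            PointError::BadNumber { index, text } => {
                write!(f, "cannot parse {:?} at column {} as f32", text, index)
            }
        }
    }
}

impl std::error::Error for PointError {}

/// Read `dimension` values from `fields`, optionally skipping the first column
/// (for example a timestamp), and return an error instead of panicking.
pub fn parse_point(
    fields: &[&str],
    dimension: usize,
    ignore_first_column: bool,
) -> Result<Vec<f32>, PointError> {
    let start_index = if ignore_first_column { 1 } else { 0 };
    let mut point = Vec::with_capacity(dimension);
    for i in start_index..(start_index + dimension) {
        let text = fields
            .get(i)
            .ok_or(PointError::MissingField { index: i })?;
        let value = text.trim().parse::<f32>().map_err(|_| PointError::BadNumber {
            index: i,
            text: (*text).to_string(),
        })?;
        point.push(value);
    }
    Ok(point)
}
```

Because `PointError` implements `std::error::Error`, a call such as `parse_point(&fields, dimension, ignore_first_column)?` would fit inside a function returning `Result<(), Box<dyn Error>>`, so malformed rows surface through the existing error path rather than as a panic.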
