This project implements a RESTful data aggregator. A client uploads a file with a large number of records (on the order of 64 million) as either a csv or a json file, specifies a column to group on and a column to aggregate, and receives the aggregated data back as csv or json.
POST /api/v1/upload/ — upload a large file (either csv or json).
POST /api/v1/aggregate/ — perform aggregation on the previously uploaded file.
- Upload File:
Request:
    curl -i -F file=@<file_path> -F "format=<csv|json>" /api/v1/upload/
Response:
    {"status":"Accepted", "url":"/upload/666b3b22-f161-11e5-9670-060c1144530b", "token":"666b3b22-f161-11e5-9670-060c1144530b"}
On a successful upload the server returns a token to the client. The client must send this token back to the server when it wants to perform aggregation.
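The token in the sample response has the shape of a version-1 UUID. A minimal sketch of how a server might mint and track such tokens (the function name and the in-memory `uploads` mapping are illustrative assumptions, not the project's actual code):

```python
import uuid


def mint_upload_token(uploads: dict, file_path: str) -> str:
    """Mint a token for an uploaded file and remember where it was stored.

    `uploads` maps token -> stored file path (a hypothetical storage scheme).
    """
    token = str(uuid.uuid1())  # e.g. "666b3b22-f161-11e5-9670-060c1144530b"
    uploads[token] = file_path
    return token


uploads = {}
token = mint_upload_token(uploads, "/tmp/TestData.csv")
print(token)
```

A later aggregation request would then look up the uploaded file by this token.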
- Aggregate File:
Request:
    curl -d "token=666b3b22-f161-11e5-9670-060c1144530b&aggOn=count&grpOn=last_name&outType=csv" /api/v1/aggregate/
Response:
    Either a csv or json aggregated file, streamed as a download.
The client passes the token, the grpOn and aggOn parameters, and outType to indicate the format it expects the results back in.
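Once the aggregation has been computed, the result can be serialized according to outType. A minimal sketch (the helper name is an assumption, not the project's actual code):

```python
import csv
import io
import json


def serialize_result(result: dict, out_type: str) -> str:
    """Render a {group_key: aggregated_value} mapping as csv or json text."""
    if out_type == "json":
        return json.dumps(result)
    # Default to csv: one "key,value" row per group.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for key, value in result.items():
        writer.writerow([key, value])
    return buf.getvalue()


print(serialize_result({"Skywalker": 72, "Ren": 100}, "json"))
```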
Sample File:
------------
first_name,last_name,count
Luke,Skywalker,42
Leia,Skywalker,10
Anakin,Skywalker,20
Admiral,Ackbar,10
Admiral,Thrawn,10
Kylo,Ren,100
Command:
--------
grpOn=last_name, aggOn=count
Output:
-------
Skywalker:72
Ackbar:10
Thrawn:10
Ren:100
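The group-and-sum step above can be sketched with Python's csv module and a defaultdict. This is a simplified in-memory version using the sample data, not necessarily how views.py processes large uploads:

```python
import csv
import io
from collections import defaultdict

SAMPLE = """first_name,last_name,count
Luke,Skywalker,42
Leia,Skywalker,10
Anakin,Skywalker,20
Admiral,Ackbar,10
Admiral,Thrawn,10
Kylo,Ren,100
"""


def aggregate(csv_text: str, grp_on: str, agg_on: str) -> dict:
    """Sum the `agg_on` column per distinct value of the `grp_on` column."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[grp_on]] += int(row[agg_on])
    return dict(totals)


print(aggregate(SAMPLE, "last_name", "count"))
# → {'Skywalker': 72, 'Ackbar': 10, 'Thrawn': 10, 'Ren': 100}
```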
All the functionality can be found in: RESTfulDataAggregator/aggregator/api/views.py
Routing rules are in: RESTfulDataAggregator/aggregator/aggregator/urls.py
You can also find a utils folder that contains test data generation scripts.
python RESTfulDataAggregator/utils/TestDataGenerator/dataGenerator.py --fileType csv --fileSize 10
We specify the file type and the size of the test data file in MB. The script generates a first_name,last_name,count csv file of the required size by randomizing data from
RESTfulDataAggregator/utils/TestDataGenerator/firstNames.in and RESTfulDataAggregator/utils/TestDataGenerator/lastNames.in
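A minimal sketch of what such a generator might do. The real dataGenerator.py may differ; here the name pools are inlined stand-ins instead of being read from the .in files:

```python
import random

FIRST_NAMES = ["Luke", "Leia", "Anakin", "Admiral", "Kylo"]  # stand-in for firstNames.in
LAST_NAMES = ["Skywalker", "Ackbar", "Thrawn", "Ren"]        # stand-in for lastNames.in


def generate_csv(path: str, target_mb: int) -> None:
    """Write random first_name,last_name,count rows until roughly target_mb MB."""
    target_bytes = target_mb * 1024 * 1024
    written = 0
    with open(path, "w") as out:
        header = "first_name,last_name,count\n"
        out.write(header)
        written += len(header)
        while written < target_bytes:
            row = (f"{random.choice(FIRST_NAMES)},"
                   f"{random.choice(LAST_NAMES)},"
                   f"{random.randint(1, 100)}\n")
            out.write(row)
            written += len(row)


generate_csv("/tmp/TestData.csv", 1)
```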
You can find the output of this script at: RESTfulDataAggregator/utils/DataSource/TestData.csv
For 1 GB of test data there are roughly 64 million records, and the aggregation completes in about 30 seconds on an AWS EC2 instance.