Data pipeline to replicate the NPM Registry using Python, CouchDB, Kafka, Kafka-ui, Prometheus and Grafana. Meant to be used for research purposes.
The data_pipeline/npm-mirror folder contains all the critical files needed to run the whole pipeline as multiple Docker containers with a single command. The files there are as follows -
- node_app/producer.ts - contains the code to read changes from the NPM Registry and add them to a Kafka topic
- app/changes_consumer.py - contains the code to consume change messages from the Kafka stream and process each change by downloading the metadata and tarball corresponding to the change, saving these files in the appropriate directory structure, and adding the corresponding record to the mirrored database. The script also produces running logs through various Kafka topics.
- app/kafka_admin.py - contains code to manage the Kafka broker and view its metadata
- node_app/config.json - read by the producer to determine where to start reading changes from, based on the NPM API sequence ID. This seq ID is read only if the Kafka topic 'npm-changes' is empty. If 'npm-changes' is not empty, the producer starts from the seq ID of the last message stored in the 'npm-changes' topic (see the sketch after this list).
- dbdata - stores the mirrored database files
- Dockerfile - contains the configuration to build the Python scripts in the app subdirectory into a Docker container image
- docker-compose.yml - contains container configuration for running Python scripts, CouchDB, Kafka, Kafka-ui, Prometheus and Grafana
- prometheus.yml - contains the Prometheus configuration
- requirements.txt - contains packages needed to run the Python scripts
- run_scripts.sh - contains the commands executed inside the Python scripts container to launch the various Python scripts appropriately.
- update_seq/* - stores the seq ID of the last message processed along with the full last message json
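The resume behaviour described for node_app/config.json above can be summarized in code. Below is a minimal Python sketch of that logic (the actual producer is written in TypeScript in node_app/producer.ts); the topic name and the config.json layout come from this README, while the broker address, the 'seq' field access, and the helper names are assumptions for illustration.

```python
import json
from kafka import KafkaConsumer, TopicPartition

def resolve_start_seq(bootstrap="localhost:9092", topic="npm-changes"):
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    parts = [TopicPartition(topic, p)
             for p in (consumer.partitions_for_topic(topic) or [])]
    end_offsets = consumer.end_offsets(parts) if parts else {}
    if not parts or all(end == 0 for end in end_offsets.values()):
        # Topic is empty (or absent): fall back to config.json/update_seq,
        # which is either 'now' or a specific seq ID as a string.
        with open("node_app/config.json") as f:
            return json.load(f)["update_seq"]
    # Otherwise resume from the seq ID of the last stored message:
    # read the final record of each partition and take the highest seq.
    last_seqs = []
    for tp, end in end_offsets.items():
        if end == 0:
            continue
        consumer.assign([tp])
        consumer.seek(tp, end - 1)
        for records in consumer.poll(timeout_ms=5000).values():
            last_seqs.append(json.loads(records[-1].value)["seq"])
    return max(last_seqs)
```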
Some important parameters that can be configured in the Python scripts -
- LOCAL_PACKAGE_DIR - the path where the files are locally downloaded temporarily for compression
- REMOTE_PACKAGE_DIR - the remote file directory path where the compressed files are finally stored
- MAX_SIZE - the maximum file size; packages bigger than this are not stored on the server
- KAFKA_TOPIC_NUM_PARTITIONS - number of partitions each Kafka topic has
- SUBDIRECTORY_HASH_LENGTH - the number of hash characters used to bucket packages into subdirectories of the remote directory, keeping packages organized for quicker access from the file system (both this and MAX_SIZE are illustrated in the sketch after this list)
- OLD_PACKAGE_VERSIONS_LIMIT - limits the max number of package versions to be kept in the remote file directory
- DATABASE_NAME - the name of the database on the CouchDB server under which the changes are recorded
- config.json/update_seq - stores the seq ID from which the producer will start if 'npm-changes' is empty. Set to 'now' to start the producer from the latest seq ID, or set it to a specific seq ID as a string
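As a rough illustration of how MAX_SIZE and SUBDIRECTORY_HASH_LENGTH fit into the directory layout, here is a minimal Python sketch; the parameter values, the sha256 hash choice, and the helper names are assumptions for illustration, not the scripts' actual implementation.

```python
import hashlib
import os

MAX_SIZE = 50 * 1024 * 1024               # example value: 50 MB cap per package
REMOTE_PACKAGE_DIR = "/NPM/npm-packages"  # example remote path
SUBDIRECTORY_HASH_LENGTH = 2              # first N hex chars pick the subdirectory

def remote_path(package_name: str, filename: str) -> str:
    # Bucket packages by a short hash prefix so no single directory
    # grows too large for fast file-system access.
    digest = hashlib.sha256(package_name.encode()).hexdigest()
    bucket = digest[:SUBDIRECTORY_HASH_LENGTH]
    return os.path.join(REMOTE_PACKAGE_DIR, bucket, package_name, filename)

def should_store(local_file: str) -> bool:
    # Skip packages whose compressed size exceeds MAX_SIZE.
    return os.path.getsize(local_file) <= MAX_SIZE
```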
Notes -
- The files are first downloaded and compressed locally to facilitate faster transfer to the remote directory
- If a Kafka topic already exists, its number of partitions won't change when the above parameter is changed. The topic has to be deleted and then reinitialized for the change to take effect, as shown in the sketch below.
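A minimal sketch of that delete-and-reinitialize step using kafka-python's admin client; the broker address and new partition count are examples, and note that deleting the topic also discards its stored messages.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Delete the existing topic; Kafka performs the deletion asynchronously,
# so in practice wait for it to complete before recreating.
admin.delete_topics(["npm-changes"])
# Recreate with the desired KAFKA_TOPIC_NUM_PARTITIONS value.
admin.create_topics([NewTopic(name="npm-changes",
                              num_partitions=4,
                              replication_factor=1)])
admin.close()
```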
Move to the directory containing the docker-compose.yml -
cd /home/adeepb/data_pipeline/NPM-Mirror/data_pipeline/npm-mirror
To build / rebuild the Docker image-
docker build -t npm-mirror .
To attach to tmux session-
tmux attach -t npm-mirror-run
To run the containers-
docker-compose up --quiet-pull
To detach from tmux session-
Control + B followed by D
To enter terminal for running container-
docker exec -it npm-mirror_npm-mirror_1 /bin/bash
To view container logs-
docker-compose logs npm-mirror
To delete CouchDB data-
cd /home/adeepb/data_pipeline/NPM-Mirror/data_pipeline/npm-mirror
sudo rm -rf dbdata
mkdir dbdata
To delete downloaded packages / jsons-
cd /NPM
sudo rm -rf npm-packages
mkdir npm-packages
To reset kafka broker-
docker-compose rm broker-npm
To reset zookeeper-
docker-compose rm zookeeper-npm
To reset producer / consumer-
docker-compose rm npm-mirror
Note - check the config parameters before running, especially config.json/update_seq
To attach to tmux session-
tmux attach -t npm-mirror-run
Control + C to stop all running containers (make sure to be in the npm-mirror directory where the docker-compose file is present)
To restart the containers-
docker-compose up --quiet-pull
To detach from tmux session-
Control + B followed by D
Note - If the service has been down for a while, the Kafka messages may have been cleared out while the old ZooKeeper offsets persist, so the Kafka broker and ZooKeeper containers need to be removed if a 'connection refused' error is displayed upon docker-compose up.
Sometimes the Kafka producer might get disconnected from the NPM API and stop reading new changes after making too many requests. In such a case, the producer is restarted automatically if the lag between the last seq ID sent to the Kafka broker and the latest seq ID in the NPM registry grows by more than 200 beyond the initial lag.
To check initial lag-
cd /home/adeepb/data_pipeline/NPM-Mirror/data_pipeline/npm-mirror
docker-compose logs npm-mirror | grep 'Lag'
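For reference, a minimal Python sketch of the lag check described above; the registry endpoint, the integer 'update_seq' field, and the helper names are assumptions for illustration.

```python
import requests

RESTART_LAG_GROWTH = 200  # restart once lag grows this much beyond the initial lag

def registry_latest_seq() -> int:
    # The npm replication endpoint exposes CouchDB-style database info,
    # including the latest update_seq (assumed here to be an integer).
    return int(requests.get("https://replicate.npmjs.com/").json()["update_seq"])

def producer_needs_restart(last_sent_seq: int, initial_lag: int) -> bool:
    current_lag = registry_latest_seq() - last_sent_seq
    return current_lag - initial_lag > RESTART_LAG_GROWTH
```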
- View Database UI- http://127.0.0.1:5984/_utils/
- View Kafka-ui - http://localhost:8081/
- View Prometheus - http://feature.isri.cmu.edu:9090/targets?search=
- View Grafana Dashboard- http://localhost:3001/
Note: Appropriate port forwarding from the server's (VM's) port to the local computer's port will be needed to view these in a browser
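For example, to forward the Grafana port over SSH (the username and host here are placeholders; repeat with the other ports as needed)-
ssh -L 3001:localhost:3001 adeepb@feature.isri.cmu.edu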