Distributed Character RNN

This is a port for running a character RNN with distributed TensorFlow. Based on the original code from


Basic Usage

1. For launching a distributed training environment

Step 1: If you want data-parallel training, first make sure your data is sharded. To do that, run the following command:

# Call with -h option for more help
python --data_dir data/tinyshakespeare --num_parts 2 --out_dir sharded_data

This will create data-<num>.npy files in the out_dir, where num is the partition number. Note that the vocabulary is not partitioned, since it is shared across all nodes and must be the same globally. If vocabulary creation were left to runtime, it would differ for each partition.
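To illustrate why the vocabulary has to be built once over the whole corpus, here is a minimal sketch in plain Python. It is not the repository's sharding script: the function names `build_vocab` and `shard_text` are hypothetical, and the shards are kept as in-memory lists rather than written to .npy files.

```python
def build_vocab(text):
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def shard_text(text, num_parts):
    """Split text into num_parts contiguous pieces of near-equal length."""
    size, extra = divmod(len(text), num_parts)
    shards, start = [], 0
    for part in range(num_parts):
        # The first `extra` shards take one additional character each.
        end = start + size + (1 if part < extra else 0)
        shards.append(text[start:end])
        start = end
    return shards


text = "hello world, hello shards"

# Build the vocabulary ONCE over the full corpus, then shard.
vocab = build_vocab(text)
shards = shard_text(text, 2)

# Every shard is encoded with the same global vocabulary.
encoded = [[vocab[ch] for ch in shard] for shard in shards]

# For comparison: vocabularies built per shard at runtime.
# Characters missing from a shard shift the ids of the others.
per_shard_vocabs = [build_vocab(shard) for shard in shards]
```

With the global vocabulary, a given character has the same id on every node. With per-shard vocabularies the ids can disagree, so the nodes would be training on incompatible encodings.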

Step 2: You need to launch each node as a different process. The command for launching any node is

 python --distributed --ps_hosts --worker_hosts, --job_name $job_name --task_index $task_index --save_dir distrib-train

OR, execute the file launch.bat or to quickly launch a distributed experiment with default settings.

The --job_name option takes the value ps or worker, depending on the node's role. Refer to this TF tutorial for more information on these roles. Similarly, --task_index takes an integer indicating which node it is: the i-th worker node takes the value i.
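As a rough sketch of how these flags relate to each other, the snippet below parses them and builds a cluster description. It assumes that --ps_hosts and --worker_hosts take comma-separated host:port lists, and the helper names `parse_hosts` and `build_cluster` are hypothetical. The returned dictionary has the shape that `tf.train.ClusterSpec` accepts, but TensorFlow is deliberately not imported here.

```python
import argparse


def parse_hosts(value):
    """Split a comma-separated host list, ignoring empty entries."""
    return [host.strip() for host in value.split(",") if host.strip()]


def build_cluster(argv):
    """Parse the node flags and return (cluster, job_name, task_index)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", required=True)
    parser.add_argument("--worker_hosts", required=True)
    parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
    parser.add_argument("--task_index", type=int, required=True)
    args = parser.parse_args(argv)

    # Job name -> list of addresses; this dict could be passed to
    # tf.train.ClusterSpec in a real training script.
    cluster = {
        "ps": parse_hosts(args.ps_hosts),
        "worker": parse_hosts(args.worker_hosts),
    }

    # task_index selects this node's own address within its job.
    hosts = cluster[args.job_name]
    if not 0 <= args.task_index < len(hosts):
        raise ValueError(
            "task_index %d is out of range for job %r with %d host(s)"
            % (args.task_index, args.job_name, len(hosts))
        )
    return cluster, args.job_name, args.task_index


# Example addresses are placeholders, not defaults from this repository.
cluster, job_name, task_index = build_cluster([
    "--ps_hosts", "localhost:2222",
    "--worker_hosts", "localhost:2223,localhost:2224",
    "--job_name", "worker",
    "--task_index", "1",
])
```

Every node receives the same two host lists. Only the job name and task index change from one process to the next.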

For more options, run python --help. Note that any options you set must be the same across all nodes, except for node-dependent settings such as job_name and task_index.

2. For running in a single process, without the distributed mode

To train with default parameters on the tinyshakespeare corpus, run python To access all the parameters use python --help.

To sample from a checkpointed model, python Sampling while training is still in progress (to check the last checkpoint) works only on the CPU or on another GPU. To force CPU mode, use export CUDA_VISIBLE_DEVICES="" and unset CUDA_VISIBLE_DEVICES afterward (on Windows, set CUDA_VISIBLE_DEVICES="" and set CUDA_VISIBLE_DEVICES= respectively).
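The same effect can be obtained from inside Python. The sketch below is not part of this repository, and the helper name `cpu_only` is hypothetical. It hides the GPUs by setting CUDA_VISIBLE_DEVICES to an empty string and restores the previous value afterward. For this to work, the variable generally has to be set before the framework initializes CUDA, so in practice the TensorFlow import and the sampling code should happen inside the block.

```python
import os
from contextlib import contextmanager


@contextmanager
def cpu_only():
    """Temporarily hide all CUDA devices from this process."""
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    try:
        yield
    finally:
        # Restore the earlier state: unset if it was unset before,
        # otherwise put the original value back.
        if previous is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = previous


with cpu_only():
    # Import TensorFlow and run sampling here so that no GPU is claimed
    # while another process is still training on it.
    hidden_value = os.environ["CUDA_VISIBLE_DEVICES"]
```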

To continue training after an interruption or to run for more epochs, python --init_from=save


You can use any plain text file as input. For example, you could download The Complete Sherlock Holmes as follows:

cd data
mkdir sherlock
cd sherlock
mv cnus.txt input.txt

Then start training from the top-level directory using python --data_dir=./data/sherlock/

A quick tip to concatenate many small disparate .txt files into one large training file: ls *.txt | xargs -L 1 cat >> input.txt.
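On systems without xargs (for example Windows), a rough Python equivalent might look like the sketch below. The function name `concat_txt_files` is hypothetical. Like the shell one-liner, it appends to input.txt, but it skips input.txt itself as a source and reads the files in sorted order, assuming UTF-8 text.

```python
from pathlib import Path


def concat_txt_files(directory, output_name="input.txt"):
    """Append every .txt file in directory to output_name."""
    directory = Path(directory)
    output = directory / output_name

    # Collect the sources first so the output file is never read back in.
    sources = sorted(
        path for path in directory.glob("*.txt") if path.name != output_name
    )

    # Mode "a" mirrors the shell's >> redirection.
    with output.open("a", encoding="utf-8") as out:
        for source in sources:
            out.write(source.read_text(encoding="utf-8"))
    return sources
```

Calling `concat_txt_files("data/sherlock")` would then build data/sherlock/input.txt from the other .txt files in that directory.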


To visualize training progress, model graphs, and internal state histograms, fire up TensorBoard and point it at your log_dir. For example:

$ tensorboard --logdir=./logs/

Then open a browser to http://localhost:6006 or the correct IP/Port specified.


Feel free to send pull requests, especially ones that simplify the setup as much as possible.