diff --git a/.gitignore b/.gitignore
index c7a203c..8fd8e2a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -140,5 +140,6 @@ cython_debug/
 # IDEs
 .idea
-# outputs
-outputs/
\ No newline at end of file
+# itsybitsy
+outputs/
+.lastrun.json
diff --git a/README.md b/README.md
index ea6f13a..e268a3a 100644
--- a/README.md
+++ b/README.md
@@ -10,14 +10,11 @@ Configure charlotte, give it a seed node, and it crawls the graph/tree of your s
 * python >= 3.8 was chosen in order to use unittest.mock AsyncMock
 * dot/graphviz binaries installed in system PATH (e.g. `brew install graphviz`)
-
-## Configure itsybitsy in 8 easy steps!
-1. Clone itsybitsy
-    1. `git clone git@github.com/life360/itsybitsy`
+## Configure itsybitsy in 7 easy steps!
 1. Review the example project in [examples/example-project(examples/example-project)]
 1. Start a new project / empty folder
     1. `mkdir myitsybitsy && cd myitsybitsy`
-    1. `echo "-e /Users/patrick/repos/itsybitsy" > requirements.txt`
+    1. `echo "git+ssh://git@github.com/life360/itsybitsy.git#egg=itsybitsy" > requirements.txt`
     1. `pip install -r requirements.txt`
 1. Configure charlotte - the configuration engine with which you will describe your service graph to itsybitsy
     1. `mkdir charlotte.d`
@@ -31,7 +28,6 @@ Configure charlotte, give it a seed node, and it crawls the graph/tree of your s
 1. Hint: `spider.conf` is always inherited, but you can create different profiles such as `spider.prod.conf` and reference them with the `--profile` arg
 1. Note: unlike the `spider` command, `render` is written to stand alone and parse the default json file in `outputs/.lastrun.json` it requires no arguments by default.
-
 ## Use
 #### 1 Run in `spider` mode:
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..44080e1
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,216 @@
+# V1
+
+* [x] output the graph basic
+* [x] graph output prettier
+* [x] combine http and https backends
+* [x] add --max-depth
+* [x] detect service name from /etc/chef/client.rb
+* [x] add nsq graph
+* [x] add --skip-nsq-topics
+* [x] detect defunct children via haproxy stats
+* [x] skip display defunct children
+* [x] haproxy 1.6 and 1.8 compatibility
+* [x] add error state for "null" NSQ clients (use "?")
+* [x] detect missing stats socket haproxy
+* [x] task: validate missing haproxy stats socket config is live manually with knife ssh
+* [x] task: validate missing haproxy stats socket is live manually with knife ssh
+* [x] add "no consumer" NSQ detection
+* [x] task: validate DEFUNCT-ness
+* [x] multiple seeds
+
+# V2.async
+* [x] use asyncssh (10x faster!)
+* [x] remove `return_exceptions=True`
+* [x] use global BASTION connection (https://github.com/ronf/asyncssh/issues/270)
+* [x] limit concurrency w/ semaphore
+* [x] split to modules
+* [x] re-use ssh connection for get-name/get-config calls
+* [x] pass lightweight node-ref through async calls instead of node dict
+* [x] remove pending node print
+* [x] deal with formatting/output-ordering implications
+* [x] convert recursive crawl from `await` to `ensure_future`
+* [x] improve live output rendering
+* [x] fix introduced parent['last_sibling'] bug
+* [x] bug: cycle is correct in the tree, but rendering zombie children (only for first level cycles?)
+* [x] retry ssh connection 3 times, fine tune concurrency
+* [x] introduced: --output=stdout is now broken due to render_node_live
+* [x] rename water to water_spout, private module function
+* [x] consolidate `find..children` error checking
+* [x] validate frontend-router
+* [x] move connection semaphore to ssh_layer
+* [x] better trace/debug log levels
+* [x] consolidate nsq node relationships w/ multiple connections
+* [x] deal w/ SSH config: bastion & username
+* [x] refactors from PR review (reduce complexity, procedural styling)
+
+# V2.features
+* [x] DISPLAY: output in json
+* [x] DISPLAY: load json file
+* [x] DISPLAY: output in graphviz
+* [x] DISPLAY: graphviz source
+* [x] CRAWL: detect proxysql
+* [x] CRAWL: cassandra
+* [x] CRAWL: detect well known ports w/ netstat & AWS name lookup (cx, memcache, redis)
+* [x] CRAWL: detect postgres well known port - causing trouble w/ name lookup
+* [x] CRAWL: user defined links
+* [x] move hints/skips to web.yaml
+* [x] keep config.yaml
+* [x] CRAWL: kinesis
+
+# V2.refactor
+* [x] move grouping of nsq topics to application layer, on service_name instead of IP
+* [x] `config_errors` -> `warnings`, `crawl_errors` -> `errors`
+* [x] refactor ssh config to ssh config file
+* [x] refactor --hide-defunct to --skip-defunct and do not even (crawl)
+* [x] graphviz warn/error color coding
+* [x] remove "cruft" handling
+* [x] add quick filter to rewrite service_name mysql-main-port_3306 to mysql-main-r/o
+* [x] create objects or named tuples (dataclasses!)
+* [x] PEP8, 120 line length
+* [x] CHARLOTTE: make the `get_config` function into configurable parsers definable in YAML
+* [x] charlotte: replace 'null' response from NSQ for missing IP w/ actual None response
+* [x] charlotte: move crawl strategy exceptions (frontend-router) into charlotte
+* [x] charlotte: move blocking logic to charlotte
+* [x] charlotte: rename crawl_strategy -> crawl_provider on Node()
+* [x] charlotte: move service_name_rewrite to charlotte
+* [x] rename protocol_detail -> protocol_mux
+* [x] CHARLOTTE: --skip-{name} arguments
+* [x] --skip-defunct -> --hide-defunct
+* [x] refactor database named matching to port matching
+* [x] move skip services from globals to argparse
+* [x] move crawl_complete, name_lookup_complete to node.py
+* [x] charlotte config 1 file to directory of yaml files
+* [x] create default yaml file for argparse
+* [x] rename `ip` -> `instance_address`
+* [x] remove crawl strategy object from Node, denormalize (protocol, blocking)
+* [x] merge hints into pre-existing children w/ unknown address
+* [x] CORE: add sub commands for ['crawl', 'render-json']
+* [x] CORE (OSS): unit tests tests tests (round I - excluding `provider_*.py` and `crawl.py`)
+
+
+# V2.bugs
+* [x] BUG: nsq channels on same node are not grouping, again!
+* [x] there is a regression in cycle detection - spider against async-cake-handler to repro
+* [x] trim double quotes from service_name
+* [x] BUG: crawl of well known port is discovering random connections to frontend-routers, ELBs - fixed by chris r.
source ephemeral port filter
+* [x] `'CYCLE': f"service '{node['service_name']}' discovered as a parent of itself!",`
+* [x] paramiko nested exception outputting
+* [x] handle actually null (absent value) nsq consumer in addition to string literal "null"
+* [x] ascii renderer grouping by detail is persisting in memory (groupings)
+* [x] charlotte: move name parser exceptions (mysql-main) into charlotte
+* [x] we see many repeating group by service-name NSQ topic/channels repeating in ascii renderer
+* [x] catch timeout for crawling children
+* [x] remove trailing `_` from node_ref
+* [x] graphviz blocking is backwards
+* [x] regression defunct in parser check on num_connections == 0 is failing
+* [x] differentiate RDS databases found in AWS - currently all show as `rdsnetworkinterface`
+* [x] BUG: add __type__ to json serialization - currently brittle: key-ing off of random fields for deserialization
+* [x] infinite recursion bug introduced by the crawl hints. it had to do with the cached_nodes in crawl.py being a by_ref object; a deep-ish copy fixed it
+* [x] trying to crawl json that was outputted with --depth arg results in hanging `wait_for_crawl` to complete on nodes
+
+# V3 Kubernetes++
+* [x] CRAWL: kubernetes - take a hint
+* [x] CRAWL: kubernetes - name lookup, crawl
+* [x] support EKS cluster in a different AWS account than provider_aws
+
+
+# V3.refactor
+* [x] static code analysis (prospector) and forthcoming changes
+* [x] refactor providers to objects, remove SSH logic from crawl.py
+* [x] caching children in crawl.py instead of providers!!
+* [x] fix TIMEOUT logic
+* [x] put provider_args back in crawl strategies!
use **kwargs to pass args in code
+* [x] rewrite provider registration
+* [x] move provider constant refs from constants.py into providers
+* [x] rename errors.NULL_IP -> NULL_ADDRESS
+* [x] refactor signature of `crawl_downstream` to include address
+* [x] replace pass through node_ref in crawl w/ `zip()`
+* [x] unit tests for crawl, providers, provider_*?
+* [x] validate that crawl strategies are only used for specified providers
+* [x] refactor lookup_name to remove life360 business logic from providers!
+* [x] remove ProviderInterface::configure(), have ssh configure itself on first query
+* [x] seed provider is configurable command line arg w/
+
+# V3.features
+* [x] FEATURE: make instance_provider args for aws hints part of a refactored "profile"
+* [x] FEATURE: Distinguish kubernetes service shape in graphviz
+* [ ] add --stop-on-nonblocking CLI arg
+
+# V3.bugs
+* [x] not respecting CrawlStrategy.providers
+* [x] need to be able to configure different AWS profile for k8s/eks than for aws!
(for dev)
+* [x] BUG: intermittent timeout exceptions which do not result in program exit
+
+# V4.VOSS
+* [x] REFACTOR: (providers): providers as plugin architecture
+* [x] REFACTOR (spider): --concurrency -> --ssh-concurrency OR provider args
+* [x] REFACTOR: (all): refactor package architecture
+* [x] TIMEOUT: (crawl) robust provider timeout and exception handling
+* [x] OBSCURIFIER (render_*): obscurifier for output
+* [ ] LOGGER: rewrite logger access for community standards
+* [ ] PLUGPLAY: out of the box functionality by moving TCP to a "builtin" CrawlStrategy and using `hostname` or default service name
+* [ ] REFACTOR: (providers): rewrite take_a_hint to not return a list, just return a single NodeTransport
+* [ ] DOCS: rewrite docs in sphinx style and prepare for export to readthedocs.org
+
+# Backlog
+
+## New Features
+
+## Core
+* [ ] RENDER_PLUGINS: make renderers an abstract class w/ plugins
+* [ ] REFACTOR: move seed logic out of ./spider.py
+* [ ] REFACTOR: revisit the Node{Protocol, CrawlStrategy, protocol_mux} object relationship strategy
+* [ ] FEATURE: track whether a node was skipped for crawling and display as such in graphviz
+* [ ] REFACTOR: move errors/warnings to a global config
+* [ ] REFACTOR: do not block crawl() on lookup_name() in main crawl loop. will speed up many times
+* [ ] REFACTOR: move mutex from provider_ssh to crawl.py
+* [ ] BUG: intermittent timeouts crawling the whole tree - add retry to lookup_name/crawl_downstream?
+* [ ] BUG: remove `blocking` from CrawlStrategy - it should only be in Protocol
+* [ ] BUG: where is `elasticache-time-points`? crawl-netstat only takes 1 ip per port, so for async-soa which has 2 downstreams on 6379, it can't find
+* [ ] BUG: where is `cx-dvb`??
+* [ ] REFACTOR: consolidate Node::crawl_complete and crawl.py::_crawlable()
+* [ ] BUG: required args showing as optional in --help
+
+## Render Ascii
+* [ ] FEATURE: merge hints in ascii output
+
+## Render Graphviz
+* [ ] FEATURE: multiple seeds display with equal ranking
+* [ ] FEATURE: nsq topics as nodes rather than edges
+* [ ] FEATURE: visualize cycles
+* [ ] FEATURE: different visualization for cache vs database
+* [ ] FEATURE: create a legend
+
+## Render JSON
+
+## Render New
+* [ ] DISPLAY: output in vizceral format
+* [ ] DISPLAY: 'diff' run on multiple seed nodes and diff!
+
+## CrawlStrategies
+* [ ] BUG: HAProxy: functionality to detect bad HAProxy Config as a crawl error was lost in async refactor `if stdout.startswith('ERROR:'): return 'CRAWL ' + stdout.replace("\n","\t"), {}`
+* [ ] BUG: NSQ: misconfigured clients have null server (this is why we don't see rattail -> relapse), investigate & resolve
+* [ ] FEATURE: Netstat: use matchAddress for HAProxy crawl strategies to avoid timeout to RDS hostnames
+* [ ] FEATURE: crawl downstream - ability to specify more providers args per provider (so that k8s can selectively crawl containers)
+
+
+## Provider SSH
+* [ ] FEATURE: revisit whether `occupy_one_sempahore_space` is working (to dynamically configure --concurrency)
+* [ ] FEATURE: still getting ssh connection errors sometimes without --concurrency=10
+* [ ] FEATURE: configurable "~/.ssh/config" SSH profile
+* [ ] REFACTOR (provider_ssh): we shouldn't use known_hosts=None for security reasons
+
+## Provider AWS
+* [ ] FEATURE: lookup_name is slow, use async
+* [ ] CRAWL: dynamodb
+* [ ] CRAWL: SQS
+
+## Charlotte
+* [ ] FEATURE (charlotte): yaml validation by schema
+
+## Web
+
+# Trash Can
+* [ ] backwards compatibility for haproxy w/out stats socket
+* [ ] detect live traffic netstat/tcpdump/ebpf? (this was solved by using haproxy stats)
+* [ ] remove crawl_strategy from Node()