update README, import TODO, update .gitignore

life360-oss · Jul 2, 2020 · 5a2a263
1 parent b0de5a6
commit 5a2a263

Showing 3 changed files with 221 additions and 8 deletions.
diff --git a/.gitignore b/.gitignore
@@ -140,5 +140,6 @@ cython_debug/
 # IDEs
 .idea
 
-# outputs
-outputs/
+# itsybitsy
+outputs/
+.lastrun.json
diff --git a/README.md b/README.md
@@ -10,14 +10,11 @@ Configure charlotte, give it a seed node, and it crawls the graph/tree of your s
   * python >= 3.8 was chosen in order to use unittest.mock AsyncMock
 * dot/graphviz binaries installed in system PATH (e.g. `brew install graphviz`)
 
-
-## Configure itsybitsy in 8 easy steps!
-1. Clone itsybitsy
-    1. `git clone git@github.com/life360/itsybitsy`
+## Configure itsybitsy in 7 easy steps!
 1. Review the example project in [examples/example-project](examples/example-project)
 1. Start a new project / empty folder
     1. `mkdir myitsybitsy && cd myitsybitsy`
-    1. `echo "-e /Users/patrick/repos/itsybitsy" > requirements.txt`
+    1. `echo "git+ssh://git@github.com/life360/itsybitsy.git#egg=itsybitsy" > requirements.txt`
     1. `pip install -r requirements.txt`
 1. Configure charlotte - the configuration engine with which you will describe your service graph to itsybitsy
     1. `mkdir charlotte.d`
@@ -31,7 +28,6 @@ Configure charlotte, give it a seed node, and it crawls the graph/tree of your s
   1. Hint: `spider.conf` is always inherited, but you can create different profiles such as `spider.prod.conf` and reference them with the `--profile` arg
 1. Note: unlike the `spider` command, `render` is written to stand alone and parse the default json file in `outputs/.lastrun.json`; it requires no arguments by default.
 
-
 ## Use
 #### 1 Run in `spider` mode:
 

diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,216 @@
+# V1
+
+* [x] output the graph basic
+* [x] graph output prettier
+* [x] combine http and https backends
+* [x] add --max-depth
+* [x] detect service name from /etc/chef/client.rb
+* [x] add nsq graph
+* [x] add --skip-nsq-topics
+* [x] detect defunct children via haproxy stats
+* [x] skip display defunct children
+* [x] haproxy 1.6 and 1.8 compatibility
+* [x] add error state for "null" NSQ clients (use "?")
+* [x] detect missing stats socket haproxy
+* [x] task: validate missing haproxy stats socket config is live manually with knife ssh
+* [x] task: validate missing haproxy stats socket is live manually with knife ssh
+* [x] add "no consumer" NSQ detection
+* [x] task: validate DEFUNCT-ness
+* [x] multiple seeds
+
+# V2.async
+* [x] use asyncssh (10x faster!)
+* [x] remove `return_exceptions=True`
+* [x] use global BASTION connection (https://github.com/ronf/asyncssh/issues/270)
+* [x] limit concurrency w/ semaphore
+* [x] split to modules
+* [x] re-use ssh connection for get-name/get-config calls
+* [x] pass lightweight node-ref through async calls instead of node dict
+* [x] remove pending node print
+* [x] deal with formatting/output-ordering implications
+* [x] convert recursive crawl from `await` to `ensure_future`
+* [x] improve live output rendering
+* [x] fix introduced parent['last_sibling'] bug
+* [x] bug: cycle is correct in the tree, but rendering zombie children (only for first level cycles?)
+* [x] retry ssh connection 3 times, fine tune concurrency
+* [x] introduced: --output=stdout is now broken due to render_node_live
+* [x] rename water to water_spout, private module function
+* [x] consolidate `find..children` error checking
+* [x] validate frontend-router
+* [x] move connection semaphore to ssh_layer
+* [x] better trace/debug log levels
+* [x] consolidate nsq node relationships w/ multiple connections
+* [x] deal w/ SSH config: bastion & username
+* [x] refactors from PR review (reduce complexity, procedural styling)
+
+# V2.features
+* [x] DISPLAY: output in json
+* [x] DISPLAY: load json file
+* [x] DISPLAY: output in graphviz
+* [x] DISPLAY: graphviz source
+* [x] CRAWL: detect proxysql
+* [x] CRAWL: cassandra
+* [x] CRAWL: detect well known ports w/ netstat & AWS name lookup (cx, memcache, redis)
+* [x] CRAWL: detect postgres well known port - causing trouble w/ name lookup
+* [x] CRAWL: user defined links
+* [x] move hints/skips to web.yaml
+* [x] keep config.yaml
+* [x] CRAWL: kinesis
+
+# V2.refactor
+* [x] move grouping of nsq topics to application layer, on service_name instead of IP
+* [x] `config_errors` -> `warnings`, `crawl_errors` -> `errors`
+* [x] refactor ssh config to ssh config file
+* [x] refactor --hide-defunct to --skip-defunct and do not even (crawl)
+* [x] graphviz warn/error color coding
+* [x] remove "cruft" handling
+* [x] add quick filter to rewrite service_name mysql-main-port_3306 to mysql-main-r/o
+* [x] create objects or named tuples (dataclasses!)
+* [x] PEP8, 120 line length 
+* [x] CHARLOTTE: make the `get_config` function into configurable parsers definable in YAML
+* [x] charlotte: replace 'null' response from NSQ for missing IP w/ actual None response
+* [x] charlotte: move crawl strategy exceptions (frontend-router) into charlotte
+* [x] charlotte: move blocking logic to charlotte
+* [x] charlotte: rename crawl_strategy -> crawl_provider on Node()
+* [x] charlotte: move service_name_rewrite to charlotte
+* [x] rename protocol_detail -> protocol_mux
+* [x] CHARLOTTE: --skip-{name} arguments
+* [x] --skip-defunct -> --hide-defunct
+* [x] refactor database named matching to port matching
+* [x] move skip services from globals to argparse
+* [x] move crawl_complete, name_lookup_complete to node.py
+* [x] charlotte config 1 file to directory of yaml files
+* [x] create default yaml file for argparse
+* [x] rename `ip` -> `instance_address`
+* [x] remove crawl strategy object from Node, denormalize (protocol, blocking)
+* [x] merge hints into pre-existing children w/ unknown address
+* [x] CORE: add sub commands for ['crawl', 'render-json']
+* [x] CORE (OSS): unit tests tests tests (round I - excluding `provider_*.py` and `crawl.py`)
+
+
+# V2.bugs
+* [x] BUG: nsq channels on same node are not grouping, again!
+* [x] there is a regression in cycle detection - spider against async-cake-handler to repro
+* [x] trim double quotes from service_name
+* [x] BUG: crawl of well known port is discovering random connections to frontend-routers, ELBs  - fixed by chris r. source ephemeral port filter
+* [x] `'CYCLE': f"service '{node['service_name']}' discovered as a parent of itself!",`
+* [x] paramiko nested exception outputting
+* [x] handle actually null (absent value) nsq consumer in addition to string literal "null"
+* [x] ascii renderer grouping by detail is persisting in memory (groupings)
+* [x] charlotte: move name parser exceptions (mysql-main) into charlotte
+* [x] we see many repeating group by service-name NSQ topic/channels repeating in ascii renderer
+* [x] catch timeout for crawling children
+* [x] remove trailing `_` from node_ref
+* [x] graphviz blocking is backwards
+* [x] regression defunct in parser check on num_connections == 0 is failing
+* [x] differentiate RDS databases found in AWS - currently all show as `rdsnetworkinterface`
+* [x] BUG: add __type__ to json serialization - currently brittle: key-ing off of random fields for deserialization
+* [x] infinite recursion bug introduced by the crawl hints. it had to do with the cached_nodes in crawl.py being a by_ref object; a deep-ish copy fixed it
+* [x] trying to crawl json that was outputted with --depth arg results in hanging `wait_for_crawl` to complete on nodes
+
+# V3 Kubernetes++
+* [x] CRAWL: kubernetes - take a hint
+* [x] CRAWL: kubernetes - name lookup, crawl
+* [x] support EKS cluster in a different AWS account than provider_aws
+
+
+# V3.refactor
+* [x] static code analysis (prospector) and forthcoming changes
+* [x] refactor providers to objects, remove SSH logic from crawl.py
+* [x] caching children in crawl.py instead of providers!!
+* [x] fix TIMEOUT logic
+* [x] put provider_args back in crawl strategies! use **kwargs to pass args in code
+* [x] rewrite provider registration
+* [x] move provider constant refs from constants.py into providers
+* [x] rename errors.NULL_IP NULL_ADDRESS
+* [x] refactor signature of `crawl_downstream` to include address
+* [x] replace pass through node_ref in crawl w/ `zip()`
+* [x] unit tests for crawl, providers, provider_*?
+* [x] validate that crawl strategies are only used for specified providers
+* [x] refactor lookup_name to remove life360 business logic from providers!
+* [x] remove ProviderInterface::configure(), have ssh configure itself on first query
+* [x] seed provider is configurable command line arg w/
+
+# V3.features
+* [x] FEATURE: make instance_provider args for aws hints part of a refactored "profile"
+* [x] FEATURE: Distinguish kubernetes service shape in graphviz
+* [ ] add --stop-on-nonblocking CLI arg
+
+# V3.bugs
+* [x] not respecting CrawlStrategy.providers
+* [x] need to be able to configure different AWS profile for k8s/eks than for aws! (for dev)
+* [x] BUG: intermittent timeout exceptions which do not result in program exit
+
+# V4.VOSS
+* [x] REFACTOR: (providers): providers as plugin architecture
+* [x] REFACTOR (spider): --concurrency -> --ssh-concurrency OR provider args
+* [x] REFACTOR: (all): refactor package architecture
+* [x] TIMEOUT: (crawl) robust provider timeout and exception handling
+* [x] OBSCURIFIER (render_*): obscurifier for output
+* [ ] LOGGER: rewrite logger access for community standards
+* [ ] PLUGPLAY: out of the box functionality by moving TCP to a "builtin" CrawlStrategy and using `hostname` or default service name
+* [ ] REFACTOR: (providers): rewrite take_a_hint to not return a list, just return a single NodeTransport
+* [ ] DOCS: rewrite docs in sphinx style and prepare for export to readthedocs.org
+
+# Backlog
+
+## New Features
+
+## Core
+* [ ] RENDER_PLUGINS: make renderers an abstract class w/ plugins
+* [ ] REFACTOR: move seed logic out of ./spider.py
+* [ ] REFACTOR: revisit the Node{Protocol, CrawlStrategy, protocol_mux} object relationship strategy
+* [ ] FEATURE: track whether a node was skipped for crawling and display as such in graphviz
+* [ ] REFACTOR: move errors/warnings to a global config
+* [ ] REFACTOR: do not block crawl() on lookup_name() in main crawl loop.  will speed up many times
+* [ ] REFACTOR: move mutex from provider_ssh to crawl.py
+* [ ] BUG: intermittent timeouts crawling the whole tree - add retry to lookup_name/crawl_downstream?
+* [ ] BUG: remove `blocking` from CrawlStrategy - it should only be in Protocol
+* [ ] BUG: where is `elasticache-time-points`? crawl-netstat only takes 1 ip per port, so for async-soa which has 2 downstreams on 6379, it can't find
+* [ ] BUG: where is `cx-dvb`?? 
+* [ ] REFACTOR: consolidate Node::crawl_complete and crawl.py::_crawlable()
+* [ ] BUG: required args showing as optional in --help
+
+## Render Ascii
+* [ ] FEATURE: merge hints in ascii output
+
+## Render Graphviz
+* [ ] FEATURE: multiple seeds display with equal ranking
+* [ ] FEATURE: nsq topics as nodes rather than edges
+* [ ] FEATURE: visualize cycles
+* [ ] FEATURE: different visualization for cache vs database
+* [ ] FEATURE: create a legend
+
+## Render JSON
+
+## Render New
+* [ ] DISPLAY: output in vizceral format
+* [ ] DISPLAY: 'diff' run on multiple seed nodes and diff!
+
+## CrawlStrategies
+* [ ] BUG: HAProxy: functionality to detect bad HAProxy Config as a crawl error was lost in async refactor  `if stdout.startswith('ERROR:'): return 'CRAWL ' + stdout.replace("\n","\t"), {}`
+* [ ] BUG: NSQ: misconfigured clients have null server (this is why we don't see rattail -> relapse), investigate & resolve
+* [ ] FEATURE: Netstat: use matchAddress for HAProxy crawl strategies to avoid timeout to RDS hostnames
+* [ ] FEATURE: crawl downstream - ability to specify more providers args per provider (so that k8s can selectively crawl containers)
+
+
+## Provider SSH
+* [ ] FEATURE: revisit whether `occupy_one_sempahore_space` is working (to dynamically configure --concurrency) 
+* [ ] FEATURE: still getting ssh connection errors sometimes without --concurrency=10
+* [ ] FEATURE: configurable "~/.ssh/config" SSH profile
+* [ ] REFACTOR (provider_ssh): we shouldn't use known_hosts=None for security reasons
+
+## Provider AWS
+* [ ] FEATURE: lookup_name is slow, use async
+* [ ] CRAWL: dynamodb
+* [ ] CRAWL: SQS
+
+## Charlotte
+* [ ] FEATURE (charlotte): yaml validation by schema
+
+## Web
+
+# Trash Can
+* [ ] backwards compatibility for haproxy w/out stats socket
+* [ ] detect live traffic netstat/tcpdump/ebpf? (this was solved by using haproxy stats)
+* [ ] remove crawl_strategy from Node()
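
Note on the V2.async items in TODO.md ("limit concurrency w/ semaphore", "retry ssh connection 3 times"): the two ideas can be combined with plain `asyncio`, roughly as sketched below. This is not itsybitsy's code. `with_retry`, `demo`, and the fake `connect` coroutines are hypothetical names, and a simulated `ConnectionError` stands in for a real SSH failure.

```python
import asyncio

MAX_ATTEMPTS = 3  # assumption: mirrors "retry ssh connection 3 times" from the TODO


async def with_retry(connect, semaphore, attempts=MAX_ATTEMPTS):
    """Await `connect()` while holding the semaphore, retrying on OSError."""
    last_error = None
    for _ in range(attempts):
        async with semaphore:  # bounds how many connections are in flight
            try:
                return await connect()
            except OSError as exc:  # ConnectionError is a subclass of OSError
                last_error = exc
    raise last_error


async def demo():
    semaphore = asyncio.Semaphore(2)  # hypothetical concurrency limit
    in_flight = 0
    peak = 0
    calls = {}

    def make_connect(host):
        async def connect():
            nonlocal in_flight, peak
            calls[host] = calls.get(host, 0) + 1
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await asyncio.sleep(0.01)  # stand-in for network latency
                if host == "host-b" and calls[host] == 1:
                    raise ConnectionError("simulated first-attempt failure")
                return f"connected:{host}"
            finally:
                in_flight -= 1
        return connect

    hosts = ["host-a", "host-b", "host-c", "host-d"]
    results = await asyncio.gather(
        *(with_retry(make_connect(h), semaphore) for h in hosts)
    )
    return results, peak, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The semaphore is released between attempts, so a failing host does not hold a slot while it waits for its next try.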