This repository grew out of teaching data science to students of various backgrounds and out of my practice in industry. I aspire to contribute to the understanding of this complex landscape and to teach people how to navigate it, develop valuable skills, and become more effective at problem-solving.
As outlined in the course website, we'll be contemplating in the library and engineering in the trenches, so here are lecture thumbnails, along with suggested practices and readings. I recommend starting your journey with the statistical fundamentals, as I re-contextualize and build on top of them towards more sophisticated, yet interpretable, models that can aid decision-makers. To learn more about the interdisciplinary approach to decision-making, read the course philosophy.
- Module I: Business Decisions and Data Science
- Module II: Probability and Statistics Fundamentals
- Module III: A/B Testing and Experiment Design
- Module IV: Bayesian Hierarchical Models
- Module V: Machine Learning and Special Topics
- Module VI: Full Stack Data Apps
The repository will go through many changes as we go through the journey together, but you can get a sneak peek of what it's about in the /playground directory.
The slides for the first, short module are complete in /slides and will be moved to and published on the course website repo.
- First, I justify why -- what we really want is "decision science"
- What is the course about and why should you care?
- Conversations about industries, domains, and applications
- Teaching approach, learning how to learn, and course philosophy
(Fig. 1) Learn what Pollock and Picasso have to do with statistics and ML.
(Fig. 2) Learn how everything you learned before fits together into a coherent whole.
The second lecture is also conceptual, as we explore and articulate the hard choices businesses face. I then bring some clarity to the big, interdisciplinary picture of AI.
- It is important to understand AI in the context of business decisions and strategy
- Read here for the difference between Analytics, Statistics, and ML.
- The lecture is filled with hard-learned lessons and multiple tools for figuring out a good strategy, both for the business and for AI
- R. Rumelt - The perils of bad strategy (McKinsey, 2011)
- K. Pretz - Stop Calling Everything AI, Machine-Learning Pioneer Says
- M. Jordan - Artificial Intelligence: The Revolution Hasn’t Happened Yet
First, you have to be confident and comfortable with your local development tooling. Invest an hour in understanding conda and typing in the commands yourself -- it will benefit you for a decade ahead!
- Walk through this tutorial: "Introduction to conda for (data) scientists". It will serve you well for exploration and experimentation.
    - For projects more focused on building data-driven applications, we will use `pip` and `poetry`.
    - We can use `conda` just for virtual environments and not for package management and dependency resolution / tracking.
    - Therefore, one has to pick an optimal approach for each project. Not great, but could be worse (as in `npm`).
- Read this old, but still relevant blog post about "Conda: Myths and Misconceptions"
- Read these two introductory articles on modules and packages
- Absolute vs Relative imports by Mbithe Nzomo
- Python Modules and Packages – An Introduction by John Sturtz
- IMPORTANT! For those of you working on Windows 10/11, here's the best set-up I know of, which involves WSL2. Here are the instructions
- Functional programming ideas in the context of numpy and pandas (see the sketch right after this list)
- The great and terrible matplotlib
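To make the functional-programming point a bit more tangible, here is a minimal sketch of what that style can look like with numpy and pandas: small, pure transformations composed via `.pipe()` and method chaining, rather than mutating a DataFrame step by step. The data and column names are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# toy data, invented purely for illustration
grades = pd.DataFrame({
    "student": ["ana", "bob", "chris", "dana"],
    "score": [7.5, 9.0, np.nan, 8.2],
})

def fill_missing(df: pd.DataFrame, value: float) -> pd.DataFrame:
    """Pure function: returns a new DataFrame instead of mutating the input."""
    return df.assign(score=df["score"].fillna(value))

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Adds a z-score column computed from the filled scores."""
    z = (df["score"] - df["score"].mean()) / df["score"].std()
    return df.assign(z_score=z)

# compose the transformations into a readable, top-to-bottom pipeline
result = (
    grades
    .pipe(fill_missing, value=grades["score"].mean())
    .pipe(standardize)
    .sort_values("z_score", ascending=False)
)
print(result)
```

The payoff is that each step is a named, testable function, and the pipeline reads top to bottom like a description of the analysis.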
(Fig. 10) Practicing the tools for modeling and operationalization of models.
(Fig. 11) Getting comfortable with the idea of literate programming and learning the tools which make this whole zoo of technologies run harmoniously.
The third lecture is also conceptual, but in a more mathematical sense, as I attempt to build the bridge between reality and the language of uncertainty (probability theory).
(Fig. 3) How many people will show up to the safari? Notebook here.
(Fig. 4) We discussed the importance of visual storytelling: relevance, persuasiveness, truthfulness, and aesthetics.
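If you want a feel for the kind of question in Fig. 3 before opening the notebook, here is a minimal, self-contained simulation. I'm assuming a simple binomial model (each of the n people who booked shows up independently with probability p); the actual notebook may frame the problem differently, and the numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical numbers, chosen only for illustration
n_booked = 50      # people who booked the safari
p_show = 0.85      # assumed probability that any one person shows up
n_sims = 100_000   # number of simulated days

# simulate how many people show up on each hypothetical day
attendance = rng.binomial(n=n_booked, p=p_show, size=n_sims)

print("expected attendance:", attendance.mean())
print("P(more than 45 show up):", (attendance > 45).mean())
```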
- Read about a few fundamental ideas and concepts in probability and why we need them here
- To assess if you need a refresher over probability and statistics, look at this study guide
There are three amazing resources which you can use as reference and inspiration for introductory to intermediate probability and mathematical statistics. All three have recorded video lectures; the first two also come with a freely available book and code:
- Probability 110 by Joe Blitzstein (Harvard), with R code. Great stories behind probabilities, numerous examples of applications, and accessible proofs.
- Probability for Data Science by Stanley Chan (Purdue), with python code. Amazing graphics, visualizations, accessible and extensive mathematical treatment.
- Probability by Santosh Venkatesh (University of Pennsylvania), once available on Coursera, now on YouTube. Great real-world examples from numerous domains and a gentle build-up towards more complicated concepts. Unfortunately, no code or book -- but you can combine this playlist with one of the above.
If you have `conda` installed on Linux, macOS, or WSL2 on Windows, the easiest way to play around with the notebook is to recreate the environment from the yml file. Then, you can either create a kernel or connect from VSCode notebooks to the environment and start hacking.
```bash
git clone https://github.com/bizovi/decision-making.git
cd decision-making/playground

conda env create --file conda-env.yml
conda activate gpa-prob

# if using jupyter lab, register the environment as a kernel
python -m ipykernel install --user \
    --name="gpa-kernel" \
    --display-name="Kernel for Simulations"

# run the test suite and see if everything works as expected
python -m pytest
```
Statistics is the art and science of changing your mind and actions in the face of evidence. We're going to declare our assumptions and apply Bayes' theorem to weigh the information from the data against our prior beliefs.
We're still in the land of probability and generative models, but a step closer to making inferences about parameters and latent quantities in order to answer the research questions.
(Fig. 5) Bayes' theorem and rare diseases: inverse probabilities and conditioning. Notebook here.
(Fig. 6) How confident am I that the code has no bugs after x tests pass? Grids and point estimates.
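As a quick companion to Fig. 5, here is a minimal sketch of the inverse-probability calculation behind the rare-disease example. The prevalence, sensitivity, and specificity values are hypothetical and not taken from the notebook.

```python
# P(disease | positive test) via Bayes' theorem, with made-up numbers
prevalence = 0.001   # P(disease) -- a hypothetical rare disease
sensitivity = 0.99   # P(positive | disease)
specificity = 0.95   # P(negative | no disease)

# total probability of testing positive
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: most positives come from the healthy majority
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.3f}")  # ~0.019
```

Even with a very accurate test, the posterior probability stays low because the disease is so rare -- the counter-intuitive point about inverse probabilities and conditioning.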
It's time we move away from point estimates towards a full posterior distribution, which captures the uncertainty in our estimates and can be used to make predictions about the observable quantities.
A few important ideas to add to your conceptual understanding:
- Parameter (estimand), estimator, estimation
- De Moivre: "The most dangerous equation": are U.S. schools too big?
- What does a statistician want? Properties of estimators.
- Most practical applications won't have an analytic solution, so we have to use a probabilistic programming language like pymc to draw samples from the posterior (see the sketch right after this list)
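To make that last point concrete, here is a minimal sketch of posterior sampling with pymc for a toy beta-binomial model (7 successes out of 10 trials under a flat prior). The model and numbers are illustrative rather than taken from the course notebooks, and the code assumes a recent pymc (v4+), where `pm.sample` returns an ArviZ `InferenceData` object.

```python
import arviz as az
import pymc as pm

with pm.Model() as model:
    # flat Beta(1, 1) prior over the unknown success probability
    theta = pm.Beta("theta", alpha=1.0, beta=1.0)
    # likelihood: 7 successes observed out of 10 trials
    y = pm.Binomial("y", n=10, p=theta, observed=7)
    # draw samples from the posterior with the default NUTS sampler
    idata = pm.sample(2000, tune=1000, chains=4, random_seed=42)

print(az.summary(idata, var_names=["theta"]))
```

This toy model actually has a conjugate, analytic posterior -- Beta(8, 4) -- which makes it a convenient sanity check for the sampler before moving on to models where no closed form exists.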
(Fig. 7) The greatest theorem never told, adapted and refactored from Cam Davidson-Pilon (upcoming!).
(Fig. 8) Conjugate priors and the idea of Bayesian updating. Full luxury Bayes: automatic sampling, thoughtful modeling.