Product Rankings with Bayesian Estimation

To demonstrate using Bayesian estimation for product rankings versus ranking products by their average rating, I scraped product data from Sephora's website. You can read more about Bayesian estimation on my website

Scraping the Data

To run the web scraper, fork or download this repository and navigate to the folder in a terminal window. Then type

python sephora_scraper.py

The web scraper takes a minute or two to run. The output will be saved to a pickle file called sephora_data.pkl.

Since Sephora's website utilizes Angularjs, BeautifulSoup does not work to scrape the data. I solved this by parsing through the raw html string using regular expressions. For example, line 27 in sephora_scraper.py:

products = unicode(re.search(r"(?<=products).*?(?=meta)", html).group(0)).replace('}}],"', '}}').replace('":[{', '{')

This line takes a slice of the html string between the words products and meta, then deletes some of the bracket junk at the start and end of the returned string.

products appears to be a JSON object loaded with the web page. In lines 29 to 48 of sephora_scraper.py, I split the JSON string into seperate products, then read the data for each product using the python json package, and save the information that I want to use into a python dictionary.

Lines 50 to 63 then loads each product page to add additional data to the dictionary that isn't available on the product search page, and saves out the dictionary to a pickle file.

Transforming the Data

Start a Jupyter notebook with the command

ipython notebook

Open the notebook sephora_data_clean.ipynb. Running the cells in the notebook will

read in the pickled data file
transform the data into a pandas dataframe
add a column for the Bayesian estimate
output two csv files, sephora.csv and sephora-table.csv

The two csv files are utilized by the javascript file, sephora.js, to create a table and some d3 charts. If you run the web scraper and Jupyter notebook, you will overwrite the csv files downloaded with this repository.

Visualizations

The code for the data visualizations is in the visualizations folder. To view in a web browser, navigate to this folder in a terminal window and start a local server with the command

python -m SimpleHTTPServer

then open a web browser and navigate to localhost:8000.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
visualizations		visualizations
README.md		README.md
sephora_data_clean.ipynb		sephora_data_clean.ipynb
sephora_scraper.py		sephora_scraper.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Product Rankings with Bayesian Estimation

Scraping the Data

Transforming the Data

Visualizations

About

Releases

Packages

Languages

emschuch/Product_Rankings_with_Bayesian_Estimation

Folders and files

Latest commit

History

Repository files navigation

Product Rankings with Bayesian Estimation

Scraping the Data

Transforming the Data

Visualizations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages