Ever found yourself with a collection of information-rich PDFs that you wished you could easily combine into an analysis-ready dataset? Join Johns Hopkins Data Services in this Data Bytes session as we provide an overview of the kinds of data that may be present in PDFs and demo several Python packages that can be used to extract and combine that data.
Website: dataservices.library.jhu.edu/
Contact us: dataservices@jhu.edu
JHU Data Services, part of the Johns Hopkins University Sheridan Libraries, helps the JHU community find, use, visualize, manage, and share data. We offer live webinars and self-paced online trainings on computational research and coding, GIS, data management, data visualization, and more. See all of our training topics on our website.
This repository contains materials for one of our live webinars open to JHU students, faculty, and staff. Please contact us with any questions.
As of March 2020, Data Services workshops are being held virtually on Zoom. See our calendar to register for upcoming workshops.
This workshop is intended to be a live demo and not a hands-on workshop.
If you would like to follow along on your own, you can use our Python workshop series installation instructions to install Python, and then install the following dependencies:
- pandas
- pdfplumber
- pypdf
- PyMuPDF
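If you are following along on your own, it can help to confirm that the dependencies above are importable before opening the notebook. The snippet below is a minimal sketch and not part of the workshop materials: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here for illustration. It relies only on the standard library, and it assumes that PyMuPDF is importable as `pymupdf` (recent releases) or `fitz` (older releases).

```python
from importlib.util import find_spec

# Hypothetical mapping: package name as installed -> module name(s) it may be imported as.
# PyMuPDF is imported as "pymupdf" in recent releases and as "fitz" in older ones.
REQUIRED = {
    "pandas": ("pandas",),
    "pdfplumber": ("pdfplumber",),
    "pypdf": ("pypdf",),
    "PyMuPDF": ("pymupdf", "fitz"),
}


def missing_packages(required=REQUIRED):
    """Return the names of packages for which no importable module was found."""
    return [
        name
        for name, modules in required.items()
        if not any(find_spec(module) is not None for module in modules)
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: python -m pip install " + " ".join(missing))
    else:
        print("All workshop dependencies appear to be installed.")
```

The check only looks up whether each module can be located; it does not import the packages or verify their versions, so it finishes quickly even in a large environment.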
- Data: This folder contains the PDFs used in the demo.
- In-ClassScripts: This folder contains a Jupyter notebook with the code written in the demo:
  - Extracting Data from PDFs with Python.ipynb
- PresentationMaterials: This folder contains a PDF of the slides and the Quarto files used to generate them.
If you have taken the live webinar for this class, please take this survey: bit.ly/survey-data-bytes
The presentation materials are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), attributable to Data Services, Johns Hopkins University.
See LICENSE file for additional code licensing and re-use information.
The images, external resources, and cheatsheets linked in this repository may have other licenses and terms of use.
Please cite this material as:
Johns Hopkins University Data Services. October 7, 2024. Extracting Data from PDFs with Python. [https://github.com/jhu-data-services/data-bytes-extracting-pdf-data-python]