Extracting Data from PDFs with Python

Ever found yourself with a collection of information-rich PDFs that you wished you could easily combine into an analysis-ready dataset? Join Johns Hopkins Data Services in this Data Bytes session as we provide an overview of the kinds of data that may be present in PDFs and demo several Python packages that can be used to extract and combine it.

JHU Data Services

Website: dataservices.library.jhu.edu/
Contact us: dataservices@jhu.edu
JHU Data Services, part of the Johns Hopkins University Sheridan Libraries, helps the JHU community find, use, visualize, manage, and share data. We offer live webinars and self-paced online trainings on computational research and coding, GIS, data management, data visualization, and more. See all of our training topics on our website.

This repository contains materials for one of our live webinars open to JHU students, faculty, and staff. Please contact us with any questions.

As of March 2020, Data Services workshops are being held virtually on Zoom. See our calendar to register for upcoming workshops.

Pre-Class Instructions

This workshop is intended to be a live demo and not a hands-on workshop.

If you would like to follow along on your own, use our Python workshop series installation instructions to install Python and the following dependencies:

pandas
pdfplumber
pypdf
PyMuPDF
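
If you want to confirm that the dependencies above are available before the demo, a check along these lines may help. It is a minimal sketch and not part of the workshop materials: the `REQUIRED` mapping and the `check_dependencies` function are names made up for this example. It assumes PyMuPDF is importable as `pymupdf` (newer releases) or `fitz` (older releases); note that an unrelated package also named `fitz` exists, so a positive result for that name is only a hint.

```python
from importlib.util import find_spec

# Hypothetical mapping for this sketch: package name as listed above
# -> module name(s) it is assumed to be importable under.
REQUIRED = {
    "pandas": ["pandas"],
    "pdfplumber": ["pdfplumber"],
    "pypdf": ["pypdf"],
    "PyMuPDF": ["pymupdf", "fitz"],  # assumption: either import name may apply
}


def check_dependencies(required=REQUIRED):
    """Return {package name: True/False} without importing the packages."""
    return {
        package: any(find_spec(module) is not None for module in modules)
        for package, modules in required.items()
    }


if __name__ == "__main__":
    # Print one line per package so missing ones are easy to spot.
    for package, installed in check_dependencies().items():
        print(f"{package}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can then be installed by following the installation instructions mentioned above.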

Description of Files

Data: This folder contains the PDFs used in the demo
In-ClassScripts: This folder contains the Jupyter notebook with the code written in the demo
- Extracting Data from PDFs with Python.ipynb
PresentationMaterials: This folder contains a PDF of the slides and the Quarto files used to generate them.

Post-Class Survey

If you have taken the live webinar for this class, please take this survey: bit.ly/survey-data-bytes

License and Terms of Use

The presentation materials are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), attributable to Data Services, Johns Hopkins University.

See LICENSE file for additional code licensing and re-use information.

The images, external resources, and cheatsheets linked in this repository may have other licenses and terms of use.

Citation

Please cite this material as:
Johns Hopkins University Data Services. October 7, 2024. Extracting Data from PDFs with Python. [https://github.com/jhu-data-services/data-bytes-extracting-pdf-data-python]
