Extracting Data from PDFs with Python

Ever found yourself with a collection of information-rich PDFs that you wished you could easily combine into an analysis-ready dataset? Join Johns Hopkins Data Services in this Data Bytes session as we provide an overview of the kinds of data that may be present in PDFs and demo several Python packages that can be used to extract and combine it.

JHU Data Services

Website: dataservices.library.jhu.edu/
Contact us: dataservices@jhu.edu
JHU Data Services, part of the Johns Hopkins University Sheridan Libraries, helps the JHU community find, use, visualize, manage, and share data. We offer live webinars and self-paced online trainings on computational research and coding, GIS, data management, data visualization, and more. See all of our training topics on our website.

This repository contains materials for one of our live webinars open to JHU students, faculty, and staff. Please contact us with any questions.

As of March 2020, Data Services workshops are being held virtually on Zoom. See our calendar to register for upcoming workshops.

Pre-Class Instructions

This workshop is intended to be a live demo and not a hands-on workshop.

If you would like to follow along on your own, use our Python workshop series installation instructions to install Python, then install the following dependencies:

  • pandas
  • pdfplumber
  • pypdf
  • PyMuPDF
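
If you want to confirm that the four packages above are available before the demo, a small check along these lines may help. This is a minimal sketch and not part of the workshop materials: the `DEPENDENCIES` mapping and the `missing_dependencies` helper are hypothetical names introduced here. Note that PyMuPDF is imported as `fitz` (recent releases also accept `pymupdf`), so its import name differs from its install name.

```python
from importlib.util import find_spec

# Install name -> import name(s) to look for.
# Assumption: PyMuPDF is importable as `pymupdf` (newer releases) or `fitz`.
DEPENDENCIES = {
    "pandas": ("pandas",),
    "pdfplumber": ("pdfplumber",),
    "pypdf": ("pypdf",),
    "PyMuPDF": ("pymupdf", "fitz"),
}


def missing_dependencies(dependencies=DEPENDENCIES):
    """Return the install names for which no import name can be found."""
    return [
        install_name
        for install_name, import_names in dependencies.items()
        if not any(find_spec(name) is not None for name in import_names)
    ]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        # Suggest a pip command; adapt it if you use conda or another manager.
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All workshop dependencies are importable.")
```

The check only looks up whether each module can be located; it does not import the packages or verify their versions.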

Description of Files

  • Data: This folder contains the PDFs used in the demo.
  • In-ClassScripts: This folder contains a Jupyter notebook with the code written in the demo.
    • Extracting Data from PDFs with Python.ipynb
  • PresentationMaterials: This folder contains a PDF of the slides and the Quarto files used to generate them.
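
To give a rough idea of how the PDFs in the Data folder could be combined into a single dataset, here is a minimal sketch using pdfplumber and pandas. It is not the code from the workshop notebook: the function names (`list_pdfs`, `extract_page_text`, `combine_pdfs`) and the output columns are hypothetical, and the sketch assumes the PDFs contain extractable text rather than scanned images.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )


def extract_page_text(pdf_path):
    """Return one record per page: file name, page number, extracted text."""
    import pdfplumber  # imported here so list_pdfs works without it installed

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            records.append({
                "file": Path(pdf_path).name,
                "page": number,
                # extract_text() can return nothing for pages without text
                "text": page.extract_text() or "",
            })
    return records


def combine_pdfs(folder):
    """Build one DataFrame with a row for every page of every PDF."""
    import pandas as pd

    records = []
    for pdf_path in list_pdfs(folder):
        records.extend(extract_page_text(pdf_path))
    return pd.DataFrame(records, columns=["file", "page", "text"])


if __name__ == "__main__":
    data_dir = Path("Data")  # assumes you run this from the repository root
    if data_dir.is_dir():
        print(combine_pdfs(data_dir).head())
```

For PDFs that contain tables, pdfplumber's `page.extract_tables()` could be used in place of `page.extract_text()`; which approach fits depends on how each PDF is laid out.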

Post-Class Survey

If you have taken the live webinar for this class, please take this survey: bit.ly/survey-data-bytes

License and Terms of Use

The presentation materials are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), attributable to Data Services, Johns Hopkins University.

See LICENSE file for additional code licensing and re-use information.

The images, external resources, and cheatsheets linked in this repository may have other licenses and terms of use.

Citation

Please cite this material as:
Johns Hopkins University Data Services. October 7, 2024. Extracting Data from PDFs with Python. [https://github.com/jhu-data-services/data-bytes-extracting-pdf-data-python]
