87 Scraping data from a PDF

While the PDF format is a convenient replacement for paper, with complex permissions and security options, it can present barriers to accessing and manipulating data. An example is the report How COVID is Changing the World, A Statistical Perspective: although the document is licensed CC-BY, all of its data tables are ‘trapped’ in the PDF format. Rather than manually entering the data tables into a spreadsheet, in this activity you will scrape data tables from a PDF into a spreadsheet format.


For this activity, you will free data tables from PDFs and create a CSV file or an Excel sheet with the data.

Find a PDF: Find a document online that is openly licensed but available only in PDF form. If you cannot find a PDF you are interested in, use How COVID is Changing the World.

Download Tabula:

  1. Download the version of Tabula for your operating system:
  2. Extract the zip file. (Instructions: Windows, Mac)
  3. Go into the folder you just extracted. Run the “Tabula” program inside.
  4. A web browser will open. If it doesn’t, open your web browser, and go to http://localhost:8080. There’s Tabula!
  5. Upload the PDF and extract the data.
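
Once a table has been extracted and saved as CSV, it may still need light cleanup before it goes into a spreadsheet, for example stray spaces around cell values or rows that came out empty. The following is a minimal Python sketch of that cleanup, using only the standard library. It is not part of Tabula: the function names and the sample data are hypothetical, chosen for illustration, and a real export may need different handling.

```python
import csv
import io


def clean_tabula_csv(text):
    """Tidy CSV text such as a table exported from a PDF.

    Strips surrounding whitespace from every cell and drops rows
    in which every cell is empty.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if any(cells):  # keep only rows with at least one non-empty cell
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize cleaned rows back to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for an exported table (not real data).
sample = 'Country , Cases\n,\n Canada,"1,200"\n'
cleaned = clean_tabula_csv(sample)
print(rows_to_csv(cleaned))
```

To use this on a real file, read the exported CSV into a string, pass it to clean_tabula_csv, and write the result of rows_to_csv to a new file that you can then open in Excel or import into Google Sheets.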

 

Resources

Tabula: Scrape Data Tables from PDFs

School of Data Tutorial: Using Tabula to Scrape Data

 

 

Complete this Activity

After you complete this assignment, either export the data, import it into Google Sheets, and share links to the original PDF and the sheet in the comment box below, or simply copy and paste one of the data tables into the comment box below.

Image Credit: Featured image: On Videotape by Mitchell Joyce (CC BY-NC 2.0)

 

License

Program for Open Scholarship and Education Copyright © by will. All Rights Reserved.
