87 Scraping data from a PDF
While the PDF format is a convenient replacement for paper, with complex permissions and security options, it can present barriers to accessing and manipulating data. An example of this is the report How COVID is Changing the World, A Statistical Perspective: although the document is licensed CC-BY, all of the data tables are ‘trapped’ in the PDF format. Rather than manually entering the data tables into a spreadsheet, in this activity you will scrape the data tables out of a PDF.
For this activity, you will be freeing data tables from PDFs and creating a CSV or an Excel sheet with the data. We will be using Tabula, a free tool for extracting data tables from PDF files.
Find a PDF: Find a document online that is openly licensed but only available as a PDF. If you cannot find a PDF you are interested in, use How COVID is Changing the World.
Download Tabula:
- Download the version of Tabula for your operating system:
- Windows: tabula-win.zip
- Mac OS X: tabula-mac.zip
- Linux/Other: tabula-jar.zip, view README.txt inside for instructions
- Extract the zip file. (Instructions: Windows, Mac)
- Go into the folder you just extracted. Run the “Tabula” program inside.
- A web browser will open. If it doesn’t, open your web browser, and go to http://localhost:8080. There’s Tabula!
- Upload the PDF and extract the data.
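If you would rather script the extraction than click through the Tabula interface, the steps above can be sketched in Python. This is only a minimal sketch under stated assumptions: it relies on the separate tabula-py package (installed with `pip install tabula-py`, and requiring Java on your machine), and the helper names `csv_path_for` and `extract_tables` are made up for illustration. Automatic table detection may not match what you would select by hand in Tabula, so check the results.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_index):
    """Build an output name such as report_table_1.csv for the Nth table."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table_{table_index + 1}.csv")


def extract_tables(pdf_path):
    """Save every table tabula-py detects in the PDF as its own CSV file."""
    # Assumption: tabula-py is installed and Java is available.
    # Imported here so csv_path_for still works without tabula-py.
    import tabula

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)

    written = []
    for i, table in enumerate(tables):
        out = csv_path_for(pdf_path, i)
        table.to_csv(out, index=False)  # one CSV per table, no index column
        written.append(out)
    return written
```

Calling `extract_tables("report.pdf")`, where `report.pdf` stands in for whichever PDF you downloaded, would write files such as `report_table_1.csv` next to the PDF.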
Complete this Activity
After you complete this assignment, please either export the data and import it into Google Sheets, then share links to the original PDF and the sheet in the comment box below, or simply copy and paste one of the data tables into the comment box below.
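Before importing or pasting, it can help to tidy the exported CSV, since extracted tables may contain stray spaces or blank rows. The following is a minimal sketch using only the Python standard library; the function names `clean_rows` and `clean_csv_text` are hypothetical, and the cleanup rules are assumptions about what a typical export needs, not a description of what Tabula produces.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace from each cell and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):  # keep the row only if at least one cell has text
            cleaned.append(cells)
    return cleaned


def clean_csv_text(text):
    """Return cleaned CSV text, ready to paste or import into a spreadsheet."""
    rows = clean_rows(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

For example, passing the made-up text `" Country , Cases \n,\nA, 10 \n"` to `clean_csv_text` trims the padding and removes the empty middle row.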
Image Credit: The featured image uses On Videotape by Mitchell Joyce (CC BY-NC 2.0).