{"id":303,"date":"2021-06-09T18:50:33","date_gmt":"2021-06-09T22:50:33","guid":{"rendered":"https:\/\/pressbooks.bccampus.ca\/pose2\/chapter\/scraping-data-from-a-pdf\/"},"modified":"2021-07-09T17:53:15","modified_gmt":"2021-07-09T21:53:15","slug":"scraping-data-from-a-pdf","status":"publish","type":"chapter","link":"https:\/\/pressbooks.bccampus.ca\/pose2\/chapter\/scraping-data-from-a-pdf\/","title":{"raw":"Scraping data from a PDF","rendered":"Scraping data from a PDF"},"content":{"raw":"<!-- wp:group -->\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:group -->\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:group {\"className\":\"emphasisbox\"} -->\r\n<div class=\"wp-block-group emphasisbox\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:heading {\"level\":4} -->\r\n<h4>While the PDF format is a convenient replacement for paper, with complex permissions and security options, it can present barriers to accessing and manipulating data. An example is the report <a href=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\" data-type=\"URL\" data-id=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\">How COVID is Changing the World, A Statistical Perspective<\/a>: although the document is licensed CC-BY, all of the data tables are 'trapped' in the PDF format. 
Rather than manually entering the data tables into a spreadsheet, in this activity you will scrape the data tables out of a PDF.<\/h4>\r\n<!-- \/wp:heading --><\/div>\r\n<\/div>\r\n<!-- \/wp:group -->\r\n\r\n<!-- wp:separator --><hr class=\"wp-block-separator\" \/><!-- \/wp:separator -->\r\n\r\n<!-- wp:group -->\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:columns -->\r\n<div class=\"wp-block-columns\"><!-- wp:column {\"width\":66.66} -->\r\n<div class=\"wp-block-column\" style=\"flex-basis: 66.66%\"><!-- wp:paragraph -->\r\n<p>For this activity, you will free data tables from PDFs and create a CSV file or an Excel sheet with the data. We will be using Tabula to do this.<\/p>\r\n<!-- \/wp:paragraph -->\r\n\r\n<!-- wp:paragraph -->\r\n<p><strong>Find a PDF<\/strong>: Find a document online that is openly licensed but available only in PDF form. If you cannot find a PDF you are interested in, use <a href=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\">How COVID is Changing the World<\/a>.<\/p>\r\n<!-- \/wp:paragraph -->\r\n\r\n<!-- wp:paragraph -->\r\n<p><strong>Download Tabula<\/strong>:<\/p>\r\n<!-- \/wp:paragraph -->\r\n\r\n<!-- wp:list {\"ordered\":true} -->\r\n<ol>\r\n<li>Download the version of Tabula for your operating system:\r\n<ul>\r\n<li><strong>Windows:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-win-1.2.1.zip\">tabula-win.zip<\/a><\/li>\r\n<li><strong>Mac OS X:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-mac-1.2.1.zip\">tabula-mac.zip<\/a><\/li>\r\n<li><strong>Linux\/Other:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-jar-1.2.1.zip\">tabula-jar.zip<\/a> (see README.txt inside for instructions)<\/li>\r\n<\/ul>\r\n<\/li>\r\n<li>Extract the zip file. 
(Instructions:\u00a0<a href=\"http:\/\/windows.microsoft.com\/en-us\/windows-8\/zip-unzip-files\">Windows<\/a>,\u00a0<a href=\"http:\/\/support.apple.com\/kb\/PH10915\">Mac<\/a>)<\/li>\r\n<li>Go into the folder you just extracted. Run the \"Tabula\" program inside.<\/li>\r\n<li>A web browser will open. If it doesn't, open your web browser and go to\u00a0<a href=\"http:\/\/localhost:8080\/\">http:\/\/localhost:8080<\/a>. There's Tabula!<\/li>\r\n<li>Upload the PDF and extract the data.<\/li>\r\n<\/ol>\r\n<!-- \/wp:list -->\r\n\r\n<!-- wp:paragraph -->\r\n<p>&nbsp;<\/p>\r\n<!-- \/wp:paragraph --><\/div>\r\n<!-- \/wp:column -->\r\n\r\n<!-- wp:column {\"verticalAlignment\":\"center\",\"width\":33.33,\"className\":\"challengeExample\"} -->\r\n<div class=\"wp-block-column is-vertically-aligned-center challengeExample\" style=\"flex-basis: 33.33%\"><!-- wp:group -->\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:heading -->\r\n<h2>Resources<\/h2>\r\n<!-- \/wp:heading -->\r\n\r\n<!-- wp:paragraph -->\r\n<p><a href=\"https:\/\/tabula.technology\/\" data-type=\"URL\" data-id=\"https:\/\/tabula.technology\/\">Tabula<\/a>: Scrape Data Tables from PDFs<\/p>\r\n<!-- \/wp:paragraph -->\r\n\r\n<!-- wp:paragraph -->\r\n<p><a href=\"https:\/\/schoolofdata.org\/extracting-data-from-pdfs\/?__cf_chl_jschl_tk__=dc2b5bc440c268d603e9b6c759e399692a85f5ec-1606257626-0-AZETLkib7D3cHox44E2eZxsR7gl3t5TiqpG0khDk_gfT_Z-gQXaONOLoq0nkp5QbxXr7KRMWvReANSECjJL9XND-9iA-sUMn0408egrfEXsIs7bOGiLBSjtOMcesZEKayXleA3vXHDNkwc1kxBdtsr4SiS_HXae2YFYRQuapgoQ-rpAFfwjWBTLJJoMd7PSL-1TFMyCCSItCUTO-ZGNAbyNcZ1xL59VpxTFE-wkbtRAh7IgOU9Hagio3VgF6UrUdqixEDilGRfgPbwr91o2noqKE4rJgyvM1YwxTakEPB30RUV_VCy9n6J4APG8YWbJq6p0liEGPyauCPMoV2fDbOZ4\" data-type=\"URL\" 
data-id=\"https:\/\/schoolofdata.org\/extracting-data-from-pdfs\/?__cf_chl_jschl_tk__=dc2b5bc440c268d603e9b6c759e399692a85f5ec-1606257626-0-AZETLkib7D3cHox44E2eZxsR7gl3t5TiqpG0khDk_gfT_Z-gQXaONOLoq0nkp5QbxXr7KRMWvReANSECjJL9XND-9iA-sUMn0408egrfEXsIs7bOGiLBSjtOMcesZEKayXleA3vXHDNkwc1kxBdtsr4SiS_HXae2YFYRQuapgoQ-rpAFfwjWBTLJJoMd7PSL-1TFMyCCSItCUTO-ZGNAbyNcZ1xL59VpxTFE-wkbtRAh7IgOU9Hagio3VgF6UrUdqixEDilGRfgPbwr91o2noqKE4rJgyvM1YwxTakEPB30RUV_VCy9n6J4APG8YWbJq6p0liEGPyauCPMoV2fDbOZ4\">School of Data Tutorial:<\/a> Using Tabula to Scrape Data<\/p>\r\n<!-- \/wp:paragraph -->\r\n\r\n<!-- wp:paragraph -->\r\n<p>&nbsp;<\/p>\r\n<!-- \/wp:paragraph --><\/div>\r\n<\/div>\r\n<!-- \/wp:group -->\r\n\r\n<!-- wp:paragraph -->\r\n<p>&nbsp;<\/p>\r\n<!-- \/wp:paragraph --><\/div>\r\n<!-- \/wp:column --><\/div>\r\n<!-- \/wp:columns --><\/div>\r\n<\/div>\r\n<!-- \/wp:group -->\r\n\r\n<!-- wp:group -->\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container\"><!-- wp:heading {\"level\":3} -->\r\n<h3>Complete this Activity<\/h3>\r\n<!-- \/wp:heading -->\r\n\r\n<!-- wp:paragraph -->\r\n<p>After you complete this activity, please either export the data, import it into Google Sheets, and share links to the original PDF and the sheet in the comment box below. 
Or simply copy and paste one of the data tables into the comment box below.<\/p>\r\n<!-- \/wp:paragraph --><\/div>\r\n<\/div>\r\n<!-- \/wp:group --><\/div>\r\n<\/div>\r\n<!-- \/wp:group -->\r\n\r\n<!-- wp:verse {\"textAlign\":\"center\"} -->\r\n<pre class=\"wp-block-verse has-text-align-center\"><strong>Image Credit<\/strong>: Featured image: <a href=\"https:\/\/www.flickr.com\/photos\/hckyso\/1642543450\/\" target=\"_blank\" rel=\"noreferrer noopener\">On Videotape<\/a> by <a href=\"https:\/\/www.flickr.com\/photos\/hckyso\/\">Mitchell Joyce<\/a> (<a href=\"https:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/\" target=\"_blank\" rel=\"noreferrer noopener\">CC BY-NC 2.0<\/a>)<\/pre>\r\n<!-- \/wp:verse --><\/div>\r\n<\/div>\r\n<!-- \/wp:group -->\r\n<p>&nbsp;<\/p>","rendered":"\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<div class=\"wp-block-group emphasisbox\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<h4 class=\"wp-block-heading\">While the PDF format is a convenient replacement for paper, with complex permissions and security options, it can present barriers to accessing and manipulating data. An example is the report <a href=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\" data-type=\"url\" data-id=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\">How COVID is Changing the World, A Statistical Perspective<\/a>: although the document is licensed CC-BY, all of the data tables are &#8216;trapped&#8217; in the PDF format. 
Rather than manually entering the data tables into a spreadsheet, in this activity you will scrape the data tables out of a PDF.<\/h4>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n<hr class=\"wp-block-separator\" \/>\r\n\r\n\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\r\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis: 66.66%\">\r\n<p>For this activity, you will free data tables from PDFs and create a CSV file or an Excel sheet with the data. We will be using Tabula to do this.<\/p>\r\n\r\n\r\n\r\n<p><strong>Find a PDF<\/strong>: Find a document online that is openly licensed but available only in PDF form. If you cannot find a PDF you are interested in, use <a href=\"https:\/\/unstats.un.org\/unsd\/ccsa\/documents\/covid19-report-ccsa.pdf\">How COVID is Changing the World<\/a>.<\/p>\r\n\r\n\r\n\r\n<p><strong>Download Tabula<\/strong>:<\/p>\r\n\r\n\r\n\r\n<ol class=\"wp-block-list\">\r\n<li>Download the version of Tabula for your operating system:\r\n<ul>\r\n<li><strong>Windows:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-win-1.2.1.zip\">tabula-win.zip<\/a><\/li>\r\n<li><strong>Mac OS X:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-mac-1.2.1.zip\">tabula-mac.zip<\/a><\/li>\r\n<li><strong>Linux\/Other:<\/strong>\u00a0<a href=\"https:\/\/github.com\/tabulapdf\/tabula\/releases\/download\/v1.2.1\/tabula-jar-1.2.1.zip\">tabula-jar.zip<\/a> (see README.txt inside for instructions)<\/li>\r\n<\/ul>\r\n<\/li>\r\n<li>Extract the zip file. 
(Instructions:\u00a0<a href=\"http:\/\/windows.microsoft.com\/en-us\/windows-8\/zip-unzip-files\">Windows<\/a>,\u00a0<a href=\"http:\/\/support.apple.com\/kb\/PH10915\">Mac<\/a>)<\/li>\r\n<li>Go into the folder you just extracted. Run the &#8220;Tabula&#8221; program inside.<\/li>\r\n<li>A web browser will open. If it doesn&#8217;t, open your web browser and go to\u00a0<a href=\"http:\/\/localhost:8080\/\">http:\/\/localhost:8080<\/a>. There&#8217;s Tabula!<\/li>\r\n<li>Upload the PDF and extract the data.<\/li>\r\n<\/ol>\r\n\r\n\r\n\r\n<p>&nbsp;<\/p>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-column is-vertically-aligned-center challengeExample is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis: 33.33%\">\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<h2 class=\"wp-block-heading\">Resources<\/h2>\r\n\r\n\r\n\r\n<p><a href=\"https:\/\/tabula.technology\/\" data-type=\"url\" data-id=\"https:\/\/tabula.technology\/\">Tabula<\/a>: Scrape Data Tables from PDFs<\/p>\r\n\r\n\r\n\r\n<p><a href=\"https:\/\/schoolofdata.org\/extracting-data-from-pdfs\/?__cf_chl_jschl_tk__=dc2b5bc440c268d603e9b6c759e399692a85f5ec-1606257626-0-AZETLkib7D3cHox44E2eZxsR7gl3t5TiqpG0khDk_gfT_Z-gQXaONOLoq0nkp5QbxXr7KRMWvReANSECjJL9XND-9iA-sUMn0408egrfEXsIs7bOGiLBSjtOMcesZEKayXleA3vXHDNkwc1kxBdtsr4SiS_HXae2YFYRQuapgoQ-rpAFfwjWBTLJJoMd7PSL-1TFMyCCSItCUTO-ZGNAbyNcZ1xL59VpxTFE-wkbtRAh7IgOU9Hagio3VgF6UrUdqixEDilGRfgPbwr91o2noqKE4rJgyvM1YwxTakEPB30RUV_VCy9n6J4APG8YWbJq6p0liEGPyauCPMoV2fDbOZ4\" data-type=\"url\" 
data-id=\"https:\/\/schoolofdata.org\/extracting-data-from-pdfs\/?__cf_chl_jschl_tk__=dc2b5bc440c268d603e9b6c759e399692a85f5ec-1606257626-0-AZETLkib7D3cHox44E2eZxsR7gl3t5TiqpG0khDk_gfT_Z-gQXaONOLoq0nkp5QbxXr7KRMWvReANSECjJL9XND-9iA-sUMn0408egrfEXsIs7bOGiLBSjtOMcesZEKayXleA3vXHDNkwc1kxBdtsr4SiS_HXae2YFYRQuapgoQ-rpAFfwjWBTLJJoMd7PSL-1TFMyCCSItCUTO-ZGNAbyNcZ1xL59VpxTFE-wkbtRAh7IgOU9Hagio3VgF6UrUdqixEDilGRfgPbwr91o2noqKE4rJgyvM1YwxTakEPB30RUV_VCy9n6J4APG8YWbJq6p0liEGPyauCPMoV2fDbOZ4\">School of Data Tutorial:<\/a> Using Tabula to Scrape Data<\/p>\r\n\r\n\r\n\r\n<p>&nbsp;<\/p>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<p>&nbsp;<\/p>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<div class=\"wp-block-group\">\r\n<div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\r\n<h3 class=\"wp-block-heading\">Complete this Activity<\/h3>\r\n\r\n\r\n\r\n<p>After you complete this activity, please either export the data, import it into Google Sheets, and share links to the original PDF and the sheet in the comment box below. 
Or simply copy and paste one of the data tables into the comment box below.<\/p>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n<\/div>\r\n\r\n\r\n\r\n<pre class=\"wp-block-verse has-text-align-center\"><strong>Image Credit<\/strong>: Featured image: <a href=\"https:\/\/www.flickr.com\/photos\/hckyso\/1642543450\/\" target=\"_blank\" rel=\"noreferrer noopener\">On Videotape<\/a> by <a href=\"https:\/\/www.flickr.com\/photos\/hckyso\/\">Mitchell Joyce<\/a> (<a href=\"https:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/\" target=\"_blank\" rel=\"noreferrer noopener\">CC BY-NC 2.0<\/a>)<\/pre>\r\n<\/div>\r\n<\/div>\r\n\r\n<p>&nbsp;<\/p>","protected":false},"author":192,"menu_order":134,"template":"","meta":{"pb_show_title":"on","pb_short_title":"","pb_subtitle":"","pb_authors":[],"pb_section_license":""},"chapter-type":[],"contributor":[],"license":[],"class_list":["post-303","chapter","type-chapter","status-publish","hentry"],"part":359,"_links":{"self":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapters\/303","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapters"}],"about":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/wp\/v2\/types\/chapter"}],"author":[{"embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/wp\/v2\/users\/192"}],"version-history":[{"count":1,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapters\/303\/revisions"}],"predecessor-version":[{"id":501,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapters\/303\/revisions\/501"}],"part":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/parts\/359"}],"metadata":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapters\/303\/metadata\/"}],"wp:attachment":[{"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/wp\/v2\/media?parent=303"}],"wp:term":[
{"taxonomy":"chapter-type","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/pressbooks\/v2\/chapter-type?post=303"},{"taxonomy":"contributor","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/wp\/v2\/contributor?post=303"},{"taxonomy":"license","embeddable":true,"href":"https:\/\/pressbooks.bccampus.ca\/pose2\/wp-json\/wp\/v2\/license?post=303"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}