51 Collecting Secondary Data

Many undergraduate researchers use secondary data because, while completing their studies, they have little time to design and administer surveys, carry out lengthy ethnographies, or recruit, interview, and analyze many participants. In addition, secondary data is everywhere, thanks to the vast amount of research that can be found online through keyword searches. It is therefore important that you know how to source the right kind of data, particularly for pre-defined coding methods. We will discuss this next.

Sourcing and Organizing Textual Data

There are many opportunities to use secondary textual data in your undergraduate projects (e.g., if you are analyzing media, websites, Twitter, or Instagram bios, you can simply find your data online). If you are not using an interpretive method, you will have to develop a sampling procedure before you start collecting (for example, a procedure for choosing which bios to include). That will require you to set a parameter relating to your research question (e.g., the bios of BC environmental activists on Twitter). This limited parameter will allow you to select a narrow range of online documents, which can be organized and read once your coding procedure is chosen.

For media analysis, this is more complex. Media analysis often focuses on a specific event, attempting to understand how that event is depicted in the news and whether different news stations share those representations. This will typically require you to (1) account for media bias in the news and (2) pick a narrow time frame and a narrow topic parameter to ensure that each article refers specifically to the event you are discussing.

If news sources are the unit of analysis, you will want to account for media bias. That is typically achieved by limiting the number of sources (say, The Province, The Sun, and CBC) and then sampling equally from among them (see Krippendorff, 2018, for additional strategies to account for media bias). If the research project is focused on themes from the event (with less emphasis on the source), you will have to decide on a sampling strategy to handle the voluminous data that you are likely to encounter (see the sampling strategies discussed earlier).
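Sampling equally from a limited set of sources can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the article records, the "source" field, and the function name sample_equally are hypothetical, and the number of articles drawn per source is something you would set for your own project.

```python
import random
from collections import defaultdict


def sample_equally(articles, per_source, seed=0):
    """Draw the same number of articles from each news source."""
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again

    # Group the articles by the source they came from.
    by_source = defaultdict(list)
    for article in articles:
        by_source[article["source"]].append(article)

    sample = []
    for source in sorted(by_source):
        pool = by_source[source]
        if len(pool) < per_source:
            raise ValueError(f"{source} has only {len(pool)} articles")
        sample.extend(rng.sample(pool, per_source))
    return sample


# Hypothetical input: ten placeholder articles for each of three sources.
articles = [
    {"source": source, "title": f"{source} article {i}"}
    for source in ("The Province", "The Sun", "CBC")
    for i in range(10)
]

sample = sample_equally(articles, per_source=3)
for article in sample:
    print(article["source"], "|", article["title"])
```

Fixing the random seed is a design choice: it lets you (or a reader of your methods section) reproduce exactly the same sample later.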

Apart from accounting for media bias in your sampling, you should also determine a strict time frame. For example, instead of searching “Uber in Vancouver”, you might want to narrow it down to “Uber in Vancouver 2020-2021”. Doing so will save you from having to pore over hundreds of articles unnecessarily. In addition, it is imperative that you further restrict the attributes of articles (i.e., characteristics that you are interested in, such as gender, media station, and time produced) so that you can limit and account for potential biases. Even in qualitative research, it is crucial that you have a cleanly organized, limited, and (as much as is possible) unbiased sample set.
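If you keep a simple record of each article you find, the time frame and attribute restrictions described above can be applied consistently with a small filter. The sketch below assumes hypothetical records with "title", "station", and "published" fields; the placeholder titles and dates are invented for illustration and are not real articles.

```python
from datetime import date


def in_scope(article, start, end, keywords, stations):
    """True if the article is inside the time frame, from a chosen
    station, and mentions every keyword in its title."""
    title = article["title"].lower()
    in_window = start <= article["published"] <= end
    on_topic = all(keyword.lower() in title for keyword in keywords)
    return in_window and on_topic and article["station"] in stations


# Hypothetical records (placeholder titles and dates).
articles = [
    {"title": "Placeholder: Uber in Vancouver, one year on",
     "station": "CBC", "published": date(2020, 5, 1)},
    {"title": "Placeholder: Uber expands elsewhere",
     "station": "CBC", "published": date(2020, 6, 1)},
    {"title": "Placeholder: Vancouver council debates Uber",
     "station": "The Province", "published": date(2018, 3, 5)},
]

kept = [
    article for article in articles
    if in_scope(
        article,
        start=date(2020, 1, 1),
        end=date(2021, 12, 31),
        keywords=["Uber", "Vancouver"],
        stations={"CBC", "The Province", "The Sun"},
    )
]
for article in kept:
    print(article["published"], article["station"], article["title"])
```

Only the first record passes: the second does not mention Vancouver, and the third falls outside the 2020-2021 window.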

After accounting for bias and applying restrictions to your sampling, you will want a system for collecting and organizing your data (e.g., Google Search). We suggest that you keep the search words consistent and then scan for articles relevant to your procedure (or select all if that is your sampling method). Another strategy is to go directly to the news repositories. If you are doing a social media analysis, you can simply search, then copy and paste the quotes into a corpus file. Alternatively, there are many new software tools for “newsgathering” which will allow you to find all the articles that match your keywords. The following box offers four of them.

NVivo and Twint are great tools for collecting, organizing, and analyzing your data (see Boxes 8.3 and 8.4 for Alexander and Bryan’s testimonials, respectively). If you are a UBC student, NVivo can be downloaded for free from UBC here. There are also annual courses about NVivo basics that are run by the library, which you can sign up for here. An alternative to NVivo is “RQDA,” a qualitative analysis tool for “R”, an open-source statistical programming language. This tool can perform both qualitative and quantitative analyses on your text (for example, counting how many times a particular code has occurred and in which cases).
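To make the idea of counting codes and cases concrete, here is a minimal sketch in plain Python. It does not use NVivo or RQDA and does not reflect how either tool works internally; the cases, the code labels, and the coded_segments list are hypothetical examples.

```python
from collections import Counter, defaultdict

# Hypothetical coding results: each pair is (case, code applied to one passage).
coded_segments = [
    ("The Province", "safety"),
    ("The Province", "convenience"),
    ("CBC", "safety"),
    ("CBC", "regulation"),
    ("CBC", "safety"),
]

# How many times has each code been applied overall?
code_counts = Counter(code for _, code in coded_segments)

# In which cases does each code appear?
cases_by_code = defaultdict(set)
for case, code in coded_segments:
    cases_by_code[code].add(case)

for code, count in code_counts.most_common():
    print(f"{code}: {count} time(s), in {sorted(cases_by_code[code])}")
```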


Box 8.3 Student Testimony – Collecting Secondary Textual Data

My Honours research investigated the discourse surrounding Uber’s integration into Vancouver, attempting to triangulate their advertising rhetoric with the conclusions of transportation regulation. Whereas accessing secondary data was easy, narrowing down the relevant data was a tougher task than it needed to be. In a bid to encompass the variety of mediums in which Uber advertised their service to Vancouver – on YouTube, social media, mainstream media, TED Talks, and within City Council hearings – I began without a stern judgement on which data to collect. This was a mistake. It took many hours of wading through Uber commentaries about Vancouver, government backlash, and laments about our taxi service before I turned my hopeless endeavour into a hopeful one. With an endlessly hashed topic like Uber, and an open method like qualitative analysis, I highly recommend you figure this out before beginning research: decide beforehand what the key data sources are, figure out how much of each source you have the time to represent, and nix the data sources you cannot fairly portray.

Balance of perspectives is vital to all academic research, but perhaps especially important when collecting secondary qualitative data. Secondary data is data which is mediated through the perspective of another. It was therefore especially important that I gathered data from multiple stations and speakers on the Uber issue, as secondary data from Uber representatives alone would have greatly restricted the information I was using and the conclusion I came to. Having stressed myself into this conclusion, I drew up a tiered list of the variability in my data sources and tested which I could access enough data from (i.e., major Vancouver Uber influencers: Uber representatives, mainstream media stations, and legislation (parliament and PTB)). With that done, I nixed the sources which I did not have the space to adequately represent and tried to sample equally between the sources I could adequately represent. Once I evolved from my data slob state and established data standards, I could then move on to the next major task of secondary data collection: actually collecting the data.

Once I had figured out that it was Vancouver mainstream media and four legislative documents I was including, I made a list of the stations I would be collecting from and the time frame of the reports that I was expecting. With the list made, I set about going to the archives of each of the main stations (The Province, The Vancouver Sun, The Vancouver Courier) and copying and pasting the articles into a corpus file, a collection of data (text) copied and pasted into one Word document. I then uploaded the corpus file into NVivo, a qualitative analysis program which tracks the codes and cases you make, where I did my coding and data organizing. In NVivo, I could neatly organize all my articles into different cases, where I could then track comments according to the station, time, and context of the claim. NVivo allowed me to connect many cases without compromising the qualities that made them distinct (I could not recommend it enough to students using secondary qualitative data!). Throughout, I kept a running list of my citations and links to have quick reference to the source.

To summarize:

  • Figure out the data you can access
  • Clarify the most relevant data in answering your question
  • Ensure variety in that data to limit bias (if there are key silences, seek them out)
  • Create or use a system for organizing that data (NVivo!)
  • Update your reference list as you go

Alexander Wilson, Sociology Honours student, 2020-2021

Box 8.4 Student Testimony – Scraping Twitter Text Using Twint and R

For my thesis on the relationship between social media and moral panics, I conducted a quantitative data analysis. I tokenized a Twitter dataset scraped with Twint (a Python-based tool) and processed it in R (a coding language used mostly by statisticians for data analysis and visualization) to show trends in when specific language linked to CRT and Cancel Culture pops up. Twint is essentially a data scraper that works much like Google does when it collects information from web pages, through a process known as “scraping” or “crawling”. In order to use Twint, you must first download Python and an interface for working with Python (like Microsoft Visual Studio Code). Python itself is just the coding language; the interface is what lets you write code and save your changes. From there, you will have to download Twint onto your computer in a specific way; follow the YouTube guide for more information. It is important to note that Twint technically does violate a part of the Terms of Service outlined by Twitter, as they prefer researchers to use their API. However, as an undergraduate student you cannot get access to their “research-level” API (which is most likely needed for research projects), as they have limited it to graduate students and faculty. The only way to get access would be to have your thesis supervisor make a submission on your behalf, and even then Twitter may reject the application. However, Twint does not violate the guidance for crawlers outlined in Twitter’s “robots.txt” file. This file essentially tells bots what parts of Twitter they may crawl, and Twint does not crawl areas of the site that are restricted by the “robots.txt” file.

Once you have Twint installed, all you have to do is point it at a Twitter handle and build a query in Visual Studio Code (VSC). By creating a file with your code in VSC, you essentially create a script that lets you run different queries and saves them so you can run them again at a later time. To begin, figure out the Twitter handle of the user you want to research. The “Twitter handle” refers to the “@” name of the user. For example, UBC Sociology’s Twitter handle is “@UBCSociology”. From here, you are going to create a Boolean query. A query is a string of keywords joined by the logical operators “OR”, “AND”, and “NOT”. These statements essentially direct Twint to pull certain tweets from the user that you have directed the program to. For example, if you used the query “SOCIOLOGY OR SOCI” and pointed Twint to the UBC Sociology page, Twint would only pull posts with the keywords “SOCIOLOGY” or “SOCI” from the UBC Sociology page. The query allows you to search for exactly the information you want. By pulling posts with the keywords you are interested in, you are also able to build a timeline of when posts increase or decrease, or even when certain terms appear! Moreover, you can use the dataset created from the data draw to closely analyze the full tweets in a textual/content analysis. This would provide even more context to the data that you have just drawn from Twitter.

Bryan Leung, Sociology Honours student, 2021-2022
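The keyword query and timeline described in Box 8.4 can be illustrated without Twint itself. The sketch below is not Twint’s API and makes no claim about how Twint or Twitter match keywords: it assumes you already have a list of collected tweets (the records here are invented placeholders), treats an “OR” query as “any keyword appears as a whole word”, and counts matching tweets per month.

```python
import re
from collections import Counter
from datetime import date


def matches_any(text, keywords):
    """OR query: True if any keyword appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(keyword.lower() in words for keyword in keywords)


def monthly_timeline(tweets, keywords):
    """Count matching tweets per month, returned in chronological order."""
    counts = Counter()
    for tweet in tweets:
        if matches_any(tweet["text"], keywords):
            counts[tweet["date"].strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


# Hypothetical tweets standing in for a scraped dataset.
tweets = [
    {"date": date(2021, 9, 3), "text": "Welcome back, Sociology students!"},
    {"date": date(2021, 9, 20), "text": "SOCI 100 office hours are posted."},
    {"date": date(2021, 10, 5), "text": "Campus is closed on Monday."},
    {"date": date(2021, 10, 12), "text": "New sociology seminar series announced."},
]

timeline = monthly_timeline(tweets, ["SOCIOLOGY", "SOCI"])
for month, count in timeline.items():
    print(month, count)
```

Matching on whole words rather than substrings is a deliberate choice here: a substring match on “SOCI” would also catch unrelated words such as “social” or “association”.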

Sourcing and Organizing Secondary Quantitative Data

There is an abundance of secondary data organized into reliable online data repositories which you can use to inform your study. While we will not discuss each of these repositories, you can check out UBC Library’s page of common data repositories for datasets from major primary data agencies such as StatCan and Abacus. Most of the data can be downloaded in formats for popular software such as SPSS, Jamovi, R, or another analysis program. A key part of collecting vast amounts of data is knowing how to organize it. In collecting quantitative data, you will want a strict data entry method and tools to accompany it. While you can enter your data directly into the major statistical programs (like SPSS), they each have their own file formats for saving and entering the data (like .sav), making your data harder to transfer elsewhere (Bhattacherjee, 2012, p. 120).
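One common way to keep data transferable is to enter and store it in a plain-text format such as CSV, which most statistical programs can import. The sketch below is an assumption about how you might do this with Python’s standard csv module, not a method prescribed by this chapter; the column names in FIELDS and the two example rows are hypothetical.

```python
import csv
import io

# Hypothetical codebook: the only columns a row is allowed to have.
FIELDS = ["respondent_id", "age", "province"]


def write_rows(rows, handle):
    """Write rows as CSV, rejecting any row whose columns differ from FIELDS."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        if set(row) != set(FIELDS):
            raise ValueError(f"unexpected columns in row: {row}")
        writer.writerow(row)


# Hypothetical data entries.
rows = [
    {"respondent_id": 1, "age": 21, "province": "BC"},
    {"respondent_id": 2, "age": 34, "province": "AB"},
]

# StringIO stands in for a real file, e.g. open("data.csv", "w", newline="").
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows the data survives the round trip
# (CSV stores every value as text, so numbers come back as strings).
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```

Checking each row against a fixed list of columns is one simple way to enforce the strict data entry method mentioned above.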


Bhattacherjee, A. (2012). Social Science Research: Principles, Methods, and Practices. https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1002&context=oa_textbooks

Marshall, S. (2013, August 2). 16 online tools for newsgathering. Journalism.co.uk.



Practicing and Presenting Social Research by Oral Robinson and Alexander Wilson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
