71 Student Testimonial: Data Visualizations with R
My thesis made use of simple graphics in R to present basic word frequencies of key terms in my data. R is an open-source programming language and environment for statistical computing, and it is as flexible as its language allows. It can organize information and visualize data through infographics, plots, charts, you name it. Despite its wide applicability, however, the language can be forbidding to the uninitiated coder. I found that each term and function can easily depend on subtler details of R's logic, which led to many late nights on Reddit forums trying to understand my botched attempts to make a simple graph. In the hope that my suffering with R can make the process easier for you, I have put together a simple five-step guide to descriptive statistics in R, along with some resources for further exploration.
Alexander Wilson, Sociology Honours student, 2020-2021
Download R, RStudio and Install a Plotting Package
- R For Windows: Download R-4.1.2 for Windows. The R-project for statistical computing.
- R For Mac: R for macOS (r-project.org)
- RStudio: Download the RStudio IDE – RStudio
Once you have downloaded and installed both, you should be able to start the RStudio application, which opens to a blank console. RStudio is not a package but an integrated development environment (IDE) that makes coding in R easier: it keeps your work tidier and autocompletes the functions you are typing.
After opening RStudio, type in (or copy):
install.packages("ggplot2")
This should install the latest version of the data visualization package ggplot2. Use straight quotation marks; curly quotes pasted from a word processor will cause an error.
Starting up ggplot
The following link will take you to the website of ggplot2, which has extra resources for downloading and a cheat sheet of the relevant functions you will need to know.
Plotting Package & Cheat Sheet: ggplot2 download | SourceForge.net
Once you have installed ggplot2, you will need to load it in each new R session before you can use it. The load function is below:
library(ggplot2)
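The install and load steps above can be combined so that the package is only downloaded when it is missing. This is a minimal sketch, assuming you have an internet connection and a default CRAN mirror configured (RStudio normally sets one for you).

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current session.
library(ggplot2)

# Print the installed version, which is useful when asking for help online.
packageVersion("ggplot2")
```

Running this at the top of a script means the same file works on a fresh computer and on one where ggplot2 is already installed.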
Input your data
To visualize data in R, you must first organize it within the system. A simple way to do this is to create basic quantitative variables. You can create a simple bivariate data frame in R like so:
data.frame(age = c(9, 10, 11, 12, 13), grade = c(6, 7, 8, 9, 10))
Where age has five cases, namely 9, 10, 11, 12, 13; and grade has five cases, 6, 7, 8, 9, 10. It is helpful to name the data.frame for simple use like so:
age_grade <- data.frame(age = c(9, 10, 11, 12, 13), grade = c(6, 7, 8, 9, 10))
From here, you can begin to plot simple descriptive statistics by typing “age_grade” into the ggplot functions outlined below.
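Before plotting, it can help to check that the data frame looks the way you expect. The sketch below uses only base R functions on the age_grade example; the variable names n_cases, mean_age and mean_grade are introduced here purely for illustration.

```r
# Recreate the example data frame from the text.
age_grade <- data.frame(age = c(9, 10, 11, 12, 13), grade = c(6, 7, 8, 9, 10))

# Show the structure: column names, types, and the first values.
str(age_grade)

# Show minimum, quartiles, mean, and maximum for each column.
summary(age_grade)

# A few individual descriptive statistics.
n_cases <- nrow(age_grade)            # number of rows (cases): 5
mean_age <- mean(age_grade$age)       # 11
mean_grade <- mean(age_grade$grade)   # 8
sd_age <- sd(age_grade$age)           # sample standard deviation of age
```

The $ operator pulls a single column out of the data frame, which is how most base R summary functions expect to receive their input.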
ggplot Legend and Functions
For instance, taking our previous example, you could create a simple box plot of our data.frame (age_grade).
First, you begin with the basic form of all ggplot functions:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
where:
data = your data frame (in this case age_grade)
mapping = the aesthetic mapping, set with the function aes(), which assigns your variables to the axes the chart uses (e.g. x and y). It typically runs like so: aes(x = age, y = grade).
GEOM_FUNCTION = the type of chart used to visualize your data (such as geom_boxplot() for box plots)
Put these together with your data like so:
ggplot(data = age_grade, mapping = aes(x = age, y = grade)) + geom_boxplot()
And you should be presented with the following simple chart:
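Because both age and grade are numeric, ggplot2 will usually draw a single box for the whole data set and may warn that a continuous x needs a group aesthetic. A scatter plot is one alternative that shows every case. The sketch below is an illustration rather than the chart from the thesis; the title and axis labels are assumed wording.

```r
library(ggplot2)

# Recreate the example data frame from the text.
age_grade <- data.frame(age = c(9, 10, 11, 12, 13), grade = c(6, 7, 8, 9, 10))

# Same data and mapping as the box plot, but with one point per case.
age_grade_scatter <- ggplot(data = age_grade, mapping = aes(x = age, y = grade)) +
  geom_point() +
  labs(title = "Grade by age", x = "Age (years)", y = "Grade")  # illustrative labels

# Typing the object's name draws the chart in the RStudio Plots pane.
age_grade_scatter
```

Only the geom function changes between chart types; the data and mapping arguments stay the same.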
Use the legend in the tables below to map your data onto many different visualizations. For simple repeated use, save your ggplot function like so:
age_grade_plot <- ggplot(data = age_grade, mapping = aes(x = age, y = grade))
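And then simply add the geom function you want to use to age_grade_plot:
age_grade_plot + geom_col()
Note that geom_bar() counts the rows in each category and accepts only one of x or y, so it will return an error with this mapping. When both x and y are mapped, use geom_col() or geom_bar(stat = "identity") instead.
And you should get that chart. Screenshot it and it is yours to present in your paper!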
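As an alternative to a screenshot, ggplot2 provides ggsave(), which writes a plot to an image file at a size and resolution you choose. This is a minimal sketch: the file name, dimensions and dpi are assumptions, and the file is written to R's temporary folder only so that the example leaves nothing behind.

```r
library(ggplot2)

# Recreate the saved plot object from the text.
age_grade <- data.frame(age = c(9, 10, 11, 12, 13), grade = c(6, 7, 8, 9, 10))
age_grade_plot <- ggplot(data = age_grade, mapping = aes(x = age, y = grade))

# Hypothetical output location; replace with a path in your own project folder.
out_file <- file.path(tempdir(), "age_grade_plot.png")

# Save a 6 x 4 inch PNG at 300 dpi, a common resolution for printed documents.
ggsave(filename = out_file, plot = age_grade_plot + geom_col(),
       width = 6, height = 4, dpi = 300)
```

A saved file usually looks sharper in a thesis than a screenshot, and rerunning the script regenerates it whenever the data changes.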
Table 10.5 - Terminology with R | |
---|---|
Term | Definition |
Data | The data you want to visualize, together with a set of aesthetic mappings that describe how variables in the data are shown (position, colour, size, etc.). |
Layers | Layers combine geometric objects, geoms for short, with statistical summaries of the data. Geoms are what you actually see on the plot: points, lines, polygons, and so forth. |
Scales | Scales map values in your data to visual values on the graphic, such as a position, colour, or size. They also draw the axes and the legend. |
Coord | Coord stands for coordinate system. It describes how data coordinates are mapped to the plane of the graphic, and it provides the axes and gridlines that make the graph readable. |
Faceting | Faceting breaks the data into smaller subsets and displays the same plot for each subset. |
Theme | The theme controls the finer points of presentation, such as font and background colour. |
Source: Wilson, A. (2021). Driver’s of Dissidence: A Discourse Analysis of Vancouver’s Road to Ride-Hailing. Undergraduate Thesis. (p. 13). |
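The terms in Table 10.5 can be seen working together in a single plot. The sketch below uses the mpg data set that ships with ggplot2; the choice of variables, axis label and limits is an assumption made for illustration, and grammar_plot is a hypothetical name.

```r
library(ggplot2)

grammar_plot <-
  # Data and aesthetic mappings: which variables go on x, y, and colour.
  ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = class)) +
  # Layer: draw one point per car.
  geom_point() +
  # Scale: control how the x variable is labelled on its axis.
  scale_x_continuous(name = "Engine displacement (litres)") +
  # Coord: zoom the y axis without dropping any data.
  coord_cartesian(ylim = c(10, 45)) +
  # Faceting: repeat the same plot for each model year.
  facet_wrap(~ year) +
  # Theme: change the overall look (fonts, background).
  theme_minimal()

grammar_plot
```

Each line after the first is optional, so you can start with the data and one layer and add the other pieces only when you need them.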
Table 10.6 - R Functions | |
---|---|
Term | Definition |
Getting Started | Basic structure: ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
Layers are added with +, as with geom_point() above. You can add colour as a further aesthetic, e.g. ggplot(mpg, aes(x = displ, y = hwy, colour = class)) + geom_point(). Faceting entails splitting the data into subsets and displaying the same graph for each subset. It is done with the function facet_wrap(). |
Geom Functions | geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
geom_boxplot() produces a box-and-whisker plot to summarize the distribution of a set of points. geom_histogram() and geom_freqpoly() show the distribution of continuous variables. geom_bar() shows the distribution of categorical variables. geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time. |
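Histograms and Frequency Polygons | ggplot(mpg, aes(hwy)) + geom_histogram()
Without a bin setting, R prints the message "stat_bin() using bins = 30" and suggests picking a better value with binwidth, e.g. ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5) |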
Bar Charts | geom_bar() |
Time Series with Line and Path Plots | ggplot(economics, aes(date, unemploy / pop)) +
geom_line() |
Source: Wickham, H. (2016). Getting started with ggplot2. ggplot2 (pp. 11-31). Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4_2 |
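The geoms in Table 10.6 can be tried directly on the mpg and economics data sets that ship with ggplot2. This is a short sketch under that assumption; the object names are hypothetical, and the bar chart of class is an illustrative choice of categorical variable.

```r
library(ggplot2)

# Histogram of highway mileage with an explicit bin width,
# which avoids the default "bins = 30" message.
hwy_histogram <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)

# The same distribution drawn as a frequency polygon.
hwy_freqpoly <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)

# Bar chart: geom_bar() counts the rows in each category of class.
class_bars <- ggplot(mpg, aes(class)) + geom_bar()

# Time series: unemployment as a share of population over time.
unemployment_line <- ggplot(economics, aes(date, unemploy / pop)) + geom_line()

# Print any of the objects to draw it, for example:
hwy_histogram
```

Storing each chart in its own object makes it easy to compare them one after another or to save them to files later.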
References
Wickham, H. (2016). Getting started with ggplot2. ggplot2 (pp. 11-31). Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4_2