Introduction to R

Amy Goldlist

10 Introduction to R

For the files associated with this Intro, including the Rmd file used to create this page, go to https://github.com/amygoldlist/BusinessAnalytics/tree/main/Introduction_to_R

Here’s some fun information for you to do on your own:

Installing R on your own computer:

Links:

There are two things to install. RStudio is the IDE (development environment), and R is the scripting language. They are both free!

R as a giant calculator:

## when I start a line with a #, it's a comment

### Try to understand how the code is working!


# The <- (or alt -)  assigns a variable
x <- 5
## now x is always 5


##  what is 8 times 5?
8*x

## [1] 40

## now make x be 7
x <- 7

## hmm, 8x is different!
8*x

## [1] 56

A data problem:

See the “GiG” worksheet, which is included in this bundle. This is from the textbook “Business Analytics: Communicating with Numbers” by Jaggia et al, available from McGraw Hill.

This data set contains Employees, categorized by Wage, Industry and Job. It contains missing info. Here’s a glimpse:

EmployeeID	HourlyWage	Industry	Job
20	26.09	Construction	Consultant
21	49.59	Construction
22	47.97	Construction	Accountant
23	48.77	Construction	Engineer
24	42.58		Sales Rep
25	49.7	Automotive	Engineer

Questions

We are now going to answer the following questions:

Find number of missing Values:

Hourly Wage
Industry
Job

The Number of employees who:

work in the automotive industry
Earn More than $30 per hour
Automotive Industry and earn more than $30 per hour

Find the Hourly wages:

Lowest:
Highest:
Lowest accountant in automotive:
Highest accountant in automotive:
Lowest accountant in tech:
Highest accountant in tech:

Let’s try Excel

And R:

Packages (or libraries)

Why recreate the wheel, when someone has already doen the work for us?

### if this is your first time using this, open these files:
# Dlete the # at the beginign of the line!

## Go to the end of each line and press ctrl+ enter
# install.packages("dplyr")
# install.packages("openxlsx")
# install.packages("ggplot2")
# install.packages("palmerpenguins")

## opening up these libraries or packages lets us use them
library(dplyr)  #dplyr is really great for organizing dataframes

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(openxlsx)  #openxlsx lets us read and write to Excel files. 
library(ggplot2) ## this is for making visualizations

## Warning: package 'ggplot2' was built under R version 4.0.5

library(palmerpenguins) # this is actually just cute penguins!

## Warning: package 'palmerpenguins' was built under R version 4.0.5

Open the file

## Wait, where am I on my computer?

getwd()

## [1] "S:/Personal Folders/2021_current/BABI_courses/Starting_R_for_BABI"

## Let's read in our Excel file.  The sheet option tells us which worksheet to use
gig <- read.xlsx("jaggia_ba_1e_ch02_Data_Files.xlsx",sheet ="Gig")

And let’s look at the file:

## how big is our data set?
dim(gig)

## [1] 604   4

## look at the first few rows
head(gig)

##   EmployeeID HourlyWage     Industry        Job
## 1          1      32.81 Construction    Analyst
## 2          2      46.00   Automotive   Engineer
## 3          3      43.13 Construction  Sales Rep
## 4          4      48.09   Automotive      Other
## 5          5      43.62   Automotive Accountant
## 6          6      46.98 Construction   Engineer

## look at the whole thing in a different window
# View(gig)  


## what are the columns like?  str = structure
str(gig)

## 'data.frame':    604 obs. of  4 variables:
##  $ EmployeeID: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ HourlyWage: num  32.8 46 43.1 48.1 43.6 ...
##  $ Industry  : chr  "Construction" "Automotive" "Construction" "Automotive" ...
##  $ Job       : chr  "Analyst" "Engineer" "Sales Rep" "Other" ...

Answering the questions

Find number of missing Values:

Hourly Wage

## ctrl shift m makes the cool "pipe" %>%
##step1:  pull up the gig dataset
gig %>% 
  ##step 2:  filter only the blank hourly wage
  filter(is.na(HourlyWage))

## [1] EmployeeID HourlyWage Industry   Job       
## <0 rows> (or 0-length row.names)

Industry

gig %>% 
  ## filter the blanks in Industry
  filter(is.na(Industry))

##    EmployeeID HourlyWage Industry        Job
## 1          24      42.58     <NA>  Sales Rep
## 2         139      42.18     <NA>   Engineer
## 3         361      31.33     <NA>      Other
## 4         378      48.09     <NA>      Other
## 5         441      32.35     <NA> Accountant
## 6         446      30.76     <NA> Accountant
## 7         479      42.85     <NA> Consultant
## 8         500      43.13     <NA>  Sales Rep
## 9         531      43.13     <NA>   Engineer
## 10        565      38.98     <NA> Accountant

Job

### and the blank jobs
gig %>% 
  filter(is.na(Job))

##    EmployeeID HourlyWage     Industry  Job
## 1          21      49.59 Construction <NA>
## 2          58      44.90 Construction <NA>
## 3          66      26.09 Construction <NA>
## 4          89      41.93 Construction <NA>
## 5         108      43.12 Construction <NA>
## 6         175      48.80   Automotive <NA>
## 7         212      30.74 Construction <NA>
## 8         253      44.90 Construction <NA>
## 9         291      26.09 Construction <NA>
## 10        347      26.09 Construction <NA>
## 11        355      45.00   Automotive <NA>
## 12        387      28.44 Construction <NA>
## 13        388      32.96 Construction <NA>
## 14        555      44.90 Construction <NA>
## 15        577      27.90   Automotive <NA>
## 16        593      48.98   Automotive <NA>

The Number of employees who:

work in the automotive industry

## Now lets try to count stuff using summarize...

## first take our whole data.frame
gig %>% 
  ## group by industry
  group_by(Industry) %>%
  #then count the numbers n()
  summarize(n())

## # A tibble: 4 x 2
##   Industry     `n()`
##   <chr>        <int>
## 1 Automotive     190
## 2 Construction   366
## 3 Tech            38
## 4 <NA>            10

Earn More than $30 per hour

gig %>% 
  ## filter by wage greater than $30
  filter(HourlyWage> 30) %>% 
  ##and count them
  count()

##     n
## 1 536

Automotive Industry and earn more than $30 per hour

gig %>% 
  filter(HourlyWage> 30) %>% 
  ## group by industry
  group_by(Industry) %>%
  #then count the numbers n()
  summarize(n())

## # A tibble: 4 x 2
##   Industry     `n()`
##   <chr>        <int>
## 1 Automotive     181
## 2 Construction   311
## 3 Tech            34
## 4 <NA>            10

Find the Hourly wages:

Lowest and Highest:

gig %>% 
  #then find maximum and minimum HourlyWage
  summarize(min(HourlyWage),max(HourlyWage))

##   min(HourlyWage) max(HourlyWage)
## 1           24.28              51

Lowest and Highest accountant in automotive / Tech :

## Step 1: pull up gig dataset
gig %>% 
  ## Step 2: filter by only Accountants
  filter(Job == "Accountant") %>% 
  ## Step 3: group by industry
  group_by(Industry) %>% 
  #then find maximum and minimum wage
  summarize(min(HourlyWage),max(HourlyWage))

## # A tibble: 4 x 3
##   Industry     `min(HourlyWage)` `max(HourlyWage)`
##   <chr>                    <dbl>             <dbl>
## 1 Automotive                28.7              49.3
## 2 Construction              24.3              49.9
## 3 Tech                      36.1              49.5
## 4 <NA>                      30.8              39.0

Pretty pictures

Just some basic plots:

## try commenting and uncommenting:

#plot(gig)


# gig %>% ggplot(aes(y =HourlyWage))+
#   geom_boxplot()
# 
# gig %>% ggplot(aes(y =HourlyWage, colour = Industry))+
#   geom_boxplot()+
#   theme_bw
# 
# gig %>% ggplot(aes(y =HourlyWage, colour = Job))+
#   geom_boxplot()+
#   theme_bw()
# 
# 
# 
# gig %>% ggplot(aes(x =HourlyWage))+
#   geom_histogram()+
#   theme_bw()
# 
 gig %>% ggplot(aes(x =HourlyWage))+
   geom_histogram(aes(fill = Industry))+
   theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Getting Fancy!

Let’s look at a different dataset. This one is all about penguins. Intrigued? (https://allisonhorst.github.io/palmerpenguins/)

## Look at the data



## Look at penguins

penguins

## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

References:

Take some time and learn about R from the experts: https://education.rstudio.com/learn/beginner/
This is an hour long webinar, but will give you a good overview of data wrangling in R: https://rstudio.com/resources/webinars/a-gentle-introduction-to-tidy-statistics-in-r/
Follow some tutorials at https://stat545.com/
Like this example? Buy the textbook: https://www.mheducation.com/highered/product/business-analytics-jaggia-kelly/M9781260785005.html

License

Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

Installing R on your own computer:

Links:

R as a giant calculator:

A data problem:

Questions

Find number of missing Values:

The Number of employees who:

Find the Hourly wages:

Let’s try Excel

And R:

Packages (or libraries)

Open the file

And let’s look at the file:

Answering the questions

Find number of missing Values:

The Number of employees who:

Find the Hourly wages:

Pretty pictures

Getting Fancy!

References:

License

Share This Book