10 Introduction to R

For the files associated with this Intro, including the Rmd file used to create this page, go to https://github.com/amygoldlist/BusinessAnalytics/tree/main/Introduction_to_R

Here’s some fun information for you to do on your own:

Installing R on your own computer:

R as a giant calculator:

## when I start a line with a #, it's a comment

### Try to understand how the code is working!


# The <- (or alt -)  assigns a variable
x <- 5
## now x is always 5


##  what is 8 times 5?
8*x
## [1] 40
## now make x be 7
x <- 7

## hmm, 8x is different!
8*x
## [1] 56

A data problem:

See the “GiG” worksheet, which is included in this bundle. This is from the textbook “Business Analytics: Communicating with Numbers” by Jaggia et al, available from McGraw Hill.

This data set contains Employees, categorized by Wage, Industry and Job. It contains missing info. Here’s a glimpse:

EmployeeID HourlyWage Industry Job
20 26.09 Construction Consultant
21 49.59 Construction
22 47.97 Construction Accountant
23 48.77 Construction Engineer
24 42.58 Sales Rep
25 49.7 Automotive Engineer

Questions

We are now going to answer the following questions:

Find number of missing Values:
  • Hourly Wage
  • Industry
  • Job
The Number of employees who:
  • work in the automotive industry
  • Earn More than $30 per hour
  • Automotive Industry and earn more than $30 per hour
Find the Hourly wages:
  • Lowest:
  • Highest:
  • Lowest accountant in automotive:
  • Highest accountant in automotive:
  • Lowest accountant in tech:
  • Highest accountant in tech:

Let’s try Excel

And R:

Packages (or libraries)

Why recreate the wheel, when someone has already doen the work for us?

### if this is your first time using this, open these files:
# Dlete the # at the beginign of the line!

## Go to the end of each line and press ctrl+ enter
# install.packages("dplyr")
# install.packages("openxlsx")
# install.packages("ggplot2")
# install.packages("palmerpenguins")

## opening up these libraries or packages lets us use them
library(dplyr)  #dplyr is really great for organizing dataframes
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(openxlsx)  #openxlsx lets us read and write to Excel files. 
library(ggplot2) ## this is for making visualizations
## Warning: package 'ggplot2' was built under R version 4.0.5
library(palmerpenguins) # this is actually just cute penguins!
## Warning: package 'palmerpenguins' was built under R version 4.0.5

Open the file

## Wait, where am I on my computer?

getwd()
## [1] "S:/Personal Folders/2021_current/BABI_courses/Starting_R_for_BABI"
## Let's read in our Excel file.  The sheet option tells us which worksheet to use
gig <- read.xlsx("jaggia_ba_1e_ch02_Data_Files.xlsx",sheet ="Gig")

And let’s look at the file:

## how big is our data set?
dim(gig)  
## [1] 604   4
## look at the first few rows
head(gig)  
##   EmployeeID HourlyWage     Industry        Job
## 1          1      32.81 Construction    Analyst
## 2          2      46.00   Automotive   Engineer
## 3          3      43.13 Construction  Sales Rep
## 4          4      48.09   Automotive      Other
## 5          5      43.62   Automotive Accountant
## 6          6      46.98 Construction   Engineer
## look at the whole thing in a different window
# View(gig)  


## what are the columns like?  str = structure
str(gig)
## 'data.frame':    604 obs. of  4 variables:
##  $ EmployeeID: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ HourlyWage: num  32.8 46 43.1 48.1 43.6 ...
##  $ Industry  : chr  "Construction" "Automotive" "Construction" "Automotive" ...
##  $ Job       : chr  "Analyst" "Engineer" "Sales Rep" "Other" ...

Answering the questions

Find number of missing Values:
  • Hourly Wage
## ctrl shift m makes the cool "pipe" %>%
##step1:  pull up the gig dataset
gig %>% 
  ##step 2:  filter only the blank hourly wage
  filter(is.na(HourlyWage))
## [1] EmployeeID HourlyWage Industry   Job       
## <0 rows> (or 0-length row.names)
  • Industry
gig %>% 
  ## filter the blanks in Industry
  filter(is.na(Industry))
##    EmployeeID HourlyWage Industry        Job
## 1          24      42.58     <NA>  Sales Rep
## 2         139      42.18     <NA>   Engineer
## 3         361      31.33     <NA>      Other
## 4         378      48.09     <NA>      Other
## 5         441      32.35     <NA> Accountant
## 6         446      30.76     <NA> Accountant
## 7         479      42.85     <NA> Consultant
## 8         500      43.13     <NA>  Sales Rep
## 9         531      43.13     <NA>   Engineer
## 10        565      38.98     <NA> Accountant
  • Job
### and the blank jobs
gig %>% 
  filter(is.na(Job))
##    EmployeeID HourlyWage     Industry  Job
## 1          21      49.59 Construction <NA>
## 2          58      44.90 Construction <NA>
## 3          66      26.09 Construction <NA>
## 4          89      41.93 Construction <NA>
## 5         108      43.12 Construction <NA>
## 6         175      48.80   Automotive <NA>
## 7         212      30.74 Construction <NA>
## 8         253      44.90 Construction <NA>
## 9         291      26.09 Construction <NA>
## 10        347      26.09 Construction <NA>
## 11        355      45.00   Automotive <NA>
## 12        387      28.44 Construction <NA>
## 13        388      32.96 Construction <NA>
## 14        555      44.90 Construction <NA>
## 15        577      27.90   Automotive <NA>
## 16        593      48.98   Automotive <NA>
The Number of employees who:
  • work in the automotive industry
## Now lets try to count stuff using summarize...

## first take our whole data.frame
gig %>% 
  ## group by industry
  group_by(Industry) %>%
  #then count the numbers n()
  summarize(n())
## # A tibble: 4 x 2
##   Industry     `n()`
##   <chr>        <int>
## 1 Automotive     190
## 2 Construction   366
## 3 Tech            38
## 4 <NA>            10
  • Earn More than $30 per hour
gig %>% 
  ## filter by wage greater than $30
  filter(HourlyWage> 30) %>% 
  ##and count them
  count()
##     n
## 1 536
  • Automotive Industry and earn more than $30 per hour
gig %>% 
  filter(HourlyWage> 30) %>% 
  ## group by industry
  group_by(Industry) %>%
  #then count the numbers n()
  summarize(n())
## # A tibble: 4 x 2
##   Industry     `n()`
##   <chr>        <int>
## 1 Automotive     181
## 2 Construction   311
## 3 Tech            34
## 4 <NA>            10
Find the Hourly wages:
  • Lowest and Highest:
gig %>% 
  #then find maximum and minimum HourlyWage
  summarize(min(HourlyWage),max(HourlyWage))
##   min(HourlyWage) max(HourlyWage)
## 1           24.28              51
  • Lowest and Highest accountant in automotive / Tech :
## Step 1: pull up gig dataset
gig %>% 
  ## Step 2: filter by only Accountants
  filter(Job == "Accountant") %>% 
  ## Step 3: group by industry
  group_by(Industry) %>% 
  #then find maximum and minimum wage
  summarize(min(HourlyWage),max(HourlyWage))
## # A tibble: 4 x 3
##   Industry     `min(HourlyWage)` `max(HourlyWage)`
##   <chr>                    <dbl>             <dbl>
## 1 Automotive                28.7              49.3
## 2 Construction              24.3              49.9
## 3 Tech                      36.1              49.5
## 4 <NA>                      30.8              39.0

Pretty pictures

Just some basic plots:

## try commenting and uncommenting:

#plot(gig)


# gig %>% ggplot(aes(y =HourlyWage))+
#   geom_boxplot()
# 
# gig %>% ggplot(aes(y =HourlyWage, colour = Industry))+
#   geom_boxplot()+
#   theme_bw
# 
# gig %>% ggplot(aes(y =HourlyWage, colour = Job))+
#   geom_boxplot()+
#   theme_bw()
# 
# 
# 
# gig %>% ggplot(aes(x =HourlyWage))+
#   geom_histogram()+
#   theme_bw()
# 
 gig %>% ggplot(aes(x =HourlyWage))+
   geom_histogram(aes(fill = Industry))+
   theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

image

Getting Fancy!

Let’s look at a different dataset. This one is all about penguins. Intrigued? (https://allisonhorst.github.io/palmerpenguins/)

## Look at the data



## Look at penguins

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

References:

License

Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

Business Analytics Copyright © by Amy Goldlist is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.

Share This Book