Preprocessing: Scaling and Dummies
We will do a few things:
- Open the dataset
- Scale what needs scaling
- Make dummy variables
- Save as a new file
## Start with packages
import pandas
import numpy as np
## this lets me datestamp files: a thing that is really nice
from datetime import date
import os
## We are also going to read in the student survey we looked at earlier.
## I'm using a more current version, because you guys filled it out.
survey = pandas.read_csv("data/Clean_Survey_2023-01-26.csv")
survey.describe().round(2)
 | Age | Height | Weight. | Old_Salary | Expected_Salary | Expenses | Transportation_Cost | Entertainment | Cellphone_Cost | Footsize | Alcohol | Sleep | Social_Network | Homework | Work | Coffee
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 164.00 | 154.00 | 161.00 | 156.00 | 146.00 | 154.00 | 144.00 | 157.00 | 160.00 | 145.00 | 139.00 | 163.00 | 150.00 | 155.00 | 159.00 | 158.00 |
mean | 27.90 | 171.12 | 77.88 | 2659.50 | 4566.69 | 1246.88 | 80.65 | 181.85 | 64.60 | 91.72 | 1.16 | 8.49 | 7.77 | 14.34 | 21.64 | 1.18 |
std | 8.16 | 10.94 | 42.60 | 1806.06 | 1518.18 | 983.81 | 74.80 | 160.52 | 25.46 | 859.31 | 1.47 | 9.91 | 6.09 | 11.79 | 17.66 | 1.00 |
min | 10.00 | 147.00 | 43.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 24.00 | 164.00 | 60.00 | 1400.00 | 3850.00 | 400.00 | 28.00 | 50.00 | 50.00 | 9.00 | 0.00 | 6.00 | 2.00 | 5.00 | 3.00 | 0.31 |
50% | 26.00 | 170.00 | 70.00 | 2700.00 | 4500.00 | 1000.00 | 50.00 | 100.00 | 60.00 | 9.88 | 0.00 | 7.00 | 7.00 | 10.00 | 20.00 | 1.00 |
75% | 30.00 | 179.75 | 82.00 | 4000.00 | 5400.00 | 2000.00 | 100.00 | 250.00 | 80.00 | 10.43 | 2.00 | 8.00 | 10.00 | 22.00 | 40.00 | 2.00 |
max | 90.00 | 220.00 | 506.00 | 8000.00 | 8600.00 | 4000.00 | 300.00 | 600.00 | 140.00 | 10316.00 | 5.00 | 78.00 | 25.00 | 46.00 | 60.00 | 4.00 |
survey.head()
 | Timestamp | program | Gender | Age | Height | Weight. | Old_Salary | Expected_Salary | Organization | Live_parents | … | Footsize | Alcohol | Tattoo | Sleep | Social_Network | Homework | Work | Travel | Tuition | Coffee
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2019/02/04 11:26:35 am GMT-8 | HRMG | Female | 22.0 | 177.80 | 110.0 | 1400.0 | 4000.0 | 3. Non-profit or Government Agency | No | … | 11.12 | 0.0 | Yes | 8.0 | 10.0 | 10.0 | 0.0 | Yes | No | 0.0 |
1 | 2019/02/04 11:27:45 am GMT-8 | HRMG | Female | 24.0 | 175.00 | 63.0 | 2000.0 | 3750.0 | 2. Small Company | Yes | … | 9.50 | 0.0 | Yes | 8.0 | 7.0 | 20.0 | 0.0 | Yes | Yes | 2.0 |
2 | 2019/02/04 11:27:46 am GMT-8 | HRMG | Male | 24.0 | 182.88 | 100.0 | 2000.0 | 3500.0 | 2. Small Company | Yes | … | 10.50 | 1.0 | No | 6.0 | 20.0 | 20.0 | 8.0 | Yes | No | 0.0 |
3 | 2019/02/04 11:29:06 am GMT-8 | HRMG | Female | 34.0 | NaN | 61.3 | 2500.0 | 3500.0 | 1. Large Corporation | Yes | … | 7.00 | 0.0 | Yes | 78.0 | 14.0 | 20.0 | 0.0 | Yes | No | 1.0 |
4 | 2019/02/04 11:29:12 am GMT-8 | HRMG | Female | 24.0 | 160.00 | 54.0 | 0.0 | 3800.0 | 2. Small Company | Yes | … | 8.00 | 3.0 | No | 67.0 | 6.0 | 11.0 | 0.0 | Yes | Yes | 1.0 |
5 rows × 27 columns
### We will deal with a smaller version of this dataset:
## df stands for DataFrame -> the pandas structure
## .dropna() will drop the rows with blanks, for today.
df = survey[["Age", "Height", "Coffee", "Tattoo"]].dropna()
df.head()
 | Age | Height | Coffee | Tattoo
---|---|---|---|---
0 | 22.0 | 177.80 | 0.0 | Yes |
1 | 24.0 | 175.00 | 2.0 | Yes |
2 | 24.0 | 182.88 | 0.0 | No |
4 | 24.0 | 160.00 | 1.0 | No |
5 | 24.0 | 164.00 | 0.0 | No |
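As a quick sanity check, we can compare row counts to see how many rows .dropna() threw away (a tiny sketch, assuming survey is still in memory):
## quick check: how many rows did .dropna() remove?
print("rows before: ", len(survey), " rows after: ", len(df))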
Look at some summary stats
## Let's look at our summary statistics!
df.describe(include = 'all').round(2)
 | Age | Height | Coffee | Tattoo
---|---|---|---|---
count | 148.00 | 148.00 | 148.00 | 148 |
unique | NaN | NaN | NaN | 2 |
top | NaN | NaN | NaN | No |
freq | NaN | NaN | NaN | 109 |
mean | 27.89 | 170.99 | 1.18 | NaN |
std | 8.06 | 11.13 | 1.00 | NaN |
min | 19.00 | 147.00 | 0.00 | NaN |
25% | 24.00 | 163.00 | 0.44 | NaN |
50% | 26.00 | 170.00 | 1.00 | NaN |
75% | 30.00 | 180.00 | 2.00 | NaN |
max | 90.00 | 220.00 | 4.00 | NaN |
Scaling
We will deal with the quantitative columns separately: we standardize each one to z-scores (subtract the mean, divide by the standard deviation), so every column ends up with mean 0 and standard deviation 1.
## make a smaller dataframe with just the quant columns:
dfq = df[["Age", "Height", "Coffee"]]
dfq.head()
 | Age | Height | Coffee
---|---|---|---
0 | 22.0 | 177.80 | 0.0 |
1 | 24.0 | 175.00 | 2.0 |
2 | 24.0 | 182.88 | 0.0 |
4 | 24.0 | 160.00 | 1.0 |
5 | 24.0 | 164.00 | 0.0 |
## we can quickly fix things with a lambda function
## If you're adventurous, explore the preprocessing functions of sklearn!
df2 = dfq.apply(lambda x: (x - np.mean(x)) / np.std(x))
df2.head()
 | Age | Height | Coffee
---|---|---|---
0 | -0.732898 | 0.613901 | -1.184843 |
1 | -0.483831 | 0.361372 | 0.822101 |
2 | -0.483831 | 1.072062 | -1.184843 |
4 | -0.483831 | -0.991465 | -0.181371 |
5 | -0.483831 | -0.630708 | -1.184843 |
## Let's summarize again!
df2.describe().round(2)
 | Age | Height | Coffee
---|---|---|---
count | 1.480000e+02 | 1.480000e+02 | 1.480000e+02 |
mean | -2.160434e-16 | 2.790561e-16 | 4.800964e-17 |
std | 1.003396e+00 | 1.003396e+00 | 1.003396e+00 |
min | -1.106500e+00 | -2.163923e+00 | -1.184843e+00 |
25% | -4.838307e-01 | -7.208974e-01 | -7.458238e-01 |
50% | -2.347631e-01 | -8.957364e-02 | -1.813708e-01 |
75% | 2.633722e-01 | 8.123175e-01 | 8.221012e-01 |
max | 7.735402e+00 | 4.419882e+00 | 2.829045e+00 |
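For the adventurous: here is a minimal sketch of the same z-scoring with scikit-learn's StandardScaler (assuming scikit-learn is installed). StandardScaler divides by the population standard deviation, just like np.std, so the numbers come out the same; df2_sk is just an illustrative name.
## same scaling via sklearn; wrap the array back into a DataFrame
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df2_sk = pandas.DataFrame(scaler.fit_transform(dfq), columns=dfq.columns, index=dfq.index)
df2_sk.describe().round(2)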
Dummy Variables
We will now deal with Tattoo, the one categorical column we kept: get_dummies turns each category into its own 0/1 column.
## this creates just the dummy columns:
tattoo_d = pandas.get_dummies(df['Tattoo'], prefix = 'Tattoo')
## look:
tattoo_d.head()
 | Tattoo_No | Tattoo_Yes
---|---|---
0 | 0 | 1 |
1 | 0 | 1 |
2 | 1 | 0 |
4 | 1 | 0 |
5 | 1 | 0 |
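Notice that Tattoo_No and Tattoo_Yes are perfectly redundant: every row is exactly one of the two. If you only want a single column, get_dummies can drop the first category for you; a small optional sketch (tattoo_d1 is just an illustrative name):
## drop_first=True keeps only Tattoo_Yes (categories are ordered alphabetically)
tattoo_d1 = pandas.get_dummies(df['Tattoo'], prefix='Tattoo', drop_first=True)
tattoo_d1.head()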
## Join things back up.
## You can use any join method you like: concat also works.
df3 = df2.join(tattoo_d)
df3
 | Age | Height | Coffee | Tattoo_No | Tattoo_Yes
---|---|---|---|---|---
0 | -0.732898 | 0.613901 | -1.184843 | 0 | 1 |
1 | -0.483831 | 0.361372 | 0.822101 | 0 | 1 |
2 | -0.483831 | 1.072062 | -1.184843 | 1 | 0 |
4 | -0.483831 | -0.991465 | -0.181371 | 1 | 0 |
5 | -0.483831 | -0.630708 | -1.184843 | 1 | 0 |
… | … | … | … | … | … |
159 | -0.234763 | -0.901276 | 0.822101 | 1 | 0 |
160 | -0.608365 | 0.271183 | -0.181371 | 1 | 0 |
162 | -0.359297 | -0.089574 | -0.181371 | 1 | 0 |
163 | 0.014305 | 0.631939 | -1.184843 | 1 | 0 |
164 | -0.234763 | 0.902507 | -0.181371 | 1 | 0 |
148 rows × 5 columns
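Since the comment above mentions concat, here is that version of the same join for reference (df3_alt is just an illustrative name):
## concat lines the two frames up side by side, matching on the index
df3_alt = pandas.concat([df2, tattoo_d], axis=1)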
Saving
We will save our clean dataset as a new file – now we don't need to use the old one.
Strangely enough, we did LESS well with scaling! This is something we would want to explore in the future.
## write! It's nice to datestamp files, if you make multiple ones:
## you will need to create the folder "clean_data", if you don't have one already
today = str(date.today())
filename = "clean_data/small_survey_" + today + ".csv"
df3.to_csv(filename)
print("saved to: ", filename)
saved to: clean_data/small_survey_2024-02-13.csv
## In my next notebook, I can just pull up the clean data:
dat = pandas.read_csv("clean_data/small_survey_2024-02-13.csv")
## If you are running this, you may need to change the date above.
print("New Shape: ", dat.shape)
dat.head()
New Shape: (148, 6)
 | Unnamed: 0 | Age | Height | Coffee | Tattoo_No | Tattoo_Yes
---|---|---|---|---|---|---
0 | 0 | -0.732898 | 0.613901 | -1.184843 | 0 | 1 |
1 | 1 | -0.483831 | 0.361372 | 0.822101 | 0 | 1 |
2 | 2 | -0.483831 | 1.072062 | -1.184843 | 1 | 0 |
3 | 4 | -0.483831 | -0.991465 | -0.181371 | 1 | 0 |
4 | 5 | -0.483831 | -0.630708 | -1.184843 | 1 | 0 |
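That Unnamed: 0 column is the old index, which to_csv wrote out as a regular column. A small optional sketch to pull it back in as the index instead (dat2 is just an illustrative name; again, adjust the date to match your file):
## index_col=0 treats the first CSV column as the index, not data
dat2 = pandas.read_csv("clean_data/small_survey_2024-02-13.csv", index_col=0)
dat2.head()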