Preprocessing: Scaling and Dummies
We will do a few things:
- Open the dataset
- Scale what needs scaling
- Make dummy variables
- Save as a new file
## Start with packages
import pandas
import numpy as np
## this lets me datestamp files: a thing that is really nice
from datetime import date
import os
## We are also going to read in the student survey we looked at earlier.
## I'm using a more current version, because you guys filled it out.
survey = pandas.read_csv("data/Clean_Survey_2023-01-26.csv")
survey.describe().round(2)
 | Age | Height | Weight. | Old_Salary | Expected_Salary | Expenses | Transportation_Cost | Entertainment | Cellphone_Cost | Footsize | Alcohol | Sleep | Social_Network | Homework | Work | Coffee
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 164.00 | 154.00 | 161.00 | 156.00 | 146.00 | 154.00 | 144.00 | 157.00 | 160.00 | 145.00 | 139.00 | 163.00 | 150.00 | 155.00 | 159.00 | 158.00 |
mean | 27.90 | 171.12 | 77.88 | 2659.50 | 4566.69 | 1246.88 | 80.65 | 181.85 | 64.60 | 91.72 | 1.16 | 8.49 | 7.77 | 14.34 | 21.64 | 1.18 |
std | 8.16 | 10.94 | 42.60 | 1806.06 | 1518.18 | 983.81 | 74.80 | 160.52 | 25.46 | 859.31 | 1.47 | 9.91 | 6.09 | 11.79 | 17.66 | 1.00 |
min | 10.00 | 147.00 | 43.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 24.00 | 164.00 | 60.00 | 1400.00 | 3850.00 | 400.00 | 28.00 | 50.00 | 50.00 | 9.00 | 0.00 | 6.00 | 2.00 | 5.00 | 3.00 | 0.31 |
50% | 26.00 | 170.00 | 70.00 | 2700.00 | 4500.00 | 1000.00 | 50.00 | 100.00 | 60.00 | 9.88 | 0.00 | 7.00 | 7.00 | 10.00 | 20.00 | 1.00 |
75% | 30.00 | 179.75 | 82.00 | 4000.00 | 5400.00 | 2000.00 | 100.00 | 250.00 | 80.00 | 10.43 | 2.00 | 8.00 | 10.00 | 22.00 | 40.00 | 2.00 |
max | 90.00 | 220.00 | 506.00 | 8000.00 | 8600.00 | 4000.00 | 300.00 | 600.00 | 140.00 | 10316.00 | 5.00 | 78.00 | 25.00 | 46.00 | 60.00 | 4.00 |
survey.head()
 | Timestamp | program | Gender | Age | Height | Weight. | Old_Salary | Expected_Salary | Organization | Live_parents | … | Footsize | Alcohol | Tattoo | Sleep | Social_Network | Homework | Work | Travel | Tuition | Coffee
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2019/02/04 11:26:35 am GMT-8 | HRMG | Female | 22.0 | 177.80 | 110.0 | 1400.0 | 4000.0 | 3. Non-profit or Government Agency | No | … | 11.12 | 0.0 | Yes | 8.0 | 10.0 | 10.0 | 0.0 | Yes | No | 0.0 |
1 | 2019/02/04 11:27:45 am GMT-8 | HRMG | Female | 24.0 | 175.00 | 63.0 | 2000.0 | 3750.0 | 2. Small Company | Yes | … | 9.50 | 0.0 | Yes | 8.0 | 7.0 | 20.0 | 0.0 | Yes | Yes | 2.0 |
2 | 2019/02/04 11:27:46 am GMT-8 | HRMG | Male | 24.0 | 182.88 | 100.0 | 2000.0 | 3500.0 | 2. Small Company | Yes | … | 10.50 | 1.0 | No | 6.0 | 20.0 | 20.0 | 8.0 | Yes | No | 0.0 |
3 | 2019/02/04 11:29:06 am GMT-8 | HRMG | Female | 34.0 | NaN | 61.3 | 2500.0 | 3500.0 | 1. Large Corporation | Yes | … | 7.00 | 0.0 | Yes | 78.0 | 14.0 | 20.0 | 0.0 | Yes | No | 1.0 |
4 | 2019/02/04 11:29:12 am GMT-8 | HRMG | Female | 24.0 | 160.00 | 54.0 | 0.0 | 3800.0 | 2. Small Company | Yes | … | 8.00 | 3.0 | No | 67.0 | 6.0 | 11.0 | 0.0 | Yes | Yes | 1.0 |
5 rows × 27 columns
### We will deal with a smaller version of this dataset:
## df stands for DataFrame -> the pandas structure
## .dropna() will drop the rows with blanks, for today.
df = survey[["Age", "Height", "Coffee", "Tattoo"]].dropna()
df.head()
 | Age | Height | Coffee | Tattoo
---|---|---|---|---
0 | 22.0 | 177.80 | 0.0 | Yes |
1 | 24.0 | 175.00 | 2.0 | Yes |
2 | 24.0 | 182.88 | 0.0 | No |
4 | 24.0 | 160.00 | 1.0 | No |
5 | 24.0 | 164.00 | 0.0 | No |
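As a quick sanity check, we can compare row counts to see how many rows .dropna() threw away (a tiny sketch, assuming survey is still in memory):
## quick check: how many rows did .dropna() remove?
print("rows before: ", len(survey), " rows after: ", len(df))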
Look at some summary stats
## Let's look at our summary statistics!
df.describe(include = 'all').round(2)
 | Age | Height | Coffee | Tattoo
---|---|---|---|---
count | 148.00 | 148.00 | 148.00 | 148 |
unique | NaN | NaN | NaN | 2 |
top | NaN | NaN | NaN | No |
freq | NaN | NaN | NaN | 109 |
mean | 27.89 | 170.99 | 1.18 | NaN |
std | 8.06 | 11.13 | 1.00 | NaN |
min | 19.00 | 147.00 | 0.00 | NaN |
25% | 24.00 | 163.00 | 0.44 | NaN |
50% | 26.00 | 170.00 | 1.00 | NaN |
75% | 30.00 | 180.00 | 2.00 | NaN |
max | 90.00 | 220.00 | 4.00 | NaN |
Scaling
We will deal with the quantitative columns separately: we standardize each one to z-scores (subtract the mean, divide by the standard deviation), so every column ends up with mean 0 and standard deviation 1.
## make a smaller dataframe with just the quant columns:
dfq = df[["Age", "Height", "Coffee"]]
dfq.head()
 | Age | Height | Coffee
---|---|---|---
0 | 22.0 | 177.80 | 0.0 |
1 | 24.0 | 175.00 | 2.0 |
2 | 24.0 | 182.88 | 0.0 |
4 | 24.0 | 160.00 | 1.0 |
5 | 24.0 | 164.00 | 0.0 |
## we can quickly fix things with a lambda function
## If you're adventurous, explore the preprocessing functions of sklearn!
df2 = dfq.apply(lambda x: (x - np.mean(x)) / np.std(x))
df2.head()
 | Age | Height | Coffee
---|---|---|---
0 | -0.732898 | 0.613901 | -1.184843 |
1 | -0.483831 | 0.361372 | 0.822101 |
2 | -0.483831 | 1.072062 | -1.184843 |
4 | -0.483831 | -0.991465 | -0.181371 |
5 | -0.483831 | -0.630708 | -1.184843 |
## Let's summarize again!
df2.describe().round(2)
 | Age | Height | Coffee
---|---|---|---
count | 1.480000e+02 | 1.480000e+02 | 1.480000e+02 |
mean | -2.160434e-16 | 2.790561e-16 | 4.800964e-17 |
std | 1.003396e+00 | 1.003396e+00 | 1.003396e+00 |
min | -1.106500e+00 | -2.163923e+00 | -1.184843e+00 |
25% | -4.838307e-01 | -7.208974e-01 | -7.458238e-01 |
50% | -2.347631e-01 | -8.957364e-02 | -1.813708e-01 |
75% | 2.633722e-01 | 8.123175e-01 | 8.221012e-01 |
max | 7.735402e+00 | 4.419882e+00 | 2.829045e+00 |
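For the adventurous: here is a minimal sketch of the same z-scoring with scikit-learn's StandardScaler (assuming scikit-learn is installed). StandardScaler divides by the population standard deviation, just like np.std, so the numbers come out the same; df2_sk is just an illustrative name.
## same scaling via sklearn; wrap the array back into a DataFrame
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df2_sk = pandas.DataFrame(scaler.fit_transform(dfq), columns=dfq.columns, index=dfq.index)
df2_sk.describe().round(2)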
Dummy Variables
We will now deal with Tattoo, the one categorical column we kept: get_dummies turns each category into its own 0/1 column.
## this creates just the dummy columns:
tattoo_d = pandas.get_dummies(df['Tattoo'], prefix = 'Tattoo')
## look:
tattoo_d.head()
 | Tattoo_No | Tattoo_Yes
---|---|---
0 | 0 | 1 |
1 | 0 | 1 |
2 | 1 | 0 |
4 | 1 | 0 |
5 | 1 | 0 |
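Notice that Tattoo_No and Tattoo_Yes are perfectly redundant: every row is exactly one of the two. If you only want a single column, get_dummies can drop the first category for you; a small optional sketch (tattoo_d1 is just an illustrative name):
## drop_first=True keeps only Tattoo_Yes (categories are ordered alphabetically)
tattoo_d1 = pandas.get_dummies(df['Tattoo'], prefix='Tattoo', drop_first=True)
tattoo_d1.head()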
## Join things back up.
## You can use any join method you like: concat also works.
df3 = df2.join(tattoo_d)
df3
 | Age | Height | Coffee | Tattoo_No | Tattoo_Yes
---|---|---|---|---|---
0 | -0.732898 | 0.613901 | -1.184843 | 0 | 1 |
1 | -0.483831 | 0.361372 | 0.822101 | 0 | 1 |
2 | -0.483831 | 1.072062 | -1.184843 | 1 | 0 |
4 | -0.483831 | -0.991465 | -0.181371 | 1 | 0 |
5 | -0.483831 | -0.630708 | -1.184843 | 1 | 0 |
… | … | … | … | … | … |
159 | -0.234763 | -0.901276 | 0.822101 | 1 | 0 |
160 | -0.608365 | 0.271183 | -0.181371 | 1 | 0 |
162 | -0.359297 | -0.089574 | -0.181371 | 1 | 0 |
163 | 0.014305 | 0.631939 | -1.184843 | 1 | 0 |
164 | -0.234763 | 0.902507 | -0.181371 | 1 | 0 |
148 rows × 5 columns
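Since the comment above mentions concat, here is that version of the same join for reference (df3_alt is just an illustrative name):
## concat lines the two frames up side by side, matching on the index
df3_alt = pandas.concat([df2, tattoo_d], axis=1)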
Saving
We will save our clean dataset as a new file – now we don't need to use the old one.
Strangely enough, we did LESS well with scaling! This is something we would want to explore in the future.
## write! It's nice to datestamp files, if you make multiple ones:
## you will need to create the folder "clean_data", if you don't have one already
today = str(date.today())
filename = "clean_data/small_survey_" + today + ".csv"
df3.to_csv(filename)
print("saved to: ", filename)
saved to: clean_data/small_survey_2024-02-13.csv
## In my next notebook, I can just pull up the clean data:
dat = pandas.read_csv("clean_data/small_survey_2024-02-13.csv")
## If you are running this, you may need to change the date above.
print("New Shape: ", dat.shape)
dat.head()
New Shape: (148, 6)
 | Unnamed: 0 | Age | Height | Coffee | Tattoo_No | Tattoo_Yes
---|---|---|---|---|---|---
0 | 0 | -0.732898 | 0.613901 | -1.184843 | 0 | 1 |
1 | 1 | -0.483831 | 0.361372 | 0.822101 | 0 | 1 |
2 | 2 | -0.483831 | 1.072062 | -1.184843 | 1 | 0 |
3 | 4 | -0.483831 | -0.991465 | -0.181371 | 1 | 0 |
4 | 5 | -0.483831 | -0.630708 | -1.184843 | 1 | 0 |
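That Unnamed: 0 column is the old index, which to_csv wrote out as a regular column. A small optional sketch to pull it back in as the index instead (dat2 is just an illustrative name; again, adjust the date to match your file):
## index_col=0 treats the first CSV column as the index, not data
dat2 = pandas.read_csv("clean_data/small_survey_2024-02-13.csv", index_col=0)
dat2.head()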