🚶‍♀️ ML warm-up with simple salary data
- Predicting salary! -
Predicting salary!¶
1. Importing packages¶
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
In [45]:
pd.__version__
Out[45]:
'1.3.4'
In [5]:
os.listdir()
Out[5]:
['01SR_Data.csv',
'02.Classification_with_Python.ipynb',
'03.Classification_with_scikitlearn(Titanic).ipynb',
'.ipynb_checkpoints',
'01.Regression_with_Python.ipynb',
'03Titanic_dataset.csv',
'02Social_Network_Ads.csv']
2. Loading the data into a pandas DataFrame¶
In [6]:
df= pd.read_csv("01SR_Data.csv")
3. Inspecting the data¶
3-1. A first look at the data¶
In [19]:
df.shape
Out[19]:
(10, 4)
In [13]:
df.head()
Out[13]:
| | Country | Age | Year | Salary |
---|---|---|---|---|
0 | Spain | 27.0 | 3.0 | 48000 |
1 | Spain | NaN | 6.0 | 52000 |
2 | Germany | 30.0 | 2.0 | 54000 |
3 | France | 35.0 | NaN | 58000 |
4 | Spain | 38.0 | NaN | 61000 |
In [14]:
df.columns
Out[14]:
Index(['Country', 'Age', 'Year', 'Salary'], dtype='object')
3-2. Checking dtypes and non-null counts¶
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 10 non-null object
1 Age 9 non-null float64
2 Year 7 non-null float64
3 Salary 10 non-null int64
dtypes: float64(2), int64(1), object(1)
memory usage: 448.0+ bytes
3-3. Summary statistics¶
In [12]:
df.describe(include="all")
Out[12]:
| | Country | Age | Year | Salary |
---|---|---|---|---|
count | 10 | 9.000000 | 7.000000 | 10.000000 |
unique | 3 | NaN | NaN | NaN |
top | France | NaN | NaN | NaN |
freq | 4 | NaN | NaN | NaN |
mean | NaN | 38.777778 | 9.142857 | 63500.000000 |
std | NaN | 7.693793 | 6.817345 | 11597.413505 |
min | NaN | 27.000000 | 2.000000 | 48000.000000 |
25% | NaN | 35.000000 | 4.500000 | 55000.000000 |
50% | NaN | 38.000000 | 7.000000 | 61000.000000 |
75% | NaN | 44.000000 | 12.500000 | 70750.000000 |
max | NaN | 50.000000 | 21.000000 | 83000.000000 |
4. Splitting into feature/label¶
In [78]:
feature = df[['Country', 'Age', 'Year']]
label = df[['Salary']]
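- The split can also be done by position with df.iloc[rows, columns].
==> The same feature selection as above can be written as df.iloc[0:10, 0:3],
or df.iloc[:, :3],
or df.iloc[:, :-1].
All three return the first three columns (Country, Age, Year) for every row.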
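As a quick check of that equivalence, here is a minimal sketch on a small stand-in frame. The three rows are copied from the df.head() output; `toy` and the `by_*` names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for df: the first three rows shown by df.head()
toy = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany"],
    "Age": [27.0, np.nan, 30.0],
    "Year": [3.0, 6.0, 2.0],
    "Salary": [48000, 52000, 54000],
})

by_name = toy[["Country", "Age", "Year"]]   # select by column label
by_range = toy.iloc[0:len(toy), 0:3]        # explicit row and column ranges
by_first3 = toy.iloc[:, :3]                 # all rows, first three columns
by_drop_last = toy.iloc[:, :-1]             # all rows, every column except the last

# DataFrame.equals treats NaN in the same position as equal
print(by_name.equals(by_range), by_name.equals(by_first3), by_name.equals(by_drop_last))
```

The `:-1` form is the most convenient when the label is the last column, because it keeps working if more feature columns are added later.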
In [53]:
feature.head()
Out[53]:
| | Country | Age | Year |
---|---|---|---|
0 | Spain | 27.0 | 3.0 |
1 | Spain | NaN | 6.0 |
2 | Germany | 30.0 | 2.0 |
3 | France | 35.0 | NaN |
4 | Spain | 38.0 | NaN |
In [81]:
label.head()
Out[81]:
| | Salary |
---|---|
0 | 48000 |
1 | 52000 |
2 | 54000 |
3 | 58000 |
4 | 61000 |
5. Filling missing values (mean)¶
In [22]:
df.isnull().sum()
Out[22]:
Country 0
Age 1
Year 3
Salary 0
dtype: int64
In [24]:
!pip install missingno
Collecting missingno
Downloading missingno-0.5.1-py3-none-any.whl (8.7 kB)
Requirement already satisfied: scipy in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (1.7.1)
Requirement already satisfied: matplotlib in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (3.4.3)
Requirement already satisfied: seaborn in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (0.11.2)
Requirement already satisfied: numpy in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (1.21.3)
Requirement already satisfied: python-dateutil>=2.7 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (3.0.4)
Requirement already satisfied: pillow>=6.2.0 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (8.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (0.10.0)
Requirement already satisfied: six in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from cycler>=0.10->matplotlib->missingno) (1.15.0)
Requirement already satisfied: pandas>=0.23 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from seaborn->missingno) (1.3.4)
Requirement already satisfied: pytz>=2017.3 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from pandas>=0.23->seaborn->missingno) (2021.3)
Installing collected packages: missingno
Successfully installed missingno-0.5.1
In [25]:
import missingno as msno
msno.matrix(df)
Out[25]:
<AxesSubplot:>
Method 1¶
In [34]:
df.mean(numeric_only=True)
Out[34]:
Age 38.777778
Year 9.142857
Salary 63500.000000
dtype: float64
In [37]:
df.fillna(df.mean(numeric_only=True),inplace=True)
In [38]:
df
Out[38]:
| | Country | Age | Year | Salary |
---|---|---|---|---|
0 | Spain | 27.000000 | 3.000000 | 48000 |
1 | Spain | 38.777778 | 6.000000 | 52000 |
2 | Germany | 30.000000 | 2.000000 | 54000 |
3 | France | 35.000000 | 9.142857 | 58000 |
4 | Spain | 38.000000 | 9.142857 | 61000 |
5 | Germany | 40.000000 | 10.000000 | 61000 |
6 | France | 37.000000 | 7.000000 | 67000 |
7 | France | 44.000000 | 15.000000 | 72000 |
8 | France | 48.000000 | 9.142857 | 79000 |
9 | Germany | 50.000000 | 21.000000 | 83000 |
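The arithmetic behind Method 1 can be checked by hand: `mean` skips NaN, so the Age mean is the sum of the 9 known ages divided by 9, and the Year mean is the sum of the 7 known values divided by 7. The sketch below rebuilds the Age and Year columns from the table above with the imputed cells set back to NaN; `toy`, `col_means` and `filled` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Age and Year as in the table above, with the imputed cells set back to NaN
toy = pd.DataFrame({
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
})

col_means = toy.mean()            # NaN cells are skipped by default
filled = toy.fillna(col_means)    # each NaN gets the mean of its own column

print(col_means.round(6).to_dict())
print(int(filled.isnull().sum().sum()))   # number of NaN cells left
```

Unlike the notebook cell, this version returns a new frame instead of using inplace=True, which keeps the original data available for comparison.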
Method 2¶
⭐ Same result as above: fill missing values with the column mean, this time using sklearn
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(feature.iloc[:,1:])
feature.iloc[:,1:] = mean_imputer.transform(feature.iloc[:,1:])
or, in a single step:
feature.iloc[:,1:] = mean_imputer.fit_transform(feature.iloc[:,1:])
feature.isnull().sum()
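A self-contained sketch of the same idea, assuming a small hypothetical frame `num` in place of the numeric part of `feature`. It also shows that SimpleImputer returns a NumPy array, so the column names have to be reattached if a DataFrame is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the numeric columns of `feature`
num = pd.DataFrame({
    "Age": [27.0, np.nan, 30.0, 35.0],
    "Year": [3.0, 6.0, 2.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the column means and fills NaN in one call;
# the result is a NumPy array, so wrap it to keep column names and index
imputed = pd.DataFrame(imputer.fit_transform(num), columns=num.columns, index=num.index)

print(imputer.statistics_)   # learned per-column means
print(imputed)
```

Keeping the fitted imputer around is the main advantage over fillna: the same learned means can later be applied to new rows with `imputer.transform`.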
6. One hot encoding¶
- Country column
Method 1¶
⭐ One-hot encode the Country column with sklearn's ColumnTransformer
In [150]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer( [("one_hot",OneHotEncoder(),[0])],
remainder= 'passthrough')
feature2 = ct.fit_transform(feature)
print(feature2)
[[ 0. 0. 1. 27. 3. ]
[ 0. 0. 1. 38.77777778 6. ]
[ 0. 1. 0. 30. 2. ]
[ 1. 0. 0. 35. 9.14285714]
[ 0. 0. 1. 38. 9.14285714]
[ 0. 1. 0. 40. 10. ]
[ 1. 0. 0. 37. 7. ]
[ 1. 0. 0. 44. 15. ]
[ 1. 0. 0. 48. 9.14285714]
[ 0. 1. 0. 50. 21. ]]
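The printed array has no column names, so it helps to know the layout: the one-hot columns come first, in sorted category order (France, Germany, Spain), followed by the passthrough columns Age and Year. A minimal sketch on a hypothetical three-row stand-in for `feature` shows how to read the category order back from the fitted transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for `feature` after imputation
toy = pd.DataFrame({
    "Country": ["Spain", "Germany", "France"],
    "Age": [27.0, 30.0, 35.0],
    "Year": [3.0, 2.0, 9.0],
})

ct = ColumnTransformer([("one_hot", OneHotEncoder(), [0])], remainder="passthrough")
encoded = ct.fit_transform(toy)

# Categories are sorted, so the first three output columns are France, Germany, Spain
categories = ct.named_transformers_["one_hot"].categories_[0].tolist()
print(categories)
print(encoded)
```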
Method 2¶
In [44]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
In [48]:
ohe_result = ohe.fit_transform(df.Country.values.reshape(-1,1))
In [50]:
ohe_columns = ohe.get_feature_names_out(["Country"])
In [54]:
sub1 = pd.DataFrame(data= ohe_result.toarray(), columns=ohe_columns)
In [55]:
sub1.head()
Out[55]:
| | Country_France | Country_Germany | Country_Spain |
---|---|---|---|
0 | 0.0 | 0.0 | 1.0 |
1 | 0.0 | 0.0 | 1.0 |
2 | 0.0 | 1.0 | 0.0 |
3 | 1.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 1.0 |
In [70]:
sub2 = feature.iloc[:,1:]
sub2
Out[70]:
| | Age | Year |
---|---|---|
0 | 27.000000 | 3.000000 |
1 | 38.777778 | 6.000000 |
2 | 30.000000 | 2.000000 |
3 | 35.000000 | 9.142857 |
4 | 38.000000 | 9.142857 |
5 | 40.000000 | 10.000000 |
6 | 37.000000 | 7.000000 |
7 | 44.000000 | 15.000000 |
8 | 48.000000 | 9.142857 |
9 | 50.000000 | 21.000000 |
In [71]:
df2 = pd.concat([sub1,sub2], axis=1)
df2
Out[71]:
| | Country_France | Country_Germany | Country_Spain | Age | Year |
---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 27.000000 | 3.000000 |
1 | 0.0 | 0.0 | 1.0 | 38.777778 | 6.000000 |
2 | 0.0 | 1.0 | 0.0 | 30.000000 | 2.000000 |
3 | 1.0 | 0.0 | 0.0 | 35.000000 | 9.142857 |
4 | 0.0 | 0.0 | 1.0 | 38.000000 | 9.142857 |
5 | 0.0 | 1.0 | 0.0 | 40.000000 | 10.000000 |
6 | 1.0 | 0.0 | 0.0 | 37.000000 | 7.000000 |
7 | 1.0 | 0.0 | 0.0 | 44.000000 | 15.000000 |
8 | 1.0 | 0.0 | 0.0 | 48.000000 | 9.142857 |
9 | 0.0 | 1.0 | 0.0 | 50.000000 | 21.000000 |
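For comparison, pandas alone can build a similar frame with `pd.get_dummies`. This is a different technique from the OneHotEncoder steps above, shown only as a cross-check on a hypothetical three-row frame; note that the column order differs, since the encoded columns end up after Age and Year.

```python
import pandas as pd

# Hypothetical stand-in for `feature` after imputation
toy = pd.DataFrame({
    "Country": ["Spain", "Germany", "France"],
    "Age": [27.0, 30.0, 35.0],
    "Year": [3.0, 2.0, 9.0],
})

# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder
dummies = pd.get_dummies(toy, columns=["Country"], dtype=float)

# get_dummies appends the encoded columns after the untouched ones
print(dummies.columns.tolist())
print(dummies)
```

The OneHotEncoder route is still the safer choice inside a modelling pipeline, because the fitted encoder remembers the categories seen during training and can apply the same encoding to new data.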
7. Split Data¶
In [72]:
from sklearn.model_selection import train_test_split
In [151]:
x_train, x_test, y_train, y_test =\
train_test_split(df2, label, test_size=0.2, random_state=42)
In [152]:
# Check the split sizes!
print ('Training Set: %d rows\nTest Set: %d rows' % (x_train.shape[0], x_test.shape[0]))
Training Set: 8 rows
Test Set: 2 rows
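With 10 rows and test_size=0.2, two rows are held out and eight are used for training. Fixing random_state makes the shuffle reproducible, which is why the same two test rows appear every time the notebook is rerun. A minimal sketch with hypothetical arrays `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 hypothetical rows, 2 features
y = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# Same seed, same split: a second call returns identical test rows
_, _, _, y_te_again = train_test_split(X, y, test_size=0.2, random_state=42)

print(x_tr.shape[0], x_te.shape[0])
print(np.array_equal(y_te, y_te_again))
```

A two-row test set is far too small for a reliable estimate, so the scores that follow are best read as a walkthrough of the workflow rather than a real comparison of models.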
8. Train¶
8-1. Train¶
In [161]:
# Train the linear regression model!
from sklearn.linear_model import LinearRegression
model_1= LinearRegression()
model_1.fit(x_train, y_train)
Out[161]:
LinearRegression()
In [162]:
print(model_1.score(x_train, y_train))
0.9259552934028783
8-2. Train_2 DecisionTreeRegressor¶
In [165]:
from sklearn.tree import DecisionTreeRegressor
model_2 = DecisionTreeRegressor()
model_2.fit(x_train, y_train)
Out[165]:
DecisionTreeRegressor()
In [166]:
print(model_2.score(x_train, y_train))
1.0
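For regressors, `score` returns R² on whatever data it is given, so both numbers above are training scores. A decision tree with no depth limit can memorize training rows that have distinct feature values, which is why it reaches 1.0; that says little about how it behaves on unseen data. The sketch below illustrates this on hypothetical, deliberately non-linear data (not the salary data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical, deliberately non-linear data: y = x squared
X = np.arange(1, 9, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# score() returns R^2 on the data it is given, here the training data itself
linear_r2 = linear.score(X, y)
tree_r2 = tree.score(X, y)
print(linear_r2, tree_r2)
```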
9. Score¶
9-1. Linear regression¶
In [173]:
# Linear regression model
predictions_1 = model_1.predict(x_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ','\n', np.round(predictions_1))
print()
print('Actual labels : ','\n' ,y_test)
Predicted labels:
[[78391.]
[63072.]]
Actual labels :
Salary
8 79000
1 52000
9-2. DecisionTreeRegressor¶
In [174]:
# Decision tree
predictions_2 = model_2.predict(x_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ','\n', np.round(predictions_2))
print()
print('Actual labels : ','\n' ,y_test)
Predicted labels:
[72000. 67000.]
Actual labels :
Salary
8 79000
1 52000
10. Evaluate¶
In [181]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
10-1. Linear regression¶
In [183]:
mae1 = mean_absolute_error(y_test, predictions_1)
print("MAE:", mae1)
mse1 = mean_squared_error(y_test, predictions_1)
print("MSE:", mse1)
rmse1 = np.sqrt(mse1)
print("RMSE:", rmse1)
# Equivalent: mean_squared_error(y_test, predictions_1, squared=False)
r21 = r2_score(y_test, predictions_1)
print("R2:", r21)
MAE: 5840.257958546161
MSE: 61478826.03777762
RMSE: 7840.843451936636
R2: 0.6626676211918924
10-2. DecisionTreeRegressor¶
In [185]:
mae2 = mean_absolute_error(y_test, predictions_2)
print("MAE:", mae2)
mse2 = mean_squared_error(y_test, predictions_2)
print("MSE:", mse2)
rmse2 = np.sqrt(mse2)
print("RMSE:", rmse2)
# Equivalent: mean_squared_error(y_test, predictions_2, squared=False)
r22 = r2_score(y_test, predictions_2)
print("R2:", r22)
MAE: 11000.0
MSE: 137000000.0
RMSE: 11704.699910719624
R2: 0.2482853223593965
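The four metrics reduce to a few lines of arithmetic. The sketch below recomputes them in plain Python from the two DecisionTreeRegressor predictions and the two actual salaries shown above; the variable names are hypothetical.

```python
import math

actual = [79000, 52000]      # y_test (rows 8 and 1)
predicted = [72000, 67000]   # DecisionTreeRegressor predictions shown above

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / n       # mean absolute error
mse = sum(e * e for e in errors) / n        # mean squared error
rmse = math.sqrt(mse)                       # root mean squared error

mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)                     # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
r2 = 1 - ss_res / ss_tot                                # coefficient of determination

print(mae, mse, rmse, r2)
```

The errors here are 7000 and -15000, so the squared metrics are dominated by the single large miss on row 1, which is why the tree's R² on the test set is so much lower than its perfect training score.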