
[ML]🚶‍♀️An ML warm-up with simple salary data

또방91 2022. 3. 15. 15:22


- Predicting salaries! -

Predicting salaries!

1. Importing packages

In [4]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os

In [45]:

pd.__version__

Out[45]:

'1.3.4'

In [5]:

os.listdir()

Out[5]:

['01SR_Data.csv',
 '02.Classification_with_Python.ipynb',
 '03.Classification_with_scikitlearn(Titanic).ipynb',
 '.ipynb_checkpoints',
 '01.Regression_with_Python.ipynb',
 '03Titanic_dataset.csv',
 '02Social_Network_Ads.csv']

2. Loading the data into a pandas DataFrame

In [6]:

df= pd.read_csv("01SR_Data.csv")

3. Inspecting the data

3-1. A first look at the data

In [19]:

df.shape

Out[19]:

(10, 4)

In [13]:

df.head()

Out[13]:

	Country	Age	Year	Salary
0	Spain	27.0	3.0	48000
1	Spain	NaN	6.0	52000
2	Germany	30.0	2.0	54000
3	France	35.0	NaN	58000
4	Spain	38.0	NaN	61000

In [14]:

df.columns

Out[14]:

Index(['Country', 'Age', 'Year', 'Salary'], dtype='object')

3-2. Checking the data info

In [20]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Country  10 non-null     object 
 1   Age      9 non-null      float64
 2   Year     7 non-null      float64
 3   Salary   10 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 448.0+ bytes

3-3. Summary statistics

In [12]:

df.describe(include="all")

Out[12]:

	Country	Age	Year	Salary
count	10	9.000000	7.000000	10.000000
unique	3	NaN	NaN	NaN
top	France	NaN	NaN	NaN
freq	4	NaN	NaN	NaN
mean	NaN	38.777778	9.142857	63500.000000
std	NaN	7.693793	6.817345	11597.413505
min	NaN	27.000000	2.000000	48000.000000
25%	NaN	35.000000	4.500000	55000.000000
50%	NaN	38.000000	7.000000	61000.000000
75%	NaN	44.000000	12.500000	70750.000000
max	NaN	50.000000	21.000000	83000.000000

4. Splitting into features and label

In [78]:

feature = df[['Country', 'Age', 'Year']]
label = df[['Salary']]

The same split can also be done by position with df.iloc[row, column].

==> So the feature extraction above can equally be written as
df.iloc[0:10, 0:3], df.iloc[:, :3], or df.iloc[:, :-1].
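As a quick check of the equivalence described above, here is a minimal sketch. The DataFrame is reconstructed from the tables printed in this post rather than read from 01SR_Data.csv, so treat it as an assumption about the file's contents.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the tables shown in this post (NaN = missing value).
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
    "Salary": [48000, 52000, 54000, 58000, 61000,
               61000, 67000, 72000, 79000, 83000],
})

# Selection by column name, as in the cell above.
by_name = df[["Country", "Age", "Year"]]

# Three positional spellings of "all rows, every column except the last".
by_range = df.iloc[0:10, 0:3]
by_first3 = df.iloc[:, :3]
by_drop_last = df.iloc[:, :-1]

# A list of positions ([-1]) keeps the result a DataFrame, like df[['Salary']].
label = df.iloc[:, [-1]]

print(by_name.equals(by_range), by_name.equals(by_first3), by_name.equals(by_drop_last))
```

df.iloc[:, :-1] is the spelling that keeps working if more feature columns are added later, as long as the label stays in the last column.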

In [53]:

feature.head()

Out[53]:

	Country	Age	Year
0	Spain	27.0	3.0
1	Spain	NaN	6.0
2	Germany	30.0	2.0
3	France	35.0	NaN
4	Spain	38.0	NaN

In [81]:

label.head()

Out[81]:

	Salary
0	48000
1	52000
2	54000
3	58000
4	61000

5. Filling missing values (mean)

In [22]:

df.isnull().sum()

Out[22]:

Country    0
Age        1
Year       3
Salary     0
dtype: int64

In [24]:

!pip install missingno

Collecting missingno
  Downloading missingno-0.5.1-py3-none-any.whl (8.7 kB)
Requirement already satisfied: scipy in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (1.7.1)
Requirement already satisfied: matplotlib in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (3.4.3)
Requirement already satisfied: seaborn in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (0.11.2)
Requirement already satisfied: numpy in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from missingno) (1.21.3)
Requirement already satisfied: python-dateutil>=2.7 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (3.0.4)
Requirement already satisfied: pillow>=6.2.0 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (8.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from matplotlib->missingno) (0.10.0)
Requirement already satisfied: six in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from cycler>=0.10->matplotlib->missingno) (1.15.0)
Requirement already satisfied: pandas>=0.23 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from seaborn->missingno) (1.3.4)
Requirement already satisfied: pytz>=2017.3 in /anaconda/envs/py38_tensorflow/lib/python3.8/site-packages (from pandas>=0.23->seaborn->missingno) (2021.3)
Installing collected packages: missingno
Successfully installed missingno-0.5.1

In [25]:

import missingno as msno
msno.matrix(df)

Out[25]:

<AxesSubplot:>

Method 1

In [34]:

df.mean(numeric_only=True)

Out[34]:

Age          38.777778
Year          9.142857
Salary    63500.000000
dtype: float64

In [37]:

df.fillna(df.mean(numeric_only=True),inplace=True)

In [38]:

df

Out[38]:

	Country	Age	Year	Salary
0	Spain	27.000000	3.000000	48000
1	Spain	38.777778	6.000000	52000
2	Germany	30.000000	2.000000	54000
3	France	35.000000	9.142857	58000
4	Spain	38.000000	9.142857	61000
5	Germany	40.000000	10.000000	61000
6	France	37.000000	7.000000	67000
7	France	44.000000	15.000000	72000
8	France	48.000000	9.142857	79000
9	Germany	50.000000	21.000000	83000
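The filled values above (38.777778 for Age, 9.142857 for Year) are simply the column means of the non-missing entries. A minimal sketch that reproduces them without inplace=True, on a DataFrame reconstructed from the tables in this post (an assumption about the CSV contents):

```python
import numpy as np
import pandas as pd

# Data reconstructed from the tables shown in this post (NaN = missing value).
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
    "Salary": [48000, 52000, 54000, 58000, 61000,
               61000, 67000, 72000, 79000, 83000],
})

# Column means skip NaN by default; numeric_only=True leaves out 'Country'.
col_means = df.mean(numeric_only=True)

# Without inplace=True, fillna returns a new frame and leaves df untouched.
filled = df.fillna(col_means)

print(filled.loc[[1, 3, 4, 8], ["Age", "Year"]])
```

Returning a new frame keeps the original with its NaN values around, which is handy when comparing several imputation strategies.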

Method 2

⭐Filling missing values with the mean, as above, but using sklearn

from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(feature.iloc[:, 1:])
feature.iloc[:, 1:] = mean_imputer.transform(feature.iloc[:, 1:])

or

feature.iloc[:, 1:] = mean_imputer.fit_transform(feature.iloc[:, 1:])

feature.isnull().sum()
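A runnable sketch of the SimpleImputer approach, under two assumptions that are not in the original cell: the data is reconstructed from the tables in this post, and feature is taken as an explicit .copy() so that assigning into it does not trigger pandas' SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Data reconstructed from the tables shown in this post (NaN = missing value).
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
    "Salary": [48000, 52000, 54000, 58000, 61000,
               61000, 67000, 72000, 79000, 83000],
})

# .copy() makes 'feature' an independent frame (assumption, see lead-in).
feature = df[["Country", "Age", "Year"]].copy()
num_cols = ["Age", "Year"]

mean_imputer = SimpleImputer(strategy="mean")
# fit learns the per-column means, transform fills the NaN entries with them.
feature[num_cols] = mean_imputer.fit_transform(feature[num_cols])

print(mean_imputer.statistics_)   # the learned column means
print(feature.isnull().sum())
```

Unlike df.fillna, the fitted imputer stores the means in statistics_, so the same values can later be applied to new data with transform.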

6. One-hot encoding

The Country column

Method 1

⭐One-hot encoding the Country column with sklearn's ColumnTransformer

In [150]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer( [("one_hot",OneHotEncoder(),[0])],
                      remainder= 'passthrough')

feature2 = ct.fit_transform(feature)
print(feature2)

[[ 0.          0.          1.         27.          3.        ]
 [ 0.          0.          1.         38.77777778  6.        ]
 [ 0.          1.          0.         30.          2.        ]
 [ 1.          0.          0.         35.          9.14285714]
 [ 0.          0.          1.         38.          9.14285714]
 [ 0.          1.          0.         40.         10.        ]
 [ 1.          0.          0.         37.          7.        ]
 [ 1.          0.          0.         44.         15.        ]
 [ 1.          0.          0.         48.          9.14285714]
 [ 0.          1.          0.         50.         21.        ]]

Method 2

In [44]:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

In [48]:

ohe_result = ohe.fit_transform(df.Country.values.reshape(-1,1))

In [50]:

# get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2
ohe_columns = ohe.get_feature_names_out(["Country"])

In [54]:

sub1 = pd.DataFrame(data= ohe_result.toarray(), columns=ohe_columns)

In [55]:

sub1.head()

Out[55]:

	Country_France	Country_Germany	Country_Spain
0	0.0	0.0	1.0
1	0.0	0.0	1.0
2	0.0	1.0	0.0
3	1.0	0.0	0.0
4	0.0	0.0	1.0

In [70]:

sub2 = feature.iloc[:,1:]
sub2

Out[70]:

	Age	Year
0	27.000000	3.000000
1	38.777778	6.000000
2	30.000000	2.000000
3	35.000000	9.142857
4	38.000000	9.142857
5	40.000000	10.000000
6	37.000000	7.000000
7	44.000000	15.000000
8	48.000000	9.142857
9	50.000000	21.000000

In [71]:

df2 = pd.concat([sub1,sub2], axis=1)
df2

Out[71]:

	Country_France	Country_Germany	Country_Spain	Age	Year
0	0.0	0.0	1.0	27.000000	3.000000
1	0.0	0.0	1.0	38.777778	6.000000
2	0.0	1.0	0.0	30.000000	2.000000
3	1.0	0.0	0.0	35.000000	9.142857
4	0.0	0.0	1.0	38.000000	9.142857
5	0.0	1.0	0.0	40.000000	10.000000
6	1.0	0.0	0.0	37.000000	7.000000
7	1.0	0.0	0.0	44.000000	15.000000
8	1.0	0.0	0.0	48.000000	9.142857
9	0.0	1.0	0.0	50.000000	21.000000
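For comparison, pandas can build the same table without scikit-learn. This is an alternative technique (pd.get_dummies), not the method used above, sketched on data reconstructed from the tables in this post; dtype=float is passed so the indicator columns match the 0.0/1.0 values shown.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the tables shown in this post, then mean-filled.
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
})
df = df.fillna(df.mean(numeric_only=True))

# One indicator column per country, named '<prefix>_<value>'.
dummies = pd.get_dummies(df["Country"], prefix="Country", dtype=float)

# Same layout as df2 above: indicator columns first, then Age and Year.
df2 = pd.concat([dummies, df[["Age", "Year"]]], axis=1)
print(df2.head())
```

One trade-off: get_dummies does not remember the categories it saw, whereas a fitted OneHotEncoder can re-apply the same encoding to new data.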

7. Split Data

In [72]:

from sklearn.model_selection import train_test_split

In [151]:

x_train, x_test, y_train, y_test =\
train_test_split(df2, label, test_size=0.2, random_state=42)

In [152]:

# Check the split!
print ('Training Set: %d rows\nTest Set: %d rows' % (x_train.shape[0], x_test.shape[0]))

Training Set: 8 rows
Test Set: 2 rows
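With 10 rows, test_size=0.2 holds out 2 rows, and a fixed random_state makes the same rows land in the test set on every run. A minimal sketch on data reconstructed from the tables in this post (the names X, y, and the *_again variables are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Data reconstructed from the tables shown in this post (NaN = missing value).
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
    "Salary": [48000, 52000, 54000, 58000, 61000,
               61000, 67000, 72000, 79000, 83000],
})
X = df.drop(columns="Salary")
y = df[["Salary"]]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Same seed, same shuffle: the second call selects identical rows.
_, x_test_again, _, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)
print(y_test.index.tolist(), y_test_again.index.tolist())
```

Features and labels are shuffled together, so x_test and y_test always share the same index.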

8. Train

8-1. Train

In [161]:

# Train a linear regression model!
from sklearn.linear_model import LinearRegression

model_1= LinearRegression()
model_1.fit(x_train, y_train)

Out[161]:

LinearRegression()

In [162]:

print(model_1.score(x_train, y_train))

0.9259552934028783

8-2. Train_2 DecisionTreeRegressor

In [165]:

from sklearn.tree import DecisionTreeRegressor

model_2 = DecisionTreeRegressor()
model_2.fit(x_train, y_train)

Out[165]:

DecisionTreeRegressor()

In [166]:

print(model_2.score(x_train, y_train))

1.0
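A training score of 1.0 means the unpruned tree reproduces every training label exactly, which by itself says little about unseen rows. The sketch below scores both models on the training and the test split side by side. It assumes data reconstructed from the tables in this post, uses pd.get_dummies in place of OneHotEncoder for brevity, and fixes random_state=0 on the tree for repeatability; none of these choices come from the original cells.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Data reconstructed from the tables shown in this post, then mean-filled.
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Germany", "France", "Spain",
                "Germany", "France", "France", "France", "Germany"],
    "Age": [27, np.nan, 30, 35, 38, 40, 37, 44, 48, 50],
    "Year": [3, 6, 2, np.nan, np.nan, 10, 7, 15, np.nan, 21],
    "Salary": [48000, 52000, 54000, 58000, 61000,
               61000, 67000, 72000, 79000, 83000],
})
filled = df.fillna(df.mean(numeric_only=True))

# One-hot encode 'Country' with pandas (assumption, see lead-in).
X = pd.get_dummies(filled[["Country", "Age", "Year"]],
                   columns=["Country"], dtype=float)
y = filled["Salary"]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),  # seed is an assumption
}
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # score() returns R^2; compare the training split with the held-out split.
    scores[name] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(name, "train R2:", scores[name][0], "test R2:", scores[name][1])
```

With only two test rows the test scores are very noisy, so they are best read as a sanity check rather than a reliable comparison.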

9. Score

9-1. Linear regression

In [173]:

# Linear regression model
predictions_1 = model_1.predict(x_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ','\n', np.round(predictions_1))
print()
print('Actual labels   : ','\n' ,y_test)

Predicted labels:  
 [[78391.]
 [63072.]]

Actual labels   :  
    Salary
8   79000
1   52000

9-2. DecisionTreeRegressor

In [174]:

# Decision tree

predictions_2 = model_2.predict(x_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ','\n', np.round(predictions_2))
print()
print('Actual labels   : ','\n' ,y_test)

Predicted labels:  
 [72000. 67000.]

Actual labels   :  
    Salary
8   79000
1   52000

10. Evaluate

In [181]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

10-1. Linear regression

In [183]:

mae1 = mean_absolute_error(y_test, predictions_1)
print("MAE:", mae1)

mse1 = mean_squared_error(y_test, predictions_1)
print("MSE:", mse1)

rmse1 = np.sqrt(mse1)
print("RMSE:", rmse1)
# Equivalent: mean_squared_error(y_test, predictions_1, squared=False)

r21 = r2_score(y_test, predictions_1)
print("R2:", r21)

MAE: 5840.257958546161
MSE: 61478826.03777762
RMSE: 7840.843451936636
R2: 0.6626676211918924

10-2. DecisionTreeRegressor

In [185]:

mae2 = mean_absolute_error(y_test, predictions_2)
print("MAE:", mae2)

mse2 = mean_squared_error(y_test, predictions_2)
print("MSE:", mse2)

rmse2 = np.sqrt(mse2)
print("RMSE:", rmse2)
# Equivalent: mean_squared_error(y_test, predictions_2, squared=False)

r22 = r2_score(y_test, predictions_2)
print("R2:", r22)

MAE: 11000.0
MSE: 137000000.0
RMSE: 11704.699910719624
R2: 0.2482853223593965
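The four metrics can be recomputed by hand from their definitions. A small sketch using the decision tree's two test predictions and the actual salaries printed in section 9-2 (the variable names are illustrative):

```python
import numpy as np

# Values taken from the section 9-2 output: test rows 8 and 1.
y_true = np.array([79000.0, 52000.0])
y_pred = np.array([72000.0, 67000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root of the MSE, in salary units

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2:", r2)
```

The results agree with the sklearn.metrics output above: MAE 11000, MSE 137000000, RMSE about 11704.70, and R2 about 0.2483.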


코딩하는 간호사
