😆 Big Data/- ML & DL

[ML]🚶‍♀️Simple purchase data로 머신러닝

또방91 2022. 3. 16. 15:08

728x90

[ML]🚶‍♀️Simple purchase data로 머신러닝

구매 예측하기!!¶

1. package 가져오기¶

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings(action='ignore')

In [2]:

os.listdir()

Out[2]:

['01SR_Data.csv',
 '02.Classification_with_Python.ipynb',
 '03.Classification_with_scikitlearn(Titanic).ipynb',
 '.ipynb_checkpoints',
 '01.Regression_with_Python.ipynb',
 '03Titanic_dataset.csv',
 '02Social_Network_Ads.csv']

In [3]:

df = pd.read_csv("02Social_Network_Ads.csv")

2. 데이터 프레임¶

In [4]:

df.head()

Out[4]:

	User ID	Gender	Age	EstimatedSalary	Purchased
0	15566689	Female	35.0	57000.0	0
1	15569641	Female	58.0	95000.0	1
2	15570769	Female	26.0	80000.0	0
3	15570932	Male	34.0	115000.0	0
4	15571059	Female	33.0	41000.0	0

3. 데이터 살펴보기¶

In [5]:

df.shape

Out[5]:

(400, 5)

In [6]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   User ID          400 non-null    int64  
 1   Gender           394 non-null    object 
 2   Age              390 non-null    float64
 3   EstimatedSalary  388 non-null    float64
 4   Purchased        400 non-null    int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 15.8+ KB

In [7]:

df.describe(include="all")

Out[7]:

	User ID	Gender	Age	EstimatedSalary	Purchased
count	4.000000e+02	394	390.000000	388.000000	400.000000
unique	NaN	2	NaN	NaN	NaN
top	NaN	Female	NaN	NaN	NaN
freq	NaN	202	NaN	NaN	NaN
mean	1.569154e+07	NaN	37.782051	69628.865979	0.357500
std	7.165832e+04	NaN	10.452300	33889.337949	0.479864
min	1.556669e+07	NaN	18.000000	15000.000000	0.000000
25%	1.562676e+07	NaN	30.000000	43000.000000	0.000000
50%	1.569434e+07	NaN	37.000000	70000.000000	0.000000
75%	1.575036e+07	NaN	46.000000	87250.000000	1.000000
max	1.581524e+07	NaN	60.000000	150000.000000	1.000000

4. feature/label 분리¶

In [8]:

df["User ID"].nunique()

Out[8]:

==> 중복된 ID가 없기때문에 feature에서 제외해도 무관함

In [9]:

feature = df.iloc[:,1:-1]
label = df.iloc[:, -1:]

In [10]:

feature.head()

Out[10]:

	Gender	Age	EstimatedSalary
0	Female	35.0	57000.0
1	Female	58.0	95000.0
2	Female	26.0	80000.0
3	Male	34.0	115000.0
4	Female	33.0	41000.0

In [11]:

label.head()

Out[11]:

	Purchased
0	0
1	1
2	0
3	0
4	0

5. 빠진 값 확인¶

In [12]:

df.isnull().sum()

Out[12]:

User ID             0
Gender              6
Age                10
EstimatedSalary    12
Purchased           0
dtype: int64

In [13]:

pd.DataFrame(feature.isnull().sum()).T

Out[13]:

	Gender	Age	EstimatedSalary
0	6	10	12

In [14]:

sns.barplot(data=pd.DataFrame(feature.isnull().sum()).T)

Out[14]:

<AxesSubplot:>

6. Clean Missing Data¶

In [15]:

missing= df.loc[(df.Age.isnull())|
       (df.EstimatedSalary.isnull())|
       (df.Gender.isnull())]

In [16]:

missing.shape

Out[16]:

(28, 5)

In [17]:

# 전체 결측치 해당 비율
missing.shape[0]/df.shape[0]*100

Out[17]:

7.000000000000001

In [18]:

feature1 = feature.copy()
feature2 = feature.copy()
label1 = label.copy()
label2 = label.copy()

삭제해도 괜찮다라고 판단되지만,
혹시 대치할만한 값을 찾을 수 있으니 탐색해보자!

6-1. numeric¶

결측치가 있는 Age와 Salary열을 확인함

In [19]:

df.describe()[["Age","EstimatedSalary"]]

Out[19]:

	Age	EstimatedSalary
count	390.000000	388.000000
mean	37.782051	69628.865979
std	10.452300	33889.337949
min	18.000000	15000.000000
25%	30.000000	43000.000000
50%	37.000000	70000.000000
75%	46.000000	87250.000000
max	60.000000	150000.000000

In [20]:

fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.distplot(df["Age"], ax=axes[0])
sns.distplot(df["EstimatedSalary"], ax=axes[1])

Out[20]:

<AxesSubplot:xlabel='EstimatedSalary', ylabel='Density'>

In [21]:

fig, axes = plt.subplots(1,2, figsize=(10,5))
sns.boxplot(data=df, y="Age", ax=axes[0])
sns.boxplot(data=df, y="EstimatedSalary", ax=axes[1])

Out[21]:

<AxesSubplot:ylabel='EstimatedSalary'>

고민해보기¶

In [22]:

age_p = df.Age.isnull().sum()/df.shape[0]*100
salary_p = df.EstimatedSalary.isnull().sum()/df.shape[0]*100

print(f'Age열에서 결측치는 전체 데이터 행 중 {age_p}%에 해당되고,')
print()
print(f'Estimated Salary열에서 결측치는 전체 데이터 행중 {salary_p}%에 해당됨')

Age열에서 결측치는 전체 데이터 행 중 2.5%에 해당되고,

Estimated Salary열에서 결측치는 전체 데이터 행중 3.0%에 해당됨

결과¶

drop 해주기로 정함

In [23]:

# 나이와 급여가 없는 것
not_a_s = feature1[(feature1.Age.isnull())|
                   (feature1.EstimatedSalary.isnull())]
not_a_s

Out[23]:

	Gender	Age	EstimatedSalary
16	Male	23.0	NaN
19	Female	NaN	47000.0
71	Female	41.0	NaN
92	Male	NaN	53000.0
106	Male	47.0	NaN
110	Male	49.0	NaN
127	Male	34.0	NaN
155	Female	26.0	NaN
221	Female	NaN	35000.0
230	Female	35.0	NaN
246	Male	NaN	75000.0
262	Male	30.0	NaN
264	Female	NaN	137000.0
269	Male	47.0	NaN
284	Male	32.0	NaN
289	Male	NaN	86000.0
303	Female	49.0	NaN
320	Female	NaN	47000.0
335	Female	NaN	138000.0
361	Male	NaN	15000.0
365	Female	37.0	NaN
382	Male	NaN	76000.0

In [24]:

not_a_s.index

Out[24]:

Int64Index([ 16,  19,  71,  92, 106, 110, 127, 155, 221, 230, 246, 262, 264,
            269, 284, 289, 303, 320, 335, 361, 365, 382],
           dtype='int64')

In [25]:

feature1.drop(not_a_s.index,inplace=True)

In [26]:

feature1.head()

Out[26]:

	Gender	Age	EstimatedSalary
0	Female	35.0	57000.0
1	Female	58.0	95000.0
2	Female	26.0	80000.0
3	Male	34.0	115000.0
4	Female	33.0	41000.0

In [27]:

# 잘 삭제 됨을 확인!
feature1.shape

Out[27]:

(378, 3)

In [28]:

# 혹시 모르니 평균값 대치방법으로 하나 만들어보자
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy="mean")

feature2.iloc[:,1:] = mean_imputer.fit_transform(feature2.iloc[:,1:])

feature2.isnull().sum()

Out[28]:

Gender             6
Age                0
EstimatedSalary    0
dtype: int64

6.2 string¶

In [29]:

df.describe(include="object")

Out[29]:

	Gender
count	394
unique	2
top	Female
freq	202

In [30]:

df[df["Gender"].isnull()]

Out[30]:

	User ID	Gender	Age	EstimatedSalary	Purchased
55	15598044	NaN	27.0	84000.0	0
162	15671766	NaN	26.0	72000.0	0
204	15694946	NaN	24.0	55000.0	0
287	15745083	NaN	26.0	80000.0	0
384	15807481	NaN	28.0	79000.0	0
386	15807837	NaN	48.0	33000.0	1

고민해보기¶

20대가 많기에 20대의 성별 최빈값 넣어야할지 고려하기

In [31]:

gender_20 = df.loc[(df.Age>=20) & (df.Age<30),["Gender"]]
gender_20

Out[31]:

	Gender
2	Female
5	Female
13	Female
16	Male
24	Male
...	...
375	Female
381	Male
384	NaN
389	Male
396	Male

84 rows × 1 columns

In [32]:

gender_20.Gender.unique()

Out[32]:

array(['Female', 'Male', nan], dtype=object)

In [33]:

gender_20.Gender.fillna("no",inplace=True)

In [34]:

gender_20.value_counts()

Out[34]:

Gender
Female    40
Male      39
no         5
dtype: int64

In [35]:

sns.countplot(data=gender_20, x="Gender")

Out[35]:

<AxesSubplot:xlabel='Gender', ylabel='count'>

남/여 비율 차이가 없기에, 대치불가

In [36]:

gender_p = df.Gender.isnull().sum()/df.shape[0]*100

f'결측치는 전체 데이터 행 중 {gender_p}%에 해당'

Out[36]:

'결측치는 전체 데이터 행 중 1.5%에 해당'

결과¶

-> drop 해주기

In [37]:

# 성별 없는것
not_g = df[df.Gender.isnull()]
not_g

Out[37]:

	User ID	Gender	Age	EstimatedSalary	Purchased
55	15598044	NaN	27.0	84000.0	0
162	15671766	NaN	26.0	72000.0	0
204	15694946	NaN	24.0	55000.0	0
287	15745083	NaN	26.0	80000.0	0
384	15807481	NaN	28.0	79000.0	0
386	15807837	NaN	48.0	33000.0	1

In [38]:

not_g.index

Out[38]:

Int64Index([55, 162, 204, 287, 384, 386], dtype='int64')

In [39]:

# 위에서 nuemeric drop 처리해준 feature에 더하여 drop 해주기
feature1= feature1.drop(not_g.index, axis=0)
feature1.head(1)

Out[39]:

	Gender	Age	EstimatedSalary
0	Female	35.0	57000.0

In [40]:

# label에서도 해당 index drop 해주기
hap = list(not_a_s.index) + list(not_g.index)
label1= label.drop(hap)

In [41]:

# index 정렬해주기
feature1.sort_index(inplace=True)
label1.sort_index(inplace=True)

In [42]:

# 혹시모르니 drop하지않고 최빈값으로
string_imputer = SimpleImputer(strategy='most_frequent')
feature2.iloc[:, 0] = string_imputer.fit_transform(feature2.iloc[:, 0:1])

feature2.isnull().sum()

Out[42]:

Gender             0
Age                0
EstimatedSalary    0
dtype: int64

7. One hot encoding¶

In [43]:

feature1

Out[43]:

	Gender	Age	EstimatedSalary
0	Female	35.0	57000.0
1	Female	58.0	95000.0
2	Female	26.0	80000.0
3	Male	34.0	115000.0
4	Female	33.0	41000.0
...	...	...	...
395	Male	40.0	107000.0
396	Male	27.0	20000.0
397	Male	57.0	60000.0
398	Male	31.0	66000.0
399	Female	45.0	131000.0

372 rows × 3 columns

In [44]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer( [("ohe",
                          OneHotEncoder(),[0])],
                       remainder= 'passthrough')

sub1 = ct.fit_transform(feature1)
sub2 = ct.fit_transform(feature2)

In [47]:

feature1.Gender = sub1
feature2.Gender = sub2

In [48]:

print(feature1,"\n\n=========\n\n", feature2)

     Gender   Age  EstimatedSalary
0       1.0  35.0          57000.0
1       1.0  58.0          95000.0
2       1.0  26.0          80000.0
3       0.0  34.0         115000.0
4       1.0  33.0          41000.0
..      ...   ...              ...
395     0.0  40.0         107000.0
396     0.0  27.0          20000.0
397     0.0  57.0          60000.0
398     0.0  31.0          66000.0
399     1.0  45.0         131000.0

[372 rows x 3 columns] 

=========

      Gender   Age  EstimatedSalary
0       1.0  35.0          57000.0
1       1.0  58.0          95000.0
2       1.0  26.0          80000.0
3       0.0  34.0         115000.0
4       1.0  33.0          41000.0
..      ...   ...              ...
395     0.0  40.0         107000.0
396     0.0  27.0          20000.0
397     0.0  57.0          60000.0
398     0.0  31.0          66000.0
399     1.0  45.0         131000.0

[400 rows x 3 columns]

8. split data¶

In [51]:

from sklearn.model_selection import train_test_split

X1_train, X1_test, y1_train, y1_test = train_test_split(feature1, 
                                                    label1,
                                                    test_size=0.2,
                                                    random_state=1)

X2_train, X2_test, y2_train, y2_test = train_test_split(feature2, 
                                                    label2,
                                                    test_size=0.2,
                                                    random_state=1)

print(X1_train.shape)
print(X1_test.shape)
print(X2_train.shape)
print(X2_test.shape)

(297, 3)
(75, 3)
(320, 3)
(80, 3)

9. train¶

In [81]:

# 로지스틱 회귀
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(X1_train, y1_train)
logistic.fit(X2_train, y2_train)

Out[81]:

LogisticRegression()

In [54]:

# 의사결정나무 분류
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X1_train, y1_train)
tree.fit(X2_train, y2_train)

y1_pred_tree = tree.predict(X1_test)
y2_pred_tree = tree.predict(X2_test)

10. Score¶

In [59]:

y1_pred = logistic.predict(X1_test)
y2_pred = logistic.predict(X2_test)
print(y1_pred)
print("\n\n=========\n\n")
print(y2_pred)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0]


=========


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]

11. Evaluate¶

In [66]:

from sklearn.metrics import accuracy_score, recall_score, precision_score

acc1 = accuracy_score(y1_test, y1_pred)
recall1 = recall_score(y1_test, y1_pred)
precision1 = precision_score(y1_test, y1_pred)

print(acc1)
print(recall1)
print(precision1)

print("\n=========\n")
acc2 = accuracy_score(y2_test, y2_pred)
recall2 = recall_score(y2_test, y2_pred)
precision2 = precision_score(y2_test, y2_pred)

print(acc2)
print(recall2)
print(precision2)

0.64
0.0
0.0

=========

0.6125
0.0
0.0

In [65]:

print(accuracy_score(y1_test, y1_pred_tree))

print("\n=========\n")

print(accuracy_score(y2_test, y2_pred_tree))

0.96

=========

0.9125

12. Confusion Matrix¶

In [64]:

from sklearn.metrics import confusion_matrix

cm1 = confusion_matrix(y1_test, y1_pred_tree)
cm2 = confusion_matrix(y2_test, y2_pred_tree)
print(cm1)
print("\n=========\n")
print(cm2)

[[48  0]
 [ 3 24]]

=========

[[46  3]
 [ 4 27]]

13. CM Visualize¶

In [79]:

plt.rc("font", family="Malgun Gothic")

In [80]:

fig, axes= plt.subplots (1,2, figsize=(10,5))
plt.suptitle("drop VS impute")
sns.set(font_scale=1.5)
sns.heatmap(cm1, linewidths=0.5, annot=True, ax=axes[0])
sns.heatmap(cm2, linewidths=0.5,annot=True, ax=axes[1], cmap="YlGnBu")

Out[80]:

<AxesSubplot:>

결론¶

결측치를 drop한 경우가 대치법(수치형은 평균, 문자형은 최빈)보다 더 예측을 잘한 걸로 생각됨!

728x90

'😆 Big Data > - ML & DL' 카테고리의 다른 글

[ML]📊1. Auto-MPG 데이터 - 단순 회귀 분석하기(Simple Linear Regression) (0)	2022.03.17
[ML]🛳️원본 Titanic data로 머신러닝하기 (0)	2022.03.16
[ML]🚶‍♀️Simple salary data로 ML warm-up하기 (0)	2022.03.15
[ML] 🤸 5. 피처 엔지니어링 (Feature Engineering) (0)	2022.03.01
[ML] 🤸 4. 머신러닝 알고리즘 평가 (0)	2022.03.01

현재글[ML]🚶‍♀️Simple purchase data로 머신러닝

코딩하는 간호사

[ML]🚶‍♀️Simple purchase data로 머신러닝

[ML]🚶‍♀️Simple purchase data로 머신러닝

구매 예측하기!!¶

1. package 가져오기¶

2. 데이터 프레임¶

3. 데이터 살펴보기¶

4. feature/label 분리¶

5. 빠진 값 확인¶

6. Clean Missing Data¶

6-1. numeric¶

고민해보기¶

결과¶

6.2 string¶

고민해보기¶

결과¶

7. One hot encoding¶

8. split data¶

9. train¶

10. Score¶

11. Evaluate¶

12. Confusion Matrix¶

13. CM Visualize¶

결론¶

'😆 Big Data > - ML & DL' 카테고리의 다른 글

'😆 Big Data/- ML & DL'의 다른글

티스토리툴바

[ML]🚶‍♀️Simple purchase data로 머신러닝

[ML]🚶‍♀️Simple purchase data로 머신러닝

구매 예측하기!!¶

1. package 가져오기¶

2. 데이터 프레임¶

3. 데이터 살펴보기¶

4. feature/label 분리¶

5. 빠진 값 확인¶

6. Clean Missing Data¶

6-1. numeric¶

고민해보기¶

결과¶

6.2 string¶

고민해보기¶

결과¶

7. One hot encoding¶

8. split data¶

9. train¶

10. Score¶

11. Evaluate¶

12. Confusion Matrix¶

13. CM Visualize¶

결론¶

'😆 Big Data > - ML & DL' 카테고리의 다른 글

'😆 Big Data/- ML & DL'의 다른글

관련글

티스토리툴바