
[ML]🛳️Machine Learning with the Original Titanic Data

또방91 2022. 3. 16. 15:14


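You have probably come across the Titanic data many times on Kaggle or in Seaborn!

But those are all edited versions of the data!

I got hold of the original titanic data, so let's do another round of proper EDA!

Looking through it, some of the column names are rather grim.... like body... which looks like some kind of number...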

Titanic Passenger Survival Prediction: Classification with Python

In [1]:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings(action='ignore')

In [2]:

os.listdir()

Out[2]:

['01SR_Data.csv',
 '02.Classification_with_Python.ipynb',
 '03.Classification_with_scikitlearn(Titanic).ipynb',
 '.ipynb_checkpoints',
 '01.Regression_with_Python.ipynb',
 '03Titanic_dataset.csv',
 '02Social_Network_Ads.csv',
 'Pandas_Slicing_Practice.ipynb']

1. Loading the data

In [3]:

df = pd.read_csv("03Titanic_dataset.csv")
df.head()

Out[3]:

	pclass	survived	name	gender	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

2. Inspecting the data

In [4]:

df.columns

Out[4]:

Index(['pclass', 'survived', 'name', 'gender', 'age', 'sibsp', 'parch',
       'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [5]:

df.shape

Out[5]:

(1309, 14)

In [6]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   gender     1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB

3. Checking missing values

In [7]:

df.isnull().sum()

Out[7]:

pclass          0
survived        0
name            0
gender          0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

4. Removing unused features

4-1. Split label and features first

In [8]:

df.head(1)

Out[8]:

	pclass	survived	name	gender	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO

In [9]:

label = df["survived"]
feature = df.drop(columns="survived")

In [10]:

feature.head()

Out[10]:

	pclass	name	gender	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

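4-2. Dropping unnecessary columns

Drop columns that have many missing values and carry little meaning.
- Columns with roughly half or more of their values missing are candidates for removal: cabin, boat, body, home.dest

cabin is related to survival, but the layout of the ship is hard to work out, so it is dropped.
boat is dropped because whether a passenger survived matters more than the lifeboat number.
body is dropped because survival can already be checked with survived.
home.dest could be used to judge the effect on survival from whether the home or destination city is a wealthy one, but pclass or cabin should capture that more directly, so it is dropped.

==> Drop cabin, boat, body, home.dest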
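The "roughly half missing" rule can be checked numerically. The sketch below reuses the missing-value counts printed by df.isnull().sum() in section 3; the 0.4 cutoff is an assumption picked for illustration, not a value stated in this post.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output in section 3.
n_rows = 1309
missing = pd.Series({
    "pclass": 0, "survived": 0, "name": 0, "gender": 0, "age": 263,
    "sibsp": 0, "parch": 0, "ticket": 0, "fare": 1, "cabin": 1014,
    "embarked": 2, "boat": 823, "body": 1188, "home.dest": 564,
})

# Share of rows with a missing value in each column.
missing_ratio = missing / n_rows

# Hypothetical cutoff (an assumption): flag columns with more than 40% missing.
threshold = 0.4
drop_candidates = missing_ratio[missing_ratio > threshold].index.tolist()

print(missing_ratio.round(3).sort_values(ascending=False).head(5))
print(drop_candidates)
```

With this cutoff the flagged columns are the same four that are dropped here; age, at about 20% missing, stays and is imputed later.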

In [11]:

feature.shape

Out[11]:

(1309, 13)

In [12]:

# check the drop expression before applying it
feature.drop(columns=["cabin","boat","body","home.dest"]).head()

Out[12]:

	pclass	name	gender	age	sibsp	parch	ticket	fare	embarked
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	S
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	S
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	S
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	S
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	S

In [13]:

feature = feature.drop(columns=["cabin","boat","body","home.dest"])

# check that the columns were dropped
feature.shape

Out[13]:

(1309, 9)

Drop one more unnecessary column

In [14]:

# drop ticket
feature.drop(columns="ticket", inplace=True)

5. Impute - Fare

The missing value will be replaced with the median fare of the passenger's pclass.

In [15]:

feature.groupby(feature.pclass).fare.median()

Out[15]:

pclass
1    60.0000
2    15.0458
3     8.0500
Name: fare, dtype: float64

In [16]:

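# assign the result back; an inplace fillna on an attribute-accessed column may not update the frame in recent pandas
feature["fare"] = feature["fare"].fillna(feature.groupby("pclass")["fare"].transform("median"))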

In [17]:

feature.fare.isnull().sum()

Out[17]:

0

In [18]:

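# distplot is deprecated in recent seaborn; histplot with a KDE overlay draws a comparable density plot
sns.histplot(df.fare, kde=True, stat="density")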

Out[18]:

<AxesSubplot:xlabel='fare', ylabel='Density'>
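To make the group-median fill concrete, here is a minimal sketch on made-up toy rows (the toy values and the fare_filled column name are assumptions for illustration, not rows from the dataset). transform("median") returns one median per row, aligned with the original index, so it can be passed straight to fillna.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two rows have a missing fare.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "fare": [60.0, 80.0, np.nan, 15.0, 13.0, 8.05, np.nan, 7.75],
})

# Median fare of each row's own pclass, aligned to the original rows.
group_median = toy.groupby("pclass")["fare"].transform("median")

# Only the missing entries are replaced; existing fares are left untouched.
toy["fare_filled"] = toy["fare"].fillna(group_median)

print(toy)
```

The missing 1st-class fare receives the 1st-class median and the missing 3rd-class fare receives the 3rd-class median, rather than one global value.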

6. Impute - Age

In [19]:

feature.age.isnull().sum()

Out[19]:

263

In [20]:

feature.sample(5)

Out[20]:

	pclass	name	gender	age	sibsp	parch	fare	embarked
658	3	Baclini, Miss. Helene Barbara	female	0.75	2	1	19.2583	C
1270	3	Vande Walle, Mr. Nestor Cyriel	male	28.00	0	0	9.5000	S
200	1	McCaffry, Mr. Thomas Francis	male	46.00	0	0	75.2417	C
733	3	Coutts, Master. Eden Leslie "Neville"	male	9.00	1	1	15.9000	S
652	3	Augustsson, Mr. Albert	male	23.00	0	0	7.8542	S

6-1. Simple median imputation

In [21]:

from sklearn.impute import SimpleImputer

median_imputer = SimpleImputer(strategy="median")

In [22]:

# select age by name instead of by position, and flatten the (n, 1) result to one dimension
feature["age_median"] = median_imputer.fit_transform(feature[["age"]]).ravel()

In [23]:

feature.age_median.isnull().sum()

Out[23]:

0

In [24]:

feature.head()

Out[24]:

	pclass	name	gender	age	sibsp	parch	fare	embarked	age_median
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	211.3375	S	29.0000
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	151.5500	S	0.9167
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	151.5500	S	2.0000
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	151.5500	S	30.0000
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	151.5500	S	25.0000

6-2. Imputing with the median by title in the name

I noticed that the names contain "Mr", "Miss", and "Mrs". Much like label encoding, I decided to assign "Mr": 0, "Miss": 1, "Mrs": 2, and everything else: 3.

title

In [25]:

title = list()
for x in feature.name:
    # take the text between the comma and the first period, e.g. ' Miss' (the leading space is kept)
    result = x.split(",")[1].split('.')[0]
    title.append(result)

In [26]:

feature["title"] = title
feature.head(1)

Out[26]:

	pclass	name	gender	age	sibsp	parch	fare	embarked	age_median	title
0	1	Allen, Miss. Elisabeth Walton	female	29.0	0	0	211.3375	S	29.0	Miss

In [27]:

feature.title.value_counts()

Out[27]:

 Mr              757
 Miss            260
 Mrs             197
 Master           61
 Rev               8
 Dr                8
 Col               4
 Mlle              2
 Ms                2
 Major             2
 Capt              1
 Sir               1
 Dona              1
 Jonkheer          1
 the Countess      1
 Don               1
 Mme               1
 Lady              1
Name: title, dtype: int64

In [28]:

feature.title.unique()

Out[28]:

array([' Miss', ' Master', ' Mr', ' Mrs', ' Col', ' Mme', ' Dr', ' Major',
       ' Capt', ' Lady', ' Sir', ' Mlle', ' Dona', ' Jonkheer',
       ' the Countess', ' Don', ' Rev', ' Ms'], dtype=object)
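The titles above all carry a leading space because the split keeps the blank after the comma. A minimal sketch of a vectorized variant that also strips that space, using four names from the head() output above (the rest of this post keeps the leading space, so this is an alternative rather than the method used here):

```python
import pandas as pd

# Four names taken from the head() output shown earlier in this post.
names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mr. Hudson Joshua Creighton",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
])

# Same logic as the loop: text after the comma, up to the first period,
# then strip the surrounding whitespace.
titles = (
    names.str.split(",").str[1]
         .str.split(".").str[0]
         .str.strip()
)

print(titles.tolist())
```

If the stripped form were used, the keys of any mapping dictionary would need to drop the leading space as well.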

Let's label-encode it!

In [29]:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
sub = le.fit_transform(feature.title)

In [30]:

le.classes_

Out[30]:

array([' Capt', ' Col', ' Don', ' Dona', ' Dr', ' Jonkheer', ' Lady',
       ' Major', ' Master', ' Miss', ' Mlle', ' Mme', ' Mr', ' Mrs',
       ' Ms', ' Rev', ' Sir', ' the Countess'], dtype=object)

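==> LabelEncoder assigns codes in alphabetical order, but I want them to follow the value_counts() order, so let's specify them with a dictionary!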

In [31]:

title_num = {' Miss':1, ' Mr':0, ' Mrs':2, 
             ' Master':3,' Col':3, ' Mme':3, ' Dr':3, ' Major':3,
             ' Capt':3, ' Lady':3, ' Sir':3, ' Mlle':3, ' Dona':3,
             ' Jonkheer':3, ' the Countess':3, ' Don':3, ' Rev':3, ' Ms':3}

In [32]:

feature["title_ec"] = feature.title.map(title_num)

In [33]:

feature.title_ec.isnull().sum()

Out[33]:

0

In [34]:

feature["title_ec"] = feature["title_ec"].astype(int)

In [35]:

feature.head(3)

Out[35]:

	pclass	name	gender	age	sibsp	parch	fare	embarked	age_median	title	title_ec
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	211.3375	S	29.0000	Miss	1
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	151.5500	S	0.9167	Master	3
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	151.5500	S	2.0000	Miss	1
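The dictionary above spells out every rare title by hand. A shorter sketch of the same "everything else is 3" rule, assuming stripped titles and a hypothetical main_titles dictionary (neither is part of the original notebook): map only the three main titles and let fillna supply the default.

```python
import pandas as pd

# A few titles in the form produced by the split above (leading space included).
titles = pd.Series([" Miss", " Master", " Mr", " Mrs", " Dr", " Rev"])

# Hypothetical helper dictionary: only the three most common titles get their own code.
main_titles = {"Mr": 0, "Miss": 1, "Mrs": 2}

# Titles missing from the dictionary map to NaN, which fillna turns into 3.
title_ec = titles.str.strip().map(main_titles).fillna(3).astype(int)

print(title_ec.tolist())
```

One trade-off: a title that appears only in new data would silently fall into code 3, whereas the explicit dictionary would leave it as NaN and make it visible in the isnull() check.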

Now impute age using the encoded title

In [36]:

# median age for each title code
title_age_median = feature.groupby(feature.title_ec).age.transform("median")
title_age_median

Out[36]:

0       22.0
1        9.0
2       22.0
3       29.0
4       35.5
        ... 
1304    22.0
1305    22.0
1306    29.0
1307    29.0
1308    29.0
Name: age, Length: 1309, dtype: float64

In [37]:

# fill in the missing values
feature["age_title_median"] = feature.age.fillna(title_age_median)

In [38]:

# drop age, name, and title, since they will not be needed from here on
feature.drop(columns=["age","name","title"],inplace=True)

In [39]:

feature.head()

Out[39]:

	pclass	gender	sibsp	parch	fare	embarked	age_median	title_ec	age_title_median
0	1	female	0	0	211.3375	S	29.0000	1	29.0000
1	1	male	1	2	151.5500	S	0.9167	3	0.9167
2	1	female	1	2	151.5500	S	2.0000	1	2.0000
3	1	male	1	2	151.5500	S	30.0000	0	30.0000
4	1	female	1	2	151.5500	S	25.0000	2	25.0000
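The difference between the two age columns only shows up on rows where age was missing. A minimal sketch on made-up toy rows (values and column names other than title_ec and age are assumptions) comparing a single global median with a per-title median:

```python
import numpy as np
import pandas as pd

# Toy data (made up): code 3 stands in for a group of young passengers.
toy = pd.DataFrame({
    "title_ec": [0, 0, 0, 1, 1, 3, 3],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

# 6-1 style: one median for everybody.
toy["age_global"] = toy["age"].fillna(toy["age"].median())

# 6-2 style: the median of the row's own title group.
title_median = toy.groupby("title_ec")["age"].transform("median")
toy["age_by_title"] = toy["age"].fillna(title_median)

print(toy)
```

In this toy frame the last row gets the global median under the first approach but the much lower median of its own group under the second, which is the motivation for imputing by title.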

7. Impute - Embarked

In [40]:

feature.embarked.isnull().sum()

Out[40]:

2

I think the port city and pclass are linked,
because on a passenger liner the port of embarkation can be an indicator of wealth.

In [41]:

# check the pclass of the rows with a missing value
feature[feature.embarked.isnull()].pclass

Out[41]:

168    1
284    1
Name: pclass, dtype: int64

In [42]:

plt.figure(figsize=(10,5))
sns.countplot(x="pclass", hue="embarked", palette="Set3",
         data=feature[["pclass","embarked"]], dodge=False)

Out[42]:

<AxesSubplot:xlabel='pclass', ylabel='count'>

For pclass 1, the class of both missing rows, most passengers boarded at S (Southampton), so let's fill them with S!

In [43]:

# Simply assigning "S" would do, but that is no fun, so let's try the scikit-learn way.

from sklearn.impute import SimpleImputer

# most_frequent fills with the mode of the whole column (not per pclass)
freq_imputer = SimpleImputer(strategy="most_frequent")
feature["embarked"] = freq_imputer.fit_transform(feature[["embarked"]]).ravel()

In [44]:

feature.embarked.isnull().sum()

Out[44]:

0
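The reasoning in this section is per pclass, while SimpleImputer(strategy="most_frequent") uses the mode of the entire column. A minimal sketch of a per-class mode fill on made-up toy rows (the data and the embarked_filled column name are assumptions for illustration):

```python
import pandas as pd

# Toy data (made up): one missing port in pclass 1 and one in pclass 3.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "embarked": ["S", "S", "C", None, "Q", "Q", "S", None],
})

# Most frequent port within each pclass, broadcast back to every row of the group.
# Series.mode() ignores missing values by default.
class_mode = toy.groupby("pclass")["embarked"].transform(lambda s: s.mode().iloc[0])

toy["embarked_filled"] = toy["embarked"].fillna(class_mode)

print(toy)
```

In the toy frame the overall mode is S, yet the missing pclass 3 row is filled with Q because that is the most common port within its own class.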

8. Feature heatmap visualization

In [45]:

plt.figure(figsize=(7,7))
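# numeric_only=True skips the string columns (gender, embarked), which recent pandas would otherwise reject
sns.heatmap(feature.corr(numeric_only=True), linewidths=0.1, annot=True, cmap="YlGnBu")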

Out[45]:

<AxesSubplot:>

8-1. Visualization - survivor counts by gender

In [46]:

# pass the data as a keyword; recent seaborn no longer accepts it positionally
sns.countplot(x=feature.gender, hue=label, palette="Set2")

Out[46]:

<AxesSubplot:xlabel='gender', ylabel='count'>

8-2. Visualization - survival by passenger class

In [47]:

# bivariate KDE of pclass against survived, with both variables passed as keywords
sns.kdeplot(x=feature.pclass, y=label)

Out[47]:

<AxesSubplot:xlabel='pclass', ylabel='survived'>

9. Splitting X/y

In [48]:

# already done above, so skip

10. Data transformation (one hot encoding)

In [49]:

feature.head(3)

Out[49]:

	pclass	gender	sibsp	parch	fare	embarked	age_median	title_ec	age_title_median
0	1	female	0	0	211.3375	S	29.0000	1	29.0000
1	1	male	1	2	151.5500	S	0.9167	3	0.9167
2	1	female	1	2	151.5500	S	2.0000	1	2.0000

Gender [1] and port of embarkation [5] need encoding! This time, let's apply one-hot encoding!

In [51]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([("ohe", OneHotEncoder(), [1,5])],
                      remainder='passthrough')

X= ct.fit_transform(feature)
print(X)

[[ 1.      0.      0.     ... 29.      1.     29.    ]
 [ 0.      1.      0.     ...  0.9167  3.      0.9167]
 [ 1.      0.      0.     ...  2.      1.      2.    ]
 ...
 [ 0.      1.      1.     ... 26.5     0.     26.5   ]
 [ 0.      1.      1.     ... 27.      0.     27.    ]
 [ 0.      1.      0.     ... 29.      0.     29.    ]]

In [52]:

X[0]

Out[52]:

array([  1.    ,   0.    ,   0.    ,   0.    ,   1.    ,   1.    ,
         0.    ,   0.    , 211.3375,  29.    ,   1.    ,  29.    ])

11. Splitting into training and test sets

In [53]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, label,
                                                   test_size=0.2,
                                                   random_state=42)

In [54]:

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1047, 12)
(262, 12)
(1047,)
(262,)
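The split above is purely random, so the share of survivors can differ slightly between the two sets. A minimal sketch of a stratified split on made-up labels (the stratify argument is an optional addition, not something the original notebook uses):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 samples, 20% of them in the positive class.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

For the Titanic split this would correspond to passing stratify=label to the same train_test_split call.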

12. Training the model

In [55]:

feature

Out[55]:

	pclass	gender	sibsp	parch	fare	embarked	age_median	title_ec	age_title_median
0	1	female	0	0	211.3375	S	29.0000	1	29.0000
1	1	male	1	2	151.5500	S	0.9167	3	0.9167
2	1	female	1	2	151.5500	S	2.0000	1	2.0000
3	1	male	1	2	151.5500	S	30.0000	0	30.0000
4	1	female	1	2	151.5500	S	25.0000	2	25.0000
...	...	...	...	...	...	...	...	...	...
1304	3	female	1	0	14.4542	C	14.5000	1	14.5000
1305	3	female	1	0	14.4542	C	28.0000	1	22.0000
1306	3	male	0	0	7.2250	C	26.5000	0	26.5000
1307	3	male	0	0	7.2250	C	27.0000	0	27.0000
1308	3	male	0	0	7.8750	S	29.0000	0	29.0000

1309 rows × 9 columns

In [56]:

feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   pclass            1309 non-null   int64  
 1   gender            1309 non-null   object 
 2   sibsp             1309 non-null   int64  
 3   parch             1309 non-null   int64  
 4   fare              1309 non-null   float64
 5   embarked          1309 non-null   object 
 6   age_median        1309 non-null   float64
 7   title_ec          1309 non-null   int64  
 8   age_title_median  1309 non-null   float64
dtypes: float64(3), int64(4), object(2)
memory usage: 92.2+ KB

In [57]:

feature.isnull().sum()

Out[57]:

pclass              0
gender              0
sibsp               0
parch               0
fare                0
embarked            0
age_median          0
title_ec            0
age_title_median    0
dtype: int64

In [58]:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

Out[58]:

DecisionTreeClassifier()
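A DecisionTreeClassifier with default settings grows until every leaf is pure, and without random_state its result can vary between runs. The sketch below is not the model trained here: it uses synthetic data from make_classification and hypothetical settings (max_depth=3, 5-fold cross-validation) only to show how a depth limit and a fixed seed could be tried out.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), not the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Limit the depth to reduce overfitting and fix the seed for reproducibility.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Accuracy on 5 cross-validation folds instead of a single train/test split.
scores = cross_val_score(shallow_tree, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate than one 80/20 split, at the cost of fitting the model several times.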

13. Checking model performance (evaluate)

In [60]:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = tree.predict(X_test)

acc= accuracy_score(y_test, y_pred)
prec= precision_score(y_test, y_pred)
reca= recall_score(y_test, y_pred)

print(acc)
print(prec)
print(reca)

0.7442748091603053
0.7339449541284404
0.6779661016949152

13-1. Checking the confusion matrix

In [61]:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="g")

Out[61]:

<AxesSubplot:>
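The three scores in section 13 are all ratios of the four cells of the confusion matrix. The heatmap values are not reproduced in this text, so the counts below are back-computed from the reported scores and the 262 test rows; treat them as an inference, not as numbers read from the original figure.

```python
# Cell counts inferred from the reported accuracy, precision, and recall
# (an assumption; the original heatmap values are not shown in the text).
tp, fp, fn, tn = 80, 29, 38, 115
total = tp + fp + fn + tn  # 262 test rows

accuracy = (tp + tn) / total   # correct predictions over all predictions
precision = tp / (tp + fp)     # predicted survivors who actually survived
recall = tp / (tp + fn)        # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

These counts reproduce the reported accuracy, precision, and recall up to floating-point rounding, and they imply an F1 score of roughly 0.705.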

😊


코딩하는 간호사
