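You have probably run into the Titanic data many times on Kaggle or in Seaborn!
But those are all edited versions of the data!
I got hold of the original Titanic dataset, so it is time for another round of proper EDA!
Looking through it, some of the column names are a little grim... body, for example, which looks like some kind of number...
Titanic Passenger Survival Prediction: Classification with Python¶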
In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings(action='ignore')
In [2]:
os.listdir()
Out[2]:
['01SR_Data.csv',
'02.Classification_with_Python.ipynb',
'03.Classification_with_scikitlearn(Titanic).ipynb',
'.ipynb_checkpoints',
'01.Regression_with_Python.ipynb',
'03Titanic_dataset.csv',
'02Social_Network_Ads.csv',
'Pandas_Slicing_Practice.ipynb']
1. Loading the data¶
In [3]:
df = pd.read_csv("03Titanic_dataset.csv")
df.head()
Out[3]:
| | pclass | survived | name | gender | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
2. Inspecting the data¶
In [4]:
df.columns
Out[4]:
Index(['pclass', 'survived', 'name', 'gender', 'age', 'sibsp', 'parch',
'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
In [5]:
df.shape
Out[5]:
(1309, 14)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 gender 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
3. Checking for missing values¶
In [7]:
df.isnull().sum()
Out[7]:
pclass 0
survived 0
name 0
gender 0
age 263
sibsp 0
parch 0
ticket 0
fare 1
cabin 1014
embarked 2
boat 823
body 1188
home.dest 564
dtype: int64
4. Removing unused features¶
4-1. Separate the label and features first¶
In [8]:
df.head(1)
Out[8]:
| | pclass | survived | name | gender | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
In [9]:
label = df.survived
feature = df.drop(columns="survived")
In [10]:
feature.head()
Out[10]:
| | pclass | name | gender | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
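4-2. Removing unneeded columns¶
- Drop columns that have many missing values and carry little meaning
- Consider dropping columns where roughly half or more of the values are missing: cabin, boat, body, home.dest
- cabin is related to survival, but the ship's layout is hard to work out, so drop it
- boat: whether the passenger survived matters more than the lifeboat number, so drop it
- body: survival can already be read from survived, so drop it
- home.dest: I considered deriving a wealthy-city indicator from the home and destination cities to assess its effect on survival, but pclass or cabin should capture that link more directly, so drop it
==> drop cabin, boat, body, home.dest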
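The "roughly half missing" rule can be checked numerically rather than by eye. The sketch below uses a small made-up frame and a hypothetical cut-off of 0.4 (neither comes from the notebook). With the counts printed by `df.isnull().sum()` above, the same cut-off would flag cabin (1014/1309), boat (823/1309), body (1188/1309) and home.dest (564/1309), while age (263/1309) stays.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) standing in for the Titanic features.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "age":    [29.0, np.nan, 30.0, 25.0, np.nan, 40.0],
    "cabin":  ["B5", np.nan, np.nan, np.nan, np.nan, "C22"],
    "body":   [np.nan, np.nan, np.nan, 135.0, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns where about half or more is missing.
THRESHOLD = 0.4
drop_candidates = missing_ratio[missing_ratio >= THRESHOLD].index.tolist()

print(missing_ratio)
print(drop_candidates)
```

The ratio only nominates candidates. Whether a column is actually dropped still depends on the reasoning in the list above.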
In [11]:
feature.shape
Out[11]:
(1309, 13)
In [12]:
# preview the drop before applying it
feature.drop(columns=["cabin","boat","body","home.dest"]).head()
Out[12]:
| | pclass | name | gender | age | sibsp | parch | ticket | fare | embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | S |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | S |
2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | S |
3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | S |
4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | S |
In [13]:
feature = feature.drop(columns=["cabin","boat","body","home.dest"])
# check that the columns were dropped
feature.shape
Out[13]:
(1309, 9)
Drop one more unneeded column
In [14]:
# drop ticket
feature.drop(columns="ticket", inplace=True)
5. Impute - Fare¶
The missing value will be filled with the median fare of the passenger's pclass.
In [15]:
feature.groupby(feature.pclass).fare.median()
Out[15]:
pclass
1 60.0000
2 15.0458
3 8.0500
Name: fare, dtype: float64
In [16]:
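# assign the result back; an in-place fillna on feature.fare may act on a copy in recent pandas
feature["fare"] = feature["fare"].fillna(feature.groupby("pclass")["fare"].transform("median"))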
In [17]:
feature.fare.isnull().sum()
Out[17]:
0
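The fill step uses `transform("median")` rather than the per-class table from In [15] because `fillna` aligns on the row index. A minimal sketch on made-up fares (the values are illustrative, not taken from the dataset) shows the difference between the two results:

```python
import numpy as np
import pandas as pd

# Toy data: one missing fare in each class.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare":   [60.0, 80.0, np.nan, 7.0, 9.0, np.nan],
})

# One median per class: indexed by pclass, length 2.
per_class = toy.groupby("pclass")["fare"].median()

# The same medians broadcast back to every row: indexed like toy, length 6.
row_aligned = toy.groupby("pclass")["fare"].transform("median")

# fillna aligns on the row index, so each gap gets its own class median.
toy["fare_filled"] = toy["fare"].fillna(row_aligned)
print(per_class)
print(toy)
```

Passing `per_class` to `fillna` directly would align on the wrong index (class labels instead of row labels), which is why the row-aligned version is used.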
In [18]:
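# distplot is deprecated in recent seaborn; histplot with a KDE overlay is its replacement
sns.histplot(df.fare, kde=True, stat="density")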
Out[18]:
<AxesSubplot:xlabel='fare', ylabel='Density'>
6. Impute - Age¶
In [19]:
feature.age.isnull().sum()
Out[19]:
263
In [20]:
feature.sample(5)
Out[20]:
| | pclass | name | gender | age | sibsp | parch | fare | embarked |
---|---|---|---|---|---|---|---|---|
658 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 19.2583 | C |
1270 | 3 | Vande Walle, Mr. Nestor Cyriel | male | 28.00 | 0 | 0 | 9.5000 | S |
200 | 1 | McCaffry, Mr. Thomas Francis | male | 46.00 | 0 | 0 | 75.2417 | C |
733 | 3 | Coutts, Master. Eden Leslie "Neville" | male | 9.00 | 1 | 1 | 15.9000 | S |
652 | 3 | Augustsson, Mr. Albert | male | 23.00 | 0 | 0 | 7.8542 | S |
6-1. Impute with the plain overall median¶
In [21]:
from sklearn.impute import SimpleImputer
median_imputer = SimpleImputer(strategy="median")
In [22]:
feature["age_median"] = median_imputer.fit_transform(feature.iloc[:,3:4])
In [23]:
feature.age_median.isnull().sum()
Out[23]:
0
In [24]:
feature.head()
Out[24]:
| | pclass | name | gender | age | sibsp | parch | fare | embarked | age_median |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 211.3375 | S | 29.0000 |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 151.5500 | S | 0.9167 |
2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 151.5500 | S | 2.0000 |
3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 151.5500 | S | 30.0000 |
4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 151.5500 | S | 25.0000 |
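6-2. Impute with the median by title in the name¶
The idea came from noticing that names contain "Mr", "Miss" and "Mrs". As with label encoding, I decided to assign "Mr": 0, "Miss": 1, "Mrs": 2, and everything else: 3.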
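As a rough sketch of this idea, the title can also be pulled out with a vectorized regular expression instead of a loop. The names are the first rows shown in `df.head()`. The regex and the `fillna(3)` fallback are my own illustration, not the notebook's code.

```python
import pandas as pd

names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mr. Hudson Joshua Creighton",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
])

# Capture the text between the comma and the first period, then trim spaces.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mr/Miss/Mrs get their own code; any other title falls back to 3.
title_codes = titles.map({"Mr": 0, "Miss": 1, "Mrs": 2}).fillna(3).astype(int)

print(titles.tolist())
print(title_codes.tolist())
```

Unlike the split-based loop in the notebook, this version strips the leading space, so the mapping keys are written without one.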
title¶
In [25]:
title = list()
for x in feature.name:
    # "Allen, Miss. Elisabeth Walton" -> " Miss" (the leading space is kept)
    result = x.split(",")[1].split('.')[0]
    title.append(result)
In [26]:
feature["title"] = title
feature.head(1)
Out[26]:
| | pclass | name | gender | age | sibsp | parch | fare | embarked | age_median | title |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 211.3375 | S | 29.0 | Miss |
In [27]:
feature.title.value_counts()
Out[27]:
Mr 757
Miss 260
Mrs 197
Master 61
Rev 8
Dr 8
Col 4
Mlle 2
Ms 2
Major 2
Capt 1
Sir 1
Dona 1
Jonkheer 1
the Countess 1
Don 1
Mme 1
Lady 1
Name: title, dtype: int64
In [28]:
feature.title.unique()
Out[28]:
array([' Miss', ' Master', ' Mr', ' Mrs', ' Col', ' Mme', ' Dr', ' Major',
' Capt', ' Lady', ' Sir', ' Mlle', ' Dona', ' Jonkheer',
' the Countess', ' Don', ' Rev', ' Ms'], dtype=object)
Let's label-encode it!
In [29]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
sub = le.fit_transform(feature.title)
In [30]:
le.classes_
Out[30]:
array([' Capt', ' Col', ' Don', ' Dona', ' Dr', ' Jonkheer', ' Lady',
' Major', ' Master', ' Miss', ' Mlle', ' Mme', ' Mr', ' Mrs',
' Ms', ' Rev', ' Sir', ' the Countess'], dtype=object)
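==> LabelEncoder numbers the classes alphabetically, but I want the codes to follow the value_counts() order, so I specify the mapping with a dictionary instead!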
In [31]:
title_num = {' Miss':1, ' Mr':0, ' Mrs':2,
' Master':3,' Col':3, ' Mme':3, ' Dr':3, ' Major':3,
' Capt':3, ' Lady':3, ' Sir':3, ' Mlle':3, ' Dona':3,
' Jonkheer':3, ' the Countess':3, ' Don':3, ' Rev':3, ' Ms':3}
In [32]:
feature["title_ec"] = feature.title.map(title_num)
In [33]:
feature.title_ec.isnull().sum()
Out[33]:
0
In [34]:
feature["title_ec"] = feature["title_ec"].astype(int)
In [35]:
feature.head(3)
Out[35]:
| | pclass | name | gender | age | sibsp | parch | fare | embarked | age_median | title | title_ec |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 211.3375 | S | 29.0000 | Miss | 1 |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 151.5500 | S | 0.9167 | Master | 3 |
2 | 1 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 151.5500 | S | 2.0000 | Miss | 1 |
Now impute age using the encoded title¶
In [36]:
# median age for each title group, aligned to every row
title_age_median = feature.groupby(feature.title_ec).age.transform("median")
title_age_median
Out[36]:
0 22.0
1 9.0
2 22.0
3 29.0
4 35.5
...
1304 22.0
1305 22.0
1306 29.0
1307 29.0
1308 29.0
Name: age, Length: 1309, dtype: float64
In [37]:
# fill in the missing values
feature["age_title_median"] = feature.age.fillna(title_age_median)
In [38]:
# drop age, name and title; they will not be needed from here on
feature.drop(columns=["age","name","title"],inplace=True)
In [39]:
feature.head()
Out[39]:
| | pclass | gender | sibsp | parch | fare | embarked | age_median | title_ec | age_title_median |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | 0 | 0 | 211.3375 | S | 29.0000 | 1 | 29.0000 |
1 | 1 | male | 1 | 2 | 151.5500 | S | 0.9167 | 3 | 0.9167 |
2 | 1 | female | 1 | 2 | 151.5500 | S | 2.0000 | 1 | 2.0000 |
3 | 1 | male | 1 | 2 | 151.5500 | S | 30.0000 | 0 | 30.0000 |
4 | 1 | female | 1 | 2 | 151.5500 | S | 25.0000 | 2 | 25.0000 |
7. Impute - Embarked¶
In [40]:
feature.embarked.isnull().sum()
Out[40]:
2
- I think the port of embarkation is linked to pclass
- because on a passenger liner, where someone boards can be an indicator of wealth
In [41]:
# check the pclass of the rows where embarked is missing
feature[feature.embarked.isnull()].pclass
Out[41]:
168 1
284 1
Name: pclass, dtype: int64
In [42]:
plt.figure(figsize=(10,5))
sns.countplot(x="pclass", hue="embarked", palette="Set3",
data=feature[["pclass","embarked"]], dodge=False)
Out[42]:
<AxesSubplot:xlabel='pclass', ylabel='count'>
- Both missing rows are pclass 1, and the plot shows that S (Southampton) is the most common port there, as it is in the other classes, so impute with S!
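The link between class and port can also be read from a table instead of the plot. This is a sketch on made-up rows (not the real counts) that tabulates ports per class and fills each gap with the most frequent port of that row's own class:

```python
import numpy as np
import pandas as pd

# Toy data: one pclass-1 passenger has no recorded port.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "embarked": ["S", "S", "C", np.nan, "S", "S", "Q", "S", "S"],
})

# Counts of each port within each class (rows with NaN are left out).
counts = pd.crosstab(toy["pclass"], toy["embarked"])

# Most frequent port per class, taken from the row-wise maximum.
mode_by_class = counts.idxmax(axis=1)

# Fill each gap with the mode of that row's own class.
toy["embarked_filled"] = toy["embarked"].fillna(toy["pclass"].map(mode_by_class))

print(counts)
print(toy)
```

When one port dominates every class, as in this toy frame, the class-wise fill and a global most-frequent fill give the same answer.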
In [43]:
# Assigning S directly would do, but that is no fun, so let's do it the scikit-learn way
from sklearn.impute import SimpleImputer
freq_imputer = SimpleImputer(strategy="most_frequent")
feature.embarked = freq_imputer.fit_transform(feature.loc[:,"embarked"].to_frame())
In [44]:
feature.embarked.isnull().sum()
Out[44]:
0
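8. Feature heatmap visualization¶
In [45]:
plt.figure(figsize=(7,7))
# gender and embarked are still strings here, so restrict the correlation to numeric columns
sns.heatmap(feature.corr(numeric_only=True),linewidths=0.1, annot=True, cmap="YlGnBu")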
Out[45]:
<AxesSubplot:>
8-1. Visualization: survivor counts by gender¶
In [46]:
sns.countplot(x=feature.gender, hue=label, palette="Set2")
Out[46]:
<AxesSubplot:xlabel='gender', ylabel='count'>
8-2. Visualization: survival by passenger class¶
In [47]:
sns.kdeplot(x=feature.pclass, y=label)
Out[47]:
<AxesSubplot:xlabel='pclass', ylabel='survived'>
9. Splitting X and y¶
In [48]:
# already done in section 4-1, so skip
10. Data transformation (one-hot encoding)¶
In [49]:
feature.head(3)
Out[49]:
| | pclass | gender | sibsp | parch | fare | embarked | age_median | title_ec | age_title_median |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | 0 | 0 | 211.3375 | S | 29.0000 | 1 | 29.0000 |
1 | 1 | male | 1 | 2 | 151.5500 | S | 0.9167 | 3 | 0.9167 |
2 | 1 | female | 1 | 2 | 151.5500 | S | 2.0000 | 1 | 2.0000 |
gender (column index 1) and embarked (column index 5) need encoding! This time, apply one-hot encoding!
In [51]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([("ohe", OneHotEncoder(), [1,5])],
remainder='passthrough')
X= ct.fit_transform(feature)
print(X)
[[ 1. 0. 0. ... 29. 1. 29. ]
[ 0. 1. 0. ... 0.9167 3. 0.9167]
[ 1. 0. 0. ... 2. 1. 2. ]
...
[ 0. 1. 1. ... 26.5 0. 26.5 ]
[ 0. 1. 1. ... 27. 0. 27. ]
[ 0. 1. 0. ... 29. 0. 29. ]]
In [52]:
X[0]
Out[52]:
array([ 1. , 0. , 0. , 0. , 1. , 1. ,
0. , 0. , 211.3375, 29. , 1. , 29. ])
11. Splitting into training and test sets¶
In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, label,
test_size=0.2,
random_state=42)
In [54]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(1047, 12)
(262, 12)
(1047,)
(262,)
12. Training the model¶
In [55]:
feature
Out[55]:
| | pclass | gender | sibsp | parch | fare | embarked | age_median | title_ec | age_title_median |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | 0 | 0 | 211.3375 | S | 29.0000 | 1 | 29.0000 |
1 | 1 | male | 1 | 2 | 151.5500 | S | 0.9167 | 3 | 0.9167 |
2 | 1 | female | 1 | 2 | 151.5500 | S | 2.0000 | 1 | 2.0000 |
3 | 1 | male | 1 | 2 | 151.5500 | S | 30.0000 | 0 | 30.0000 |
4 | 1 | female | 1 | 2 | 151.5500 | S | 25.0000 | 2 | 25.0000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1304 | 3 | female | 1 | 0 | 14.4542 | C | 14.5000 | 1 | 14.5000 |
1305 | 3 | female | 1 | 0 | 14.4542 | C | 28.0000 | 1 | 22.0000 |
1306 | 3 | male | 0 | 0 | 7.2250 | C | 26.5000 | 0 | 26.5000 |
1307 | 3 | male | 0 | 0 | 7.2250 | C | 27.0000 | 0 | 27.0000 |
1308 | 3 | male | 0 | 0 | 7.8750 | S | 29.0000 | 0 | 29.0000 |
1309 rows × 9 columns
In [56]:
feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 gender 1309 non-null object
2 sibsp 1309 non-null int64
3 parch 1309 non-null int64
4 fare 1309 non-null float64
5 embarked 1309 non-null object
6 age_median 1309 non-null float64
7 title_ec 1309 non-null int64
8 age_title_median 1309 non-null float64
dtypes: float64(3), int64(4), object(2)
memory usage: 92.2+ KB
In [57]:
feature.isnull().sum()
Out[57]:
pclass 0
gender 0
sibsp 0
parch 0
fare 0
embarked 0
age_median 0
title_ec 0
age_title_median 0
dtype: int64
In [58]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
Out[58]:
DecisionTreeClassifier()
13. Evaluating model performance¶
In [60]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_pred = tree.predict(X_test)
acc= accuracy_score(y_test, y_pred)
prec= precision_score(y_test, y_pred)
reca= recall_score(y_test, y_pred)
print(acc)
print(prec)
print(reca)
0.7442748091603053
0.7339449541284404
0.6779661016949152
13-1. Checking the confusion matrix¶
In [61]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="g")
Out[61]:
<AxesSubplot:>
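Since the heatmap image is not reproduced here, the four cell counts can be worked back from the scores printed in section 13. With 262 test rows, precision 0.7339... equals 80/109 and recall 0.6779... equals 80/118, which fixes the counts used below. They are derived from those printed numbers, not read from the notebook's matrix. The tree was fitted without a `random_state`, so a rerun may give different counts.

```python
# Counts back-derived from the printed scores (262 test rows):
# precision 80/109 and recall 80/118 fix tp, fp and fn; tn is the remainder.
tp, fp, fn = 80, 29, 38
tn = 262 - (tp + fp + fn)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
cm = [[tn, fp], [fn, tp]]
print(cm)
print(accuracy, precision, recall)
```

Recall is the weakest of the three: 38 of the 118 actual survivors in the test set were predicted as not surviving.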
😊