
[Pandas] Pandas Practice Problems -7 🐼

또방91 2022. 2. 24. 16:16


Pandas Practice Problems -7 🐼

Exercises

In [1]:

# dataframe
import pandas as pd

# population: number of inhabitants (in millions), area: land area, capital: capital city
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries = countries.set_index('country') # use the country name as the index
countries

Out[1]:

	population	area	capital
country
Belgium	11.3	30510	Brussels
France	64.3	671308	Paris
Germany	81.3	357050	Berlin
Netherlands	16.9	41526	Amsterdam
United Kingdom	64.9	244820	London

EXERCISE: Add a `density` column holding the population density. (Note: the 'population' column is currently expressed in millions.) (density = population / area)

In [7]:

countries["density"]=(countries.population*1000000)/countries.area

In [8]:

countries

Out[8]:

	population	area	capital	density
country
Belgium	11.3	30510	Brussels	370.370370
France	64.3	671308	Paris	95.783158
Germany	81.3	357050	Berlin	227.699202
Netherlands	16.9	41526	Amsterdam	406.973944
United Kingdom	64.9	244820	London	265.092721
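As a side note, the same column can be built without modifying `countries` in place. The sketch below uses `DataFrame.assign`; the name `with_density` is just an illustrative choice, and the data is retyped from the cell above.

```python
import pandas as pd

# Same data as the exercise above (capital column omitted for brevity)
countries = pd.DataFrame(
    {
        "country": ["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
        "population": [11.3, 64.3, 81.3, 16.9, 64.9],  # in millions
        "area": [30510, 671308, 357050, 41526, 244820],
    }
).set_index("country")

# assign() returns a new DataFrame and leaves `countries` untouched
with_density = countries.assign(
    density=lambda d: d["population"] * 1_000_000 / d["area"]
)
print(with_density["density"].round(2))
```

This form is handy when chaining several steps, because the original frame stays available for comparison.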

EXERCISE: Select the capital and population of the countries whose population density exceeds 300.

In [11]:

countries.loc[countries["density"]>300, ["capital","population"]]

Out[11]:

	capital	population
country
Belgium	Brussels	11.3
Netherlands	Amsterdam	16.9
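To make the mechanics visible, here is a small sketch that separates the boolean mask from the column selection and compares it with `DataFrame.query`. The variable names `mask`, `by_loc`, and `by_query` are illustrative only.

```python
import pandas as pd

countries = pd.DataFrame(
    {
        "population": [11.3, 64.3, 81.3, 16.9, 64.9],
        "area": [30510, 671308, 357050, 41526, 244820],
        "capital": ["Brussels", "Paris", "Berlin", "Amsterdam", "London"],
    },
    index=pd.Index(
        ["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
        name="country",
    ),
)
countries["density"] = countries["population"] * 1_000_000 / countries["area"]

# Step 1: a boolean Series with one True/False per row
mask = countries["density"] > 300

# Step 2: rows where the mask is True, restricted to two columns
by_loc = countries.loc[mask, ["capital", "population"]]

# Alternative: express the row condition as a string
by_query = countries.query("density > 300")[["capital", "population"]]
print(by_loc)
```

Both forms should select the same rows; `.loc` keeps the row filter and the column selection in a single indexing step.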

EXERCISE: Add a 'density_ratio' column. (density_ratio = density / mean density, where the mean density is the average of the density over all countries.)

In [12]:

countries

Out[12]:

	population	area	capital	density
country
Belgium	11.3	30510	Brussels	370.370370
France	64.3	671308	Paris	95.783158
Germany	81.3	357050	Berlin	227.699202
Netherlands	16.9	41526	Amsterdam	406.973944
United Kingdom	64.9	244820	London	265.092721

In [13]:

# the population density is in the density column

# mean density: the average of the density over all countries
countries.density.mean()

Out[13]:

273.1838790074409

In [15]:

countries['density_ratio'] = countries.density / countries.density.mean()
countries

Out[15]:

	population	area	capital	density	density_ratio
country
Belgium	11.3	30510	Brussels	370.370370	1.355755
France	64.3	671308	Paris	95.783158	0.350618
Germany	81.3	357050	Berlin	227.699202	0.833502
Netherlands	16.9	41526	Amsterdam	406.973944	1.489744
United Kingdom	64.9	244820	London	265.092721	0.970382
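The division above works because pandas broadcasts the scalar mean across every row of the Series. A quick sanity check, using the density values printed earlier (so they are rounded to six decimals), is that the ratios average to 1:

```python
import pandas as pd

# Density values copied from the output above (rounded to 6 decimals)
density = pd.Series(
    [370.370370, 95.783158, 227.699202, 406.973944, 265.092721],
    index=["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
    name="density",
)

mean_density = density.mean()      # a single float
ratio = density / mean_density     # the scalar is applied to every element

print(ratio.round(6))
print(ratio.mean())                # should be 1 up to floating-point error
```

A ratio above 1 means the country is denser than the average of the five, and below 1 means it is sparser.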

EXERCISE: Change the capital of the United Kingdom to 'Cambridge'.

In [17]:

countries.loc["United Kingdom","capital"]= 'Cambridge'
countries

Out[17]:

	population	area	capital	density	density_ratio
country
Belgium	11.3	30510	Brussels	370.370370	1.355755
France	64.3	671308	Paris	95.783158	0.350618
Germany	81.3	357050	Berlin	227.699202	0.833502
Netherlands	16.9	41526	Amsterdam	406.973944	1.489744
United Kingdom	64.9	244820	Cambridge	265.092721	0.970382
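For a single cell, `.loc[row, column]` addresses the value in one indexing step, which is why the assignment sticks. A minimal sketch on a two-row frame (the frame `capitals` is a cut-down stand-in for `countries`), also showing `.at`, the scalar-only accessor:

```python
import pandas as pd

capitals = pd.DataFrame(
    {"capital": ["Paris", "London"]},
    index=pd.Index(["France", "United Kingdom"], name="country"),
)

# Label-based assignment: row label first, column label second
capitals.loc["United Kingdom", "capital"] = "Cambridge"
after_loc = capitals.at["United Kingdom", "capital"]

# .at reads or writes exactly one cell; here it restores the original value
capitals.at["United Kingdom", "capital"] = "London"
after_at = capitals.at["United Kingdom", "capital"]

print(after_loc, after_at)
```

Chained indexing such as `capitals["capital"]["United Kingdom"] = ...` is best avoided, since it may end up writing to a temporary copy.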

EXERCISE: Show the countries whose population density is greater than 100 and less than 300.

In [19]:

countries.loc[(100<countries.density)&(countries.density<300)]

Out[19]:

	population	area	capital	density	density_ratio
country
Germany	81.3	357050	Berlin	227.699202	0.833502
United Kingdom	64.9	244820	Cambridge	265.092721	0.970382
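The two comparisons joined with `&` can also be written with `Series.between`. The sketch below assumes pandas 1.3 or later, where `inclusive` accepts the string `"neither"` for strict bounds; the density values are copied from the outputs above.

```python
import pandas as pd

density = pd.Series(
    [370.370370, 95.783158, 227.699202, 406.973944, 265.092721],
    index=["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
    name="density",
)

# Each comparison needs its own parentheses because & binds tighter than < and >
mask_ops = (100 < density) & (density < 300)

# Same condition; "neither" excludes both endpoints
mask_between = density.between(100, 300, inclusive="neither")

print(density[mask_between])
```

Note that `between` includes both endpoints by default, so the `inclusive` argument matters when the exercise asks for strict inequalities.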

EXERCISE: Show the countries whose capital has 7 or more characters. (Hint: use the string len().)

In [25]:

countries[countries.capital.str.len()>=7]

Out[25]:

	population	area	capital	density	density_ratio
country
Belgium	11.3	30510	Brussels	370.370370	1.355755
Netherlands	16.9	41526	Amsterdam	406.973944	1.489744
United Kingdom	64.9	244820	Cambridge	265.092721	0.970382

EXERCISE: Show the countries whose capital contains 'am'. (Hint: use the string contains().)

In [26]:

countries[countries.capital.str.contains("am")]

Out[26]:

	population	area	capital	density	density_ratio
country
Netherlands	16.9	41526	Amsterdam	406.973944	1.489744
United Kingdom	64.9	244820	Cambridge	265.092721	0.970382

Exercises

In [34]:

import os
os.listdir('./data')

Out[34]:

['2014-baby-names-illinois.csv',
 '2015-baby-names-illinois.csv',
 'billboard.csv',
 'country_timeseries.csv',
 'nav_2018.csv',
 'pew.csv',
 'stock price.xlsx',
 'stock valuation.xlsx',
 'tb-raw.csv',
 'titles.csv',
 'weather.csv']

In [35]:

import pandas as pd
titles = pd.read_csv('./data/titles.csv')
titles.head()

Out[35]:

	title	year
0	The Rising Son	1990
1	Ashes of Kukulcan	2016
2	The Thousand Plane Raid	1969
3	Crucea de piatra	1993
4	The 86	2015

EXERCISE: Show the two earliest-produced movies in titles.

In [46]:

titles.sort_values(by="year").head(2)

Out[46]:

	title	year
165182	Miss Jerry	1894
85708	Reproduction of the Corbett and Fitzsimmons Fight	1897
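Sorting the whole frame just to take two rows works, and `DataFrame.nsmallest` expresses the same intent directly. The sketch below uses a handful of rows copied from the outputs above as a stand-in for `titles.csv`, so the index labels differ from the real file.

```python
import pandas as pd

# Toy subset of titles.csv, built from rows shown in the outputs above
titles = pd.DataFrame(
    {
        "title": [
            "The Rising Son",
            "Ashes of Kukulcan",
            "Miss Jerry",
            "Reproduction of the Corbett and Fitzsimmons Fight",
            "The 86",
        ],
        "year": [1990, 2016, 1894, 1897, 2015],
    }
)

by_sort = titles.sort_values(by="year").head(2)
by_nsmallest = titles.nsmallest(2, "year")   # two rows with the smallest year

print(by_nsmallest)
```

On ties the two approaches may order rows differently, so it is worth checking the result when several movies share the earliest year.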

EXERCISE: How many movies have the title "Hamlet"?

In [69]:

(titles.title=="Hamlet").sum()

Out[69]:

In [93]:

# equivalent expression
len(titles[titles['title'] == 'Hamlet'])

Out[93]:

EXERCISE: Show the movies titled "Treasure Island" in ascending order of production year.

In [81]:

titles[titles.title=="Treasure Island"].sort_values(by="year")

Out[81]:

	title	year
191379	Treasure Island	1918
47769	Treasure Island	1920
192917	Treasure Island	1934
90175	Treasure Island	1950
104714	Treasure Island	1972
103646	Treasure Island	1973
190792	Treasure Island	1985
166675	Treasure Island	1999
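The counting exercises in this section all rest on the same idea: a comparison yields a boolean Series, and `True` counts as 1 when summed. Here is a sketch on a toy frame; the rows are copied from the outputs above, and the counts describe only this toy subset, not the full `titles.csv`.

```python
import pandas as pd

# Toy subset; not the full dataset
titles = pd.DataFrame(
    {
        "title": [
            "Treasure Island",
            "The 86",
            "Treasure Island",
            "Miss Jerry",
            "Treasure Island",
        ],
        "year": [1918, 2015, 1920, 1894, 1934],
    }
)

mask = titles["title"] == "Treasure Island"

n_sum = int(mask.sum())          # True is counted as 1
n_len = len(titles[mask])        # number of rows that survive the filter
n_counts = int(titles["title"].value_counts()["Treasure Island"])

print(n_sum, n_len, n_counts)
```

All three give the same number; `mask.sum()` avoids building the filtered frame, which can matter on large files.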

EXERCISE: How many movies were made between 1950 and 1959? (1950 <= year <= 1959)

In [89]:

titles.loc[(titles.year>=1950)&(titles.year<=1959),"title"].count()

Out[89]:

In [92]:

# equivalent expression
len(titles[(titles['year'] >= 1950) & (titles['year'] <= 1959)])

Out[92]:

Exercises

In [32]:

import seaborn as sns

In [33]:

df=sns.load_dataset("titanic")
df

Out[33]:

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 15 columns

The Titanic dataset has been loaded into a pandas DataFrame named df. Use this data for the exercises below.

EXERCISE: Use groupby() to compute the mean age for each sex.

In [96]:

df.groupby("sex").age.mean()

Out[96]:

sex
female    27.915709
male      30.726645
Name: age, dtype: float64
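What `groupby(...).mean()` does can be spelled out by hand: split the rows by key, compute the mean within each part, and combine the results. The sketch below uses a small hypothetical frame (`df_small`, not the real Titanic data) that includes one missing age.

```python
import pandas as pd

# Hypothetical rows for illustration only
df_small = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male", "female"],
        "age": [22.0, 38.0, 26.0, 35.0, float("nan")],
    }
)

# One step: split by sex, average age within each group
grouped = df_small.groupby("sex")["age"].mean()

# The same thing by hand: iterate over (key, sub-frame) pairs
manual = {sex: part["age"].mean() for sex, part in df_small.groupby("sex")}

print(grouped)
print(manual)
```

The missing age is skipped rather than treated as zero, so the female mean here is taken over two values, not three.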

In [98]:

# numeric_only=True skips text and categorical columns (needed on pandas 2.0+)
df.groupby("sex").mean(numeric_only=True)

Out[98]:

	survived	pclass	age	sibsp	parch	fare	adult_male	alone
sex
female	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818	0.000000	0.401274
male	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893	0.930676	0.712305

EXERCISE: Compute the overall survival rate of all passengers: total survivors / total passengers. Use the 'survived' column.

Total survivors

In [101]:

# total survivors, method 1: read it off the value counts
df.survived.value_counts() # => 342 survivors

Out[101]:

0    549
1    342
Name: survived, dtype: int64

In [103]:

# total survivors, method 2: sum the column
df.survived.sum()

Out[103]:

Total passengers

In [104]:

# total passengers, method 1: count
df.survived.count()

Out[104]:

In [109]:

# total passengers, method 2: length
len(df.survived)

Out[109]:

In [110]:

# overall survival rate: total survivors / total passengers

df.survived.sum() / df.survived.count()

Out[110]:

0.3838383838383838
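Because `survived` holds only 0 and 1, the ratio of sum to count is simply the column mean. A minimal check on the first five `survived` values shown in the table above:

```python
import pandas as pd

# First five values of the survived column from the output above
survived = pd.Series([0, 1, 1, 1, 0], name="survived")

rate_ratio = survived.sum() / survived.count()   # survivors / passengers
rate_mean = survived.mean()                      # mean of a 0/1 column

print(rate_ratio, rate_mean)
```

On the full dataset, `df.survived.mean()` should therefore give the same value as the sum/count expression above.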

EXERCISE: Compute the survival rate of passengers aged 25 or younger. (Hint: boolean indexing)

In [125]:

# keep the filtered survived column in a variable
s25 = df[df.age<=25].survived

s25.sum() / s25.count()

Out[125]:

0.4119601328903654
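One thing to be aware of: the `age` column contains missing values, and a comparison such as `age <= 25` evaluates to False for NaN. Passengers of unknown age are therefore left out of this rate. A sketch on a hypothetical four-row frame (`passengers` is made up for illustration):

```python
import pandas as pd

# Hypothetical rows; the second passenger has no recorded age
passengers = pd.DataFrame(
    {
        "age": [22.0, float("nan"), 19.0, 40.0],
        "survived": [0, 1, 1, 0],
    }
)

# NaN <= 25 is False, so the row with the missing age is dropped here
young = passengers[passengers["age"] <= 25]
rate = young["survived"].mean()

n_unknown = int(passengers["age"].isna().sum())

print(len(young), rate, n_unknown)
```

Reporting how many ages are missing alongside the rate makes it clear which passengers the figure actually covers.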

EXERCISE: Compute the survival rate for men and for women. (Hint: boolean indexing)

In [133]:

# male survival rate
sm = df[df.sex=="male"].survived   # values are 0 (died) and 1 (survived)

sm.sum() / sm.count()

Out[133]:

0.18890814558058924

In [135]:

# female survival rate
sf = df[df.sex=="female"].survived   # values are 0 (died) and 1 (survived)

sf.sum() / sf.count()

Out[135]:

0.7420382165605095

In [152]:

# compute with groupby: both sexes at once (first, the raw counts per sex and outcome)
df[["sex","survived"]].sort_values("sex").value_counts()

Out[152]:

sex     survived
male    0           468
female  1           233
male    1           109
female  0            81
dtype: int64

In [151]:

df.groupby("sex").survived.sum()

Out[151]:

sex
female    233
male      109
Name: survived, dtype: int64

In [144]:

df.groupby("sex").survived.count()

Out[144]:

sex
female    314
male      577
Name: survived, dtype: int64

In [145]:

df.groupby("sex").survived.sum() / df.groupby("sex").survived.count()

Out[145]:

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

EXERCISE: A function that computes the survival rate is provided. Fill in the function definition so that the survival rate by sex can be computed with groupby.

In [155]:

# define the survival_ratio function
def survival_ratio(x):
    return x.sum() / x.count()


df.groupby('sex')['survived'].agg(survival_ratio)  

Out[155]:

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
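A custom function passed to `agg` receives each group's values as a Series. The sketch below applies `survival_ratio` to the first five rows of the Titanic table shown earlier and, for comparison, uses named aggregation with built-in reducers; the output column names `survivors`, `passengers`, and `rate` are illustrative choices.

```python
import pandas as pd

# sex and survived for the first five rows of the table above
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "female", "male"],
        "survived": [0, 1, 1, 1, 0],
    }
)


def survival_ratio(x):
    # x is the Series of survived values for one group
    return x.sum() / x.count()


custom = toy.groupby("sex")["survived"].agg(survival_ratio)

# Named aggregation: one output column per keyword
summary = toy.groupby("sex")["survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)

print(custom)
print(summary)
```

With only five rows the rates are extreme, but the `rate` column matches `survival_ratio` group by group, which is the point of the comparison.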

EXERCISE: We want to draw a bar chart of the survival rate by 'pclass'. Fill in the ? parts to draw the bar chart. Use the survival_ratio function defined above to compute the survival rate per pclass.

In [163]:

import matplotlib.pyplot as plt

# survival rate per passenger class, drawn as a bar chart (values are ratios between 0 and 1)
df.groupby("pclass").survived.apply(survival_ratio).plot.bar(rot=0, xlabel="pclass", ylabel="survival rate", title="Survival rate by pclass")

Out[163]:

<AxesSubplot:title={'center':'Survival rate by pclass'}, xlabel='pclass', ylabel='survival rate'>

GitHub code 👉 https://github.com/LIMSONA/KDT/blob/main/pandas/Day_4/%EC%97%B0%EC%8A%B507.ipynb
