Let's work through the official pandas documentation! 🐼
If you'd also like to study NumPy:
2022.02.02 - [😀 Language/- Python] - [numpy] Working through the NumPy quickstart tutorial
In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# widen the notebook cells for a cleaner Tistory upload :-)
Pandas (Quickstart Tutorial)¶
1. Basic Concepts¶
In [2]:
# import the required modules
import pandas as pd
import numpy as np
# visualization package
import matplotlib.pyplot as plt
%matplotlib inline
1) Series¶
In [3]:
# Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]:
s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
2) DataFrame¶
In [5]:
# DataFrame
dates = pd.date_range('20210221', periods=6)
In [6]:
# np.random.randn draws samples from the standard normal (Gaussian) distribution
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
In [7]:
df
Out[7]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 |
3) Creating a DataFrame from a dict¶
In [8]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
In [9]:
df2
Out[9]:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
In [10]:
df2.dtypes
Out[10]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
- To browse a DataFrame's other attributes:
- Jupyter Notebook: type df. and press TAB
- Visual Studio Code: type df. and press CTRL + SPACE
In [11]:
df2.A
Out[11]:
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
2. Inspecting Data¶
1) Basics¶
In [12]:
# first 5 rows (default is 5)
df.head()
Out[12]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
In [13]:
# last 3 rows
df.tail(3)
Out[13]:
A | B | C | D | |
---|---|---|---|---|
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 |
In [14]:
# check the index
df.index
Out[14]:
DatetimeIndex(['2021-02-21', '2021-02-22', '2021-02-23', '2021-02-24',
'2021-02-25', '2021-02-26'],
dtype='datetime64[ns]', freq='D')
In [15]:
# check the columns
df.columns
Out[15]:
Index(['A', 'B', 'C', 'D'], dtype='object')
In [16]:
# check the values, method 1
df.values
Out[16]:
array([[ 0.39864102, 0.75958625, -0.29130656, -0.75518152],
[-0.71859972, 1.50108912, 1.21736887, 1.16600973],
[-1.75974402, 0.60981169, 0.59163491, -0.2198072 ],
[ 0.71959828, 0.73368645, 0.62174929, 0.76944576],
[ 1.614296 , -0.9079704 , 0.56803932, -0.36125121],
[ 1.10296317, 0.33325801, 0.63136262, -0.13829979]])
In [17]:
# check the values, method 2
df.to_numpy()
Out[17]:
array([[ 0.39864102, 0.75958625, -0.29130656, -0.75518152],
[-0.71859972, 1.50108912, 1.21736887, 1.16600973],
[-1.75974402, 0.60981169, 0.59163491, -0.2198072 ],
[ 0.71959828, 0.73368645, 0.62174929, 0.76944576],
[ 1.614296 , -0.9079704 , 0.56803932, -0.36125121],
[ 1.10296317, 0.33325801, 0.63136262, -0.13829979]])
2) Summary Statistics¶
- count: number of observations
- mean: mean
- std: standard deviation
- min: minimum
- 25%: first quartile
- 50%: median
- 75%: third quartile
- max: maximum
In [18]:
df.describe()
Out[18]:
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.226192 | 0.504910 | 0.556475 | 0.076819 |
std | 1.249426 | 0.793255 | 0.483064 | 0.732729 |
min | -1.759744 | -0.907970 | -0.291307 | -0.755182 |
25% | -0.439290 | 0.402396 | 0.573938 | -0.325890 |
50% | 0.559120 | 0.671749 | 0.606692 | -0.179053 |
75% | 1.007122 | 0.753111 | 0.628959 | 0.542509 |
max | 1.614296 | 1.501089 | 1.217369 | 1.166010 |
In [19]:
# shows the values transposed
# transpose turns the index into columns and the columns into the index
df.T
Out[19]:
2021-02-21 | 2021-02-22 | 2021-02-23 | 2021-02-24 | 2021-02-25 | 2021-02-26 | |
---|---|---|---|---|---|---|
A | 0.398641 | -0.718600 | -1.759744 | 0.719598 | 1.614296 | 1.102963 |
B | 0.759586 | 1.501089 | 0.609812 | 0.733686 | -0.907970 | 0.333258 |
C | -0.291307 | 1.217369 | 0.591635 | 0.621749 | 0.568039 | 0.631363 |
D | -0.755182 | 1.166010 | -0.219807 | 0.769446 | -0.361251 | -0.138300 |
In [20]:
df.transpose()
Out[20]:
2021-02-21 | 2021-02-22 | 2021-02-23 | 2021-02-24 | 2021-02-25 | 2021-02-26 | |
---|---|---|---|---|---|---|
A | 0.398641 | -0.718600 | -1.759744 | 0.719598 | 1.614296 | 1.102963 |
B | 0.759586 | 1.501089 | 0.609812 | 0.733686 | -0.907970 | 0.333258 |
C | -0.291307 | 1.217369 | 0.591635 | 0.621749 | 0.568039 | 0.631363 |
D | -0.755182 | 1.166010 | -0.219807 | 0.769446 | -0.361251 | -0.138300 |
3) Sorting¶
- axis: which axis to sort (0: by index, 1: by columns)
- ascending: sort order (False: descending, True: ascending)
- by: the column to sort the data by
In [21]:
# sort_index() sorts by index or column labels
# by columns, descending
df.sort_index(axis=1, ascending=False)
Out[21]:
D | C | B | A | |
---|---|---|---|---|
2021-02-21 | -0.755182 | -0.291307 | 0.759586 | 0.398641 |
2021-02-22 | 1.166010 | 1.217369 | 1.501089 | -0.718600 |
2021-02-23 | -0.219807 | 0.591635 | 0.609812 | -1.759744 |
2021-02-24 | 0.769446 | 0.621749 | 0.733686 | 0.719598 |
2021-02-25 | -0.361251 | 0.568039 | -0.907970 | 1.614296 |
2021-02-26 | -0.138300 | 0.631363 | 0.333258 | 1.102963 |
In [22]:
# sort_values() sorts by values
# sort by column 'B'
df.sort_values(by='B')
Out[22]:
A | B | C | D | |
---|---|---|---|---|
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
3. Selecting Data¶
1) Getting Data¶
In [23]:
df['A']
Out[23]:
2021-02-21 0.398641
2021-02-22 -0.718600
2021-02-23 -1.759744
2021-02-24 0.719598
2021-02-25 1.614296
2021-02-26 1.102963
Freq: D, Name: A, dtype: float64
In [24]:
# slicing
df[0:3]
Out[24]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
In [25]:
df['2021-02-21':'2021-02-23']
Out[25]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
2) Selection by Label¶
In [26]:
df.loc[dates[0]]
Out[26]:
A 0.398641
B 0.759586
C -0.291307
D -0.755182
Name: 2021-02-21 00:00:00, dtype: float64
In [27]:
df.loc[:,["A","B"]]
Out[27]:
A | B | |
---|---|---|
2021-02-21 | 0.398641 | 0.759586 |
2021-02-22 | -0.718600 | 1.501089 |
2021-02-23 | -1.759744 | 0.609812 |
2021-02-24 | 0.719598 | 0.733686 |
2021-02-25 | 1.614296 | -0.907970 |
2021-02-26 | 1.102963 | 0.333258 |
In [28]:
df.loc["2021-02-21":"2021-02-23",["A","B"]]
Out[28]:
A | B | |
---|---|---|
2021-02-21 | 0.398641 | 0.759586 |
2021-02-22 | -0.718600 | 1.501089 |
2021-02-23 | -1.759744 | 0.609812 |
In [29]:
df.loc["2021-02-21",["A","B"]]
Out[29]:
A 0.398641
B 0.759586
Name: 2021-02-21 00:00:00, dtype: float64
In [30]:
# selecting a single index label and column label returns a scalar value!
df.loc[dates[0],"A"]
Out[30]:
0.39864102010877045
3) Selection by Position¶
In [31]:
df.iloc[3]
Out[31]:
A 0.719598
B 0.733686
C 0.621749
D 0.769446
Name: 2021-02-24 00:00:00, dtype: float64
In [32]:
df.iloc[3:5, 0:2]
Out[32]:
A | B | |
---|---|---|
2021-02-24 | 0.719598 | 0.733686 |
2021-02-25 | 1.614296 | -0.907970 |
In [33]:
df.iloc[[1,2,4], [0,2]]
Out[33]:
A | C | |
---|---|---|
2021-02-22 | -0.718600 | 1.217369 |
2021-02-23 | -1.759744 | 0.591635 |
2021-02-25 | 1.614296 | 0.568039 |
In [34]:
df.iloc[1:3,:]
Out[34]:
A | B | C | D | |
---|---|---|---|---|
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
In [35]:
df.iloc[:,1:3]
Out[35]:
B | C | |
---|---|---|
2021-02-21 | 0.759586 | -0.291307 |
2021-02-22 | 1.501089 | 1.217369 |
2021-02-23 | 0.609812 | 0.591635 |
2021-02-24 | 0.733686 | 0.621749 |
2021-02-25 | -0.907970 | 0.568039 |
2021-02-26 | 0.333258 | 0.631363 |
In [36]:
# get the value at row 1, column 1
df.iloc[1,1]
Out[36]:
1.5010891216734277
In [37]:
# same as above, but .iat is faster for scalar access
df.iat[1,1]
Out[37]:
1.5010891216734277
4) Boolean Indexing¶
In [38]:
df[df["A"] > 0]
Out[38]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 |
In [39]:
df[df > 0]
Out[39]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | NaN | NaN |
2021-02-22 | NaN | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | NaN | 0.609812 | 0.591635 | NaN |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | NaN | 0.568039 | NaN |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | NaN |
- filtering with isin()
In [40]:
df2 = df.copy()
df2
Out[40]:
A | B | C | D | |
---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 |
In [41]:
df2["E"] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
Out[41]:
A | B | C | D | E | |
---|---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 | one |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 | one |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 | two |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 | three |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 | four |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 | three |
In [42]:
df2[df2["E"].isin(['two', 'four'])]
Out[42]:
A | B | C | D | E | |
---|---|---|---|---|---|
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 | two |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 | four |
5) Setting Values¶
In [43]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20210221',periods=6))
s1
Out[43]:
2021-02-21 1
2021-02-22 2
2021-02-23 3
2021-02-24 4
2021-02-25 5
2021-02-26 6
Freq: D, dtype: int64
In [44]:
df['F'] = s1
df
Out[44]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.398641 | 0.759586 | -0.291307 | -0.755182 | 1 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 | 2 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 | 3 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 | 4 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 | 5 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 | 6 |
In [45]:
# set the value at index 0, column 'A' to 0
df.at[dates[0], "A"] = 0
df
Out[45]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.759586 | -0.291307 | -0.755182 | 1 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 | 2 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 | 3 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 | 4 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 | 5 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 | 6 |
In [46]:
# set the value at row 0, column 1 to 0
df.iat[0,1] = 0
df
Out[46]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | -0.755182 | 1 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 1.166010 | 2 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | -0.219807 | 3 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 0.769446 | 4 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | -0.361251 | 5 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | -0.138300 | 6 |
In [47]:
# replace column 'D' for the entire index
df.loc[:,'D'] = np.array([5] * len(df))
df
Out[47]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | 5 | 1 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 5 | 2 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | 5 | 3 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 5 | 4 |
2021-02-25 | 1.614296 | -0.907970 | 0.568039 | 5 | 5 |
2021-02-26 | 1.102963 | 0.333258 | 0.631363 | 5 | 6 |
In [48]:
# copy the existing DataFrame
df2 = df.copy()
# flip every value greater than 0 to its negative
df2[df2 > 0] = -df2
df2
Out[48]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | -5 | -1 |
2021-02-22 | -0.718600 | -1.501089 | -1.217369 | -5 | -2 |
2021-02-23 | -1.759744 | -0.609812 | -0.591635 | -5 | -3 |
2021-02-24 | -0.719598 | -0.733686 | -0.621749 | -5 | -4 |
2021-02-25 | -1.614296 | -0.907970 | -0.568039 | -5 | -5 |
2021-02-26 | -1.102963 | -0.333258 | -0.631363 | -5 | -6 |
4. Missing Data¶
reindex()
lets you add, drop, or otherwise change columns or the index
In [49]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
Out[49]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | 5 | 1 | NaN |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 5 | 2 | NaN |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | 5 | 3 | NaN |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 5 | 4 | NaN |
In [50]:
df1.loc[dates[0]:dates[1],'E'] = 1
df1
Out[50]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | 5 | 1 | 1.0 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 5 | 2 | 1.0 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | 5 | 3 | NaN |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 5 | 4 | NaN |
dropna()
drops rows with missing data.
- how='any': drop a row if any of its values is NaN.
- how='all': drop a row only if all of its values are NaN.
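As a quick, self-contained sketch of the how='any' vs how='all' difference (a small hypothetical 3-row frame, not the df1 above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan],
                   "B": [2.0, 3.0, np.nan]})
# how='any': drop rows with at least one NaN -> only the fully observed row survives
print(len(df.dropna(how="any")))  # 1
# how='all': drop only rows that are entirely NaN -> just the last row goes
print(len(df.dropna(how="all")))  # 2
```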
In [51]:
df1.dropna(how="any")
Out[51]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-02-21 | 0.0000 | 0.000000 | -0.291307 | 5 | 1 | 1.0 |
2021-02-22 | -0.7186 | 1.501089 | 1.217369 | 5 | 2 | 1.0 |
fillna()
fills missing data with a given value.
In [52]:
df1.fillna(value=5)
Out[52]:
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | 5 | 1 | 1.0 |
2021-02-22 | -0.718600 | 1.501089 | 1.217369 | 5 | 2 | 1.0 |
2021-02-23 | -1.759744 | 0.609812 | 0.591635 | 5 | 3 | 5.0 |
2021-02-24 | 0.719598 | 0.733686 | 0.621749 | 5 | 4 | 5.0 |
pd.isnull()
returns a Boolean mask showing where data is missing.
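A minimal sketch of that mask (using a small hypothetical frame rather than df1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "E": [1.0, np.nan]})
mask = pd.isnull(df)            # same-shaped Boolean DataFrame
print(mask["E"].tolist())       # [False, True]
print(int(mask.values.sum()))   # 1 -> one missing cell in total
```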
5. Operations¶
1) Stats¶
In [53]:
df.mean()
Out[53]:
A 0.159752
B 0.378312
C 0.556475
D 5.000000
F 3.500000
dtype: float64
In [54]:
# row-wise mean (axis=1)
df.mean(1)
Out[54]:
2021-02-21 1.141739
2021-02-22 1.799972
2021-02-23 1.488341
2021-02-24 2.215007
2021-02-25 2.254873
2021-02-26 2.613517
Freq: D, dtype: float64
In [55]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
Out[55]:
2021-02-21 NaN
2021-02-22 NaN
2021-02-23 1.0
2021-02-24 3.0
2021-02-25 5.0
2021-02-26 NaN
Freq: D, dtype: float64
In [57]:
df.sub(s, axis='index')
Out[57]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | NaN | NaN | NaN | NaN | NaN |
2021-02-22 | NaN | NaN | NaN | NaN | NaN |
2021-02-23 | -2.759744 | -0.390188 | -0.408365 | 4.0 | 2.0 |
2021-02-24 | -2.280402 | -2.266314 | -2.378251 | 2.0 | 1.0 |
2021-02-25 | -3.385704 | -5.907970 | -4.431961 | 0.0 | 0.0 |
2021-02-26 | NaN | NaN | NaN | NaN | NaN |
2) Applying Functions (apply)¶
In [58]:
# cumulative sum of each column (the default axis=0 works column-wise)
df.apply(np.cumsum)
Out[58]:
A | B | C | D | F | |
---|---|---|---|---|---|
2021-02-21 | 0.000000 | 0.000000 | -0.291307 | 5 | 1 |
2021-02-22 | -0.718600 | 1.501089 | 0.926062 | 10 | 3 |
2021-02-23 | -2.478344 | 2.110901 | 1.517697 | 15 | 6 |
2021-02-24 | -1.758745 | 2.844587 | 2.139447 | 20 | 10 |
2021-02-25 | -0.144449 | 1.936617 | 2.707486 | 25 | 15 |
2021-02-26 | 0.958514 | 2.269875 | 3.338848 | 30 | 21 |
In [60]:
# use a lambda to compute max - min for each column
df.apply(lambda x: x.max() - x.min())
Out[60]:
A 3.374040
B 2.409060
C 1.508675
D 0.000000
F 5.000000
dtype: float64
3) Histogramming¶
In [61]:
s = pd.Series(np.random.randint(0,7, size=10))
s
Out[61]:
0 3
1 3
2 6
3 6
4 4
5 2
6 1
7 3
8 1
9 1
dtype: int32
In [62]:
s.value_counts()
Out[62]:
3 3
1 3
6 2
4 1
2 1
dtype: int64
4) String Methods¶
- using regular expressions and vectorized string methods
In [63]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [64]:
s
Out[64]:
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
In [65]:
s.str.lower()
Out[65]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
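The bullet above also mentions regular expressions; the str methods accept regex patterns too. A small sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series(["Aaba", "Baca", "dog"])
# replace every run of lowercase 'a's with '-' using a regex
print(s.str.replace(r"a+", "-", regex=True).tolist())  # ['A-b-', 'B-c-', 'dog']
# count non-overlapping matches of the pattern in each element
print(s.str.count(r"a").tolist())  # [2, 2, 0]
```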
6. Combining Data¶
1) Concatenating with concat¶
In [68]:
df = pd.DataFrame(np.random.randn(10,4))
df
Out[68]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -1.314923 | -1.251372 | -0.492488 | 1.800911 |
1 | 0.627679 | 0.098543 | -0.527927 | 2.243325 |
2 | -0.528805 | -2.242872 | 0.316735 | -0.828650 |
3 | -1.738860 | 0.083917 | 1.682947 | 1.069654 |
4 | -0.399194 | -1.224267 | -0.420403 | 1.329004 |
5 | -0.377907 | -1.245100 | 0.115918 | -1.606797 |
6 | 0.667072 | -0.149351 | 0.545612 | 0.998657 |
7 | -1.094363 | -0.354902 | -0.247636 | 0.614892 |
8 | -0.180308 | 0.769481 | 0.878617 | 1.017410 |
9 | -1.174311 | -0.185481 | 0.803302 | -0.495394 |
In [69]:
pieces = [df[:3], df[3:7], df[7:]]
pieces
Out[69]:
[ 0 1 2 3
0 -1.314923 -1.251372 -0.492488 1.800911
1 0.627679 0.098543 -0.527927 2.243325
2 -0.528805 -2.242872 0.316735 -0.828650,
0 1 2 3
3 -1.738860 0.083917 1.682947 1.069654
4 -0.399194 -1.224267 -0.420403 1.329004
5 -0.377907 -1.245100 0.115918 -1.606797
6 0.667072 -0.149351 0.545612 0.998657,
0 1 2 3
7 -1.094363 -0.354902 -0.247636 0.614892
8 -0.180308 0.769481 0.878617 1.017410
9 -1.174311 -0.185481 0.803302 -0.495394]
In [70]:
pd.concat(pieces)
Out[70]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -1.314923 | -1.251372 | -0.492488 | 1.800911 |
1 | 0.627679 | 0.098543 | -0.527927 | 2.243325 |
2 | -0.528805 | -2.242872 | 0.316735 | -0.828650 |
3 | -1.738860 | 0.083917 | 1.682947 | 1.069654 |
4 | -0.399194 | -1.224267 | -0.420403 | 1.329004 |
5 | -0.377907 | -1.245100 | 0.115918 | -1.606797 |
6 | 0.667072 | -0.149351 | 0.545612 | 0.998657 |
7 | -1.094363 | -0.354902 | -0.247636 | 0.614892 |
8 | -0.180308 | 0.769481 | 0.878617 | 1.017410 |
9 | -1.174311 | -0.185481 | 0.803302 | -0.495394 |
2) Joining (the merge function)¶
- Example 1
-> the key column contains only 'foo'
In [71]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left
Out[71]:
key | lval | |
---|---|---|
0 | foo | 1 |
1 | foo | 2 |
In [72]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right
Out[72]:
key | rval | |
---|---|---|
0 | foo | 4 |
1 | foo | 5 |
In [73]:
pd.merge(left, right, on="key")
Out[73]:
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
- Example 2
-> the key column contains 'foo' and 'bar'
In [74]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left
Out[74]:
key | lval | |
---|---|---|
0 | foo | 1 |
1 | bar | 2 |
In [76]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right
Out[76]:
key | rval | |
---|---|---|
0 | foo | 4 |
1 | bar | 5 |
In [77]:
pd.merge(left, right, on='key')
Out[77]:
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
7. Grouping¶
In [79]:
# example data
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
Out[79]:
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | -0.122736 | 0.921345 |
1 | bar | one | -0.638045 | 1.065631 |
2 | foo | two | 1.490847 | 0.350991 |
3 | bar | three | -1.566600 | 1.404478 |
4 | foo | two | -0.005849 | 1.642857 |
5 | bar | two | -0.652784 | 1.253943 |
6 | foo | one | -1.886499 | 0.381387 |
7 | foo | three | -0.346925 | 0.370780 |
In [80]:
# group rows with the same value in column A, then sum()
df.groupby("A").sum()
Out[80]:
C | D | |
---|---|---|
A | ||
bar | -2.857429 | 3.724052 |
foo | -0.871163 | 3.667360 |
In [81]:
# group by 'A', then by 'B', then sum()
df.groupby(["A","B"]).sum()
Out[81]:
C | D | ||
---|---|---|---|
A | B | ||
bar | one | -0.638045 | 1.065631 |
three | -1.566600 | 1.404478 | |
two | -0.652784 | 1.253943 | |
foo | one | -2.009235 | 1.302732 |
three | -0.346925 | 0.370780 | |
two | 1.484997 | 1.993848 |
8. Reshaping¶
1) Stack¶
- stack() moves the data into the index levels; the docs describe this as "compressing" the DataFrame
In [83]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
Out[83]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
- MultiIndex: as the name says, a way to use several index levels at once
In [87]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
type(index)
Out[87]:
pandas.core.indexes.multi.MultiIndex
In [85]:
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
Out[85]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | -1.343294 | 0.182213 |
two | -2.309475 | 1.002739 | |
baz | one | 0.641963 | 0.320690 |
two | -1.362377 | -0.471617 | |
foo | one | -0.904899 | -0.239375 |
two | -0.037375 | 0.184227 | |
qux | one | -0.322336 | -0.965040 |
two | -1.105936 | -0.589923 |
In [89]:
df2 = df[:4]
df2
Out[89]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | -1.343294 | 0.182213 |
two | -2.309475 | 1.002739 | |
baz | one | 0.641963 | 0.320690 |
two | -1.362377 | -0.471617 |
In [91]:
stacked = df2.stack()
stacked
Out[91]:
first second
bar one A -1.343294
B 0.182213
two A -2.309475
B 1.002739
baz one A 0.641963
B 0.320690
two A -1.362377
B -0.471617
dtype: float64
In [92]:
# check the type ---> it's a Series!!
type(stacked)
Out[92]:
pandas.core.series.Series
- unstack
- restores a "stacked" DataFrame or Series to its original form.
- you can choose which level to unstack (decompress).
In [96]:
# level -1: unstacks the last level, restoring the original form
stacked.unstack()
Out[96]:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | -1.343294 | 0.182213 |
two | -2.309475 | 1.002739 | |
baz | one | 0.641963 | 0.320690 |
two | -1.362377 | -0.471617 |
In [98]:
# level 0: the 'first' index level is unstacked >>> bar and baz become columns
# because index level [0] is 'first'
stacked.unstack(0)
Out[98]:
first | bar | baz | |
---|---|---|---|
second | |||
one | A | -1.343294 | 0.641963 |
B | 0.182213 | 0.320690 | |
two | A | -2.309475 | -1.362377 |
B | 1.002739 | -0.471617 |
In [99]:
# level 1: the 'second' index level is unstacked >>> one and two become columns
# because index level [1] is 'second'
stacked.unstack(1)
Out[99]:
second | one | two | |
---|---|---|---|
first | |||
bar | A | -1.343294 | -2.309475 |
B | 0.182213 | 1.002739 | |
baz | A | 0.641963 | -1.362377 |
B | 0.320690 | -0.471617 |
2) Pivot Tables¶
: pulls just the data you need out of a large dataset and lays it out in a new table
In [100]:
# create an example df
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
df
Out[100]:
A | B | C | D | E | |
---|---|---|---|---|---|
0 | one | A | foo | -0.064300 | 0.903790 |
1 | one | B | foo | 1.103717 | -0.163038 |
2 | two | C | foo | 2.142982 | -0.873188 |
3 | three | A | bar | 1.591926 | -0.600806 |
4 | one | B | bar | 1.821011 | -0.261866 |
5 | one | C | bar | -1.450863 | -1.112504 |
6 | two | A | foo | 1.119701 | 1.087266 |
7 | three | B | foo | -1.533084 | 0.476024 |
8 | one | C | foo | -2.067177 | -1.262980 |
9 | one | A | bar | -0.373278 | -0.858273 |
10 | two | B | bar | 0.054464 | -0.593588 |
11 | three | C | bar | -0.120968 | -1.113981 |
In [101]:
pd.pivot_table(df, values="D", index=["A","B"], columns=["C"])
Out[101]:
C | bar | foo | |
---|---|---|---|
A | B | ||
one | A | -0.373278 | -0.064300 |
B | 1.821011 | 1.103717 | |
C | -1.450863 | -2.067177 | |
three | A | 1.591926 | NaN |
B | NaN | -1.533084 | |
C | -0.120968 | NaN | |
two | A | NaN | 1.119701 |
B | 0.054464 | NaN | |
C | NaN | 2.142982 |
9. Time Series --> tough stuff... 😭¶
- resample(): resampling data
In [114]:
import pandas as pd
# create 100 indices at 1-second intervals
# periods=100 for 100 entries / freq="S" for seconds
rng = pd.date_range("2/22/2022", periods=100, freq="S")
rng
Out[114]:
DatetimeIndex(['2022-02-22 00:00:00', '2022-02-22 00:00:01',
'2022-02-22 00:00:02', '2022-02-22 00:00:03',
'2022-02-22 00:00:04', '2022-02-22 00:00:05',
'2022-02-22 00:00:06', '2022-02-22 00:00:07',
'2022-02-22 00:00:08', '2022-02-22 00:00:09',
'2022-02-22 00:00:10', '2022-02-22 00:00:11',
'2022-02-22 00:00:12', '2022-02-22 00:00:13',
'2022-02-22 00:00:14', '2022-02-22 00:00:15',
'2022-02-22 00:00:16', '2022-02-22 00:00:17',
'2022-02-22 00:00:18', '2022-02-22 00:00:19',
'2022-02-22 00:00:20', '2022-02-22 00:00:21',
'2022-02-22 00:00:22', '2022-02-22 00:00:23',
'2022-02-22 00:00:24', '2022-02-22 00:00:25',
'2022-02-22 00:00:26', '2022-02-22 00:00:27',
'2022-02-22 00:00:28', '2022-02-22 00:00:29',
'2022-02-22 00:00:30', '2022-02-22 00:00:31',
'2022-02-22 00:00:32', '2022-02-22 00:00:33',
'2022-02-22 00:00:34', '2022-02-22 00:00:35',
'2022-02-22 00:00:36', '2022-02-22 00:00:37',
'2022-02-22 00:00:38', '2022-02-22 00:00:39',
'2022-02-22 00:00:40', '2022-02-22 00:00:41',
'2022-02-22 00:00:42', '2022-02-22 00:00:43',
'2022-02-22 00:00:44', '2022-02-22 00:00:45',
'2022-02-22 00:00:46', '2022-02-22 00:00:47',
'2022-02-22 00:00:48', '2022-02-22 00:00:49',
'2022-02-22 00:00:50', '2022-02-22 00:00:51',
'2022-02-22 00:00:52', '2022-02-22 00:00:53',
'2022-02-22 00:00:54', '2022-02-22 00:00:55',
'2022-02-22 00:00:56', '2022-02-22 00:00:57',
'2022-02-22 00:00:58', '2022-02-22 00:00:59',
'2022-02-22 00:01:00', '2022-02-22 00:01:01',
'2022-02-22 00:01:02', '2022-02-22 00:01:03',
'2022-02-22 00:01:04', '2022-02-22 00:01:05',
'2022-02-22 00:01:06', '2022-02-22 00:01:07',
'2022-02-22 00:01:08', '2022-02-22 00:01:09',
'2022-02-22 00:01:10', '2022-02-22 00:01:11',
'2022-02-22 00:01:12', '2022-02-22 00:01:13',
'2022-02-22 00:01:14', '2022-02-22 00:01:15',
'2022-02-22 00:01:16', '2022-02-22 00:01:17',
'2022-02-22 00:01:18', '2022-02-22 00:01:19',
'2022-02-22 00:01:20', '2022-02-22 00:01:21',
'2022-02-22 00:01:22', '2022-02-22 00:01:23',
'2022-02-22 00:01:24', '2022-02-22 00:01:25',
'2022-02-22 00:01:26', '2022-02-22 00:01:27',
'2022-02-22 00:01:28', '2022-02-22 00:01:29',
'2022-02-22 00:01:30', '2022-02-22 00:01:31',
'2022-02-22 00:01:32', '2022-02-22 00:01:33',
'2022-02-22 00:01:34', '2022-02-22 00:01:35',
'2022-02-22 00:01:36', '2022-02-22 00:01:37',
'2022-02-22 00:01:38', '2022-02-22 00:01:39'],
dtype='datetime64[ns]', freq='S')
In [115]:
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts
Out[115]:
2022-02-22 00:00:00 471
2022-02-22 00:00:01 6
2022-02-22 00:00:02 299
2022-02-22 00:00:03 342
2022-02-22 00:00:04 430
...
2022-02-22 00:01:35 37
2022-02-22 00:01:36 338
2022-02-22 00:01:37 74
2022-02-22 00:01:38 376
2022-02-22 00:01:39 356
Freq: S, Length: 100, dtype: int32
In [118]:
# resample into 5-minute bins and sum()
ts.resample("5Min").sum()
Out[118]:
2022-02-22 26159
Freq: 5T, dtype: int32
In [120]:
# can be converted to various time zones
rng = pd.date_range('2/22/2022 22:22', periods=5, freq='D')
rng
Out[120]:
DatetimeIndex(['2022-02-22 22:22:00', '2022-02-23 22:22:00',
'2022-02-24 22:22:00', '2022-02-25 22:22:00',
'2022-02-26 22:22:00'],
dtype='datetime64[ns]', freq='D')
In [122]:
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
Out[122]:
2022-02-22 22:22:00 0.012582
2022-02-23 22:22:00 1.770963
2022-02-24 22:22:00 -0.168406
2022-02-25 22:22:00 -1.342124
2022-02-26 22:22:00 -1.796392
Freq: D, dtype: float64
In [124]:
# localize to UTC
ts_utc = ts.tz_localize("UTC")
ts_utc
Out[124]:
2022-02-22 22:22:00+00:00 0.012582
2022-02-23 22:22:00+00:00 1.770963
2022-02-24 22:22:00+00:00 -0.168406
2022-02-25 22:22:00+00:00 -1.342124
2022-02-26 22:22:00+00:00 -1.796392
Freq: D, dtype: float64
In [126]:
# convert to US Eastern time
ts_utc.tz_convert('US/Eastern')
Out[126]:
2022-02-22 17:22:00-05:00 0.012582
2022-02-23 17:22:00-05:00 1.770963
2022-02-24 17:22:00-05:00 -0.168406
2022-02-25 17:22:00-05:00 -1.342124
2022-02-26 17:22:00-05:00 -1.796392
Freq: D, dtype: float64
In [131]:
# time spans are also easy to express
rng = pd.date_range("1/1/2022", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
Out[131]:
2022-01-31 -0.433890
2022-02-28 -0.495769
2022-03-31 -1.781772
2022-04-30 -1.031126
2022-05-31 -0.609370
Freq: M, dtype: float64
- arithmetic operations can be applied between period and timestamp representations
- example: adding 3 hours
In [135]:
# quarterly, 1990-2000
prng = pd.period_range("1990Q1","2000Q4", freq="Q-NOV")
prng
Out[135]:
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='period[Q-NOV]')
In [136]:
ts = pd.Series(np.random.randn(len(prng)), prng)
ts
Out[136]:
1990Q1 1.513256
1990Q2 -0.104854
1990Q3 0.716774
1990Q4 -0.984495
1991Q1 0.435139
1991Q2 -0.231029
1991Q3 0.625224
1991Q4 -0.161059
1992Q1 -0.175423
1992Q2 0.372830
1992Q3 -1.842811
1992Q4 -0.002508
1993Q1 -0.665471
1993Q2 0.406707
1993Q3 0.614976
1993Q4 0.535741
1994Q1 0.205205
1994Q2 -0.862236
1994Q3 -1.074272
1994Q4 -0.642100
1995Q1 1.263887
1995Q2 -0.622753
1995Q3 -0.398868
1995Q4 2.750104
1996Q1 -0.968138
1996Q2 0.171290
1996Q3 0.257609
1996Q4 -1.770340
1997Q1 0.277995
1997Q2 -0.189288
1997Q3 1.291642
1997Q4 0.875653
1998Q1 0.445944
1998Q2 -0.123426
1998Q3 -0.365562
1998Q4 0.284759
1999Q1 -0.904108
1999Q2 -0.662248
1999Q3 -0.610624
1999Q4 0.618188
2000Q1 -0.558949
2000Q2 -0.309683
2000Q3 0.430652
2000Q4 -1.173197
Freq: Q-NOV, dtype: float64
In [140]:
# add 3 hours!
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 3
ts.index
Out[140]:
PeriodIndex(['1990-03-01 03:00', '1990-06-01 03:00', '1990-09-01 03:00',
'1990-12-01 03:00', '1991-03-01 03:00', '1991-06-01 03:00',
'1991-09-01 03:00', '1991-12-01 03:00', '1992-03-01 03:00',
'1992-06-01 03:00', '1992-09-01 03:00', '1992-12-01 03:00',
'1993-03-01 03:00', '1993-06-01 03:00', '1993-09-01 03:00',
'1993-12-01 03:00', '1994-03-01 03:00', '1994-06-01 03:00',
'1994-09-01 03:00', '1994-12-01 03:00', '1995-03-01 03:00',
'1995-06-01 03:00', '1995-09-01 03:00', '1995-12-01 03:00',
'1996-03-01 03:00', '1996-06-01 03:00', '1996-09-01 03:00',
'1996-12-01 03:00', '1997-03-01 03:00', '1997-06-01 03:00',
'1997-09-01 03:00', '1997-12-01 03:00', '1998-03-01 03:00',
'1998-06-01 03:00', '1998-09-01 03:00', '1998-12-01 03:00',
'1999-03-01 03:00', '1999-06-01 03:00', '1999-09-01 03:00',
'1999-12-01 03:00', '2000-03-01 03:00', '2000-06-01 03:00',
'2000-09-01 03:00', '2000-12-01 03:00'],
dtype='period[H]')
10. Categorical Data¶
In [153]:
# convert to the categorical dtype with .astype("category")
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'c']})
df
Out[153]:
id | raw_grade | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | b |
3 | 4 | a |
4 | 5 | a |
5 | 6 | c |
In [154]:
df["grade"]= df["raw_grade"].astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6 non-null int64
1 raw_grade 6 non-null object
2 grade 6 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 362.0+ bytes
In [155]:
df["grade"].unique()
Out[155]:
['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
In [156]:
# the cat.categories attribute can be used to rename the categories
# a -> very good / b -> good / c -> very bad
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
Out[156]:
id | raw_grade | grade | |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | c | very bad |
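Note: in recent pandas versions, assigning to `cat.categories` directly is no longer allowed; `rename_categories()` does the same job and works across versions. A minimal sketch:

```python
import pandas as pd

grade = pd.Series(["a", "b", "b", "a", "a", "c"]).astype("category")
# rename positionally: a -> very good, b -> good, c -> very bad
grade = grade.cat.rename_categories(["very good", "good", "very bad"])
print(grade.tolist()[:3])  # ['very good', 'good', 'good']
```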
In [158]:
# sorting categorical data: sort_values() sorts by category order
df.sort_values(by="grade")
Out[158]:
id | raw_grade | grade | |
---|---|---|---|
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
5 | 6 | c | very bad |
In [164]:
# grouping and printing the counts also makes empty categories easy to spot
df.groupby("grade").size()
Out[164]:
grade
very good 3
good 2
very bad 1
dtype: int64
11. Plotting¶
- data visualization uses the matplotlib API
In [169]:
# create 1000 rows of random values
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2022', periods=1000))
ts
Out[169]:
2022-01-01 0.489242
2022-01-02 -0.927824
2022-01-03 -0.327465
2022-01-04 0.136872
2022-01-05 -0.476714
...
2024-09-22 0.415104
2024-09-23 -0.105425
2024-09-24 2.139681
2024-09-25 0.207584
2024-09-26 -0.559396
Freq: D, Length: 1000, dtype: float64
In [170]:
# compute the cumulative sum
ts = ts.cumsum()
ts
Out[170]:
2022-01-01 0.489242
2022-01-02 -0.438582
2022-01-03 -0.766048
2022-01-04 -0.629176
2022-01-05 -1.105890
...
2024-09-22 32.239292
2024-09-23 32.133867
2024-09-24 34.273548
2024-09-25 34.481131
2024-09-26 33.921735
Freq: D, Length: 1000, dtype: float64
In [171]:
# plot it
ts.plot()
Out[171]:
<AxesSubplot:>
- on a DataFrame, plot() draws all columns at once
In [172]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"])
df
Out[172]:
A | B | C | D | |
---|---|---|---|---|
2022-01-01 | -0.209252 | 1.102672 | 0.924835 | -0.949641 |
2022-01-02 | -0.155422 | -1.550508 | 1.924743 | 0.361616 |
2022-01-03 | 0.077475 | 0.427442 | -1.639438 | -0.499659 |
2022-01-04 | -1.415768 | 0.353553 | 0.765075 | 0.043488 |
2022-01-05 | 1.959414 | 1.765033 | 0.334230 | -1.449789 |
... | ... | ... | ... | ... |
2024-09-22 | -0.867605 | 0.874456 | -0.985099 | 1.245273 |
2024-09-23 | -0.343630 | -0.243843 | 1.700093 | -0.032175 |
2024-09-24 | 0.806588 | 1.955295 | 1.493937 | 0.254980 |
2024-09-25 | 0.150937 | 0.186729 | -0.020129 | 1.306353 |
2024-09-26 | -0.832717 | -0.294828 | 2.045209 | 0.321856 |
1000 rows × 4 columns
In [173]:
# cumulative sum
df= df.cumsum()
df
Out[173]:
A | B | C | D | |
---|---|---|---|---|
2022-01-01 | -0.209252 | 1.102672 | 0.924835 | -0.949641 |
2022-01-02 | -0.364674 | -0.447835 | 2.849578 | -0.588026 |
2022-01-03 | -0.287199 | -0.020393 | 1.210140 | -1.087685 |
2022-01-04 | -1.702966 | 0.333160 | 1.975215 | -1.044197 |
2022-01-05 | 0.256447 | 2.098193 | 2.309444 | -2.493986 |
... | ... | ... | ... | ... |
2024-09-22 | -11.370808 | 24.611151 | -14.216113 | -33.472528 |
2024-09-23 | -11.714438 | 24.367309 | -12.516019 | -33.504702 |
2024-09-24 | -10.907850 | 26.322604 | -11.022082 | -33.249722 |
2024-09-25 | -10.756912 | 26.509333 | -11.042212 | -31.943369 |
2024-09-26 | -11.589629 | 26.214505 | -8.997002 | -31.621513 |
1000 rows × 4 columns
In [174]:
plt.figure()
df.plot()
plt.legend(loc="best")
Out[174]:
<matplotlib.legend.Legend at 0x1e6ef3641c0>
<Figure size 432x288 with 0 Axes>
In [176]:
# roughly equivalent: enable the legend from the plot call itself
df.plot(legend="best")
Out[176]:
<AxesSubplot:>
12. File I/O¶
1) CSV¶
- .to_csv() exports (saves) a DataFrame to a csv file
- pd.read_csv() reads a csv file into a DataFrame
2) HDF5¶
- to_hdf() exports (saves) to the HDF5 format
- pd.read_hdf() reads an HDF5 file into a DataFrame
3) Excel¶
- to_excel() exports (saves) to an xlsx file
- pd.read_excel() reads an xlsx file into a DataFrame
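For example, a CSV round trip might look like this (writing to a hypothetical foo.csv in the temp directory):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
path = os.path.join(tempfile.gettempdir(), "foo.csv")  # hypothetical file name
df.to_csv(path, index=False)   # export, dropping the RangeIndex
df2 = pd.read_csv(path)        # read back into a DataFrame
print(df.equals(df2))          # True
```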
The notebook with all the code above is saved on GitHub! 👉 https://github.com/LIMSONA/Big_Data_Study/blob/main/Pandas%20(Quickstart%20Tutorial).ipynb