pandas II¶

selection & drop¶

import pandas as pd
from pandas import Series
from pandas import DataFrame
import numpy as np

raw_data = {
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "city": ["San Francisco", "Baltimore", "Miami", "Douglas", "Boston"],
}
df = pd.DataFrame(raw_data)
df

Selecting a single column¶

df['first_name'].head(3)

0    Jason
1    Molly
2     Tina
Name: first_name, dtype: object

Selecting one or more columns¶

df[['first_name','last_name','city']]

Transposing for easier viewing when there are many columns¶

df.head(2).T

display(df['age'])
display(df[['age']])

0    42
1    52
2    36
3    24
4    73
Name: age, dtype: int64

When selecting a single column, one pair of brackets returns a Series, and two pairs of brackets return a DataFrame.
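A quick way to confirm the difference is to check the type of each result. The sketch below rebuilds a smaller stand-in for the DataFrame above so that it runs on its own; the name `people` and its contents are illustrative assumptions.

```python
import pandas as pd

# A small stand-in for the DataFrame used above (illustrative data only).
people = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina"],
    "age": [42, 52, 36],
})

as_series = people["age"]     # single brackets -> Series
as_frame = people[["age"]]    # double brackets -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```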

display(df[:3])
display(df['city'][:2])   # returns a Series

0    San Francisco
1        Baltimore
Name: city, dtype: object

An integer slice used without a column name selects rows.

age_series=df['age']
age_series[[0,2,4]]

0    42
2    36
4    73
Name: age, dtype: int64

Selecting more than one index label from a Series.

age_series[age_series>40]

0    42
1    52
4    73
Name: age, dtype: int64

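Boolean indexing: the comparison produces a True/False mask, and only the entries where the mask is True are kept.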
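Boolean masks can also be combined. The following is a minimal sketch that assumes the same ages as above; the name `ages` is illustrative. Each condition needs its own parentheses, and `&` is used because Python's `and` does not work element-wise.

```python
import pandas as pd

ages = pd.Series([42, 52, 36, 24, 73], name="age")

# Each comparison yields a Boolean Series aligned with `ages`.
mask = (ages > 40) & (ages < 60)

selected = ages[mask]
print(mask.tolist())      # [True, True, False, False, False]
print(selected.tolist())  # [42, 52]
```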

Setting the index¶

display(df.set_index('city'))
display(df)

set_index does not modify the original data. To use the re-indexed DataFrame, assign it to a new variable.
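As a small illustration of assigning the result, here is a sketch with a two-row stand-in DataFrame (the names `df_people` and `by_city` are assumptions made for this example):

```python
import pandas as pd

df_people = pd.DataFrame({
    "first_name": ["Jason", "Molly"],
    "city": ["San Francisco", "Baltimore"],
})

# set_index returns a new DataFrame; df_people keeps its default integer index.
by_city = df_people.set_index("city")

print(list(df_people.columns))                  # ['first_name', 'city']
print(list(by_city.index))                      # ['San Francisco', 'Baltimore']
print(by_city.loc["Baltimore", "first_name"])   # Molly
```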

data drop¶

df.drop(1)

Drops a row by its index number.

df.drop([0,2,3])

Drops one or more rows by index number.

df.drop('city',axis=1)

The axis argument chooses the axis to drop along: axis=1 drops columns, and the default axis=0 drops rows.
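The two directions can be compared side by side. This sketch uses a reduced stand-in DataFrame (`df_people` is an illustrative name) and also shows the `columns=` keyword, which is an equivalent spelling of `axis=1`.

```python
import pandas as pd

df_people = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina"],
    "age": [42, 52, 36],
    "city": ["San Francisco", "Baltimore", "Miami"],
})

rows_dropped = df_people.drop([0, 2])            # axis=0 by default: drops rows 0 and 2
cols_dropped = df_people.drop(columns=["city"])  # same as drop("city", axis=1)

print(rows_dropped.index.tolist())    # [1]
print(cols_dropped.columns.tolist())  # ['first_name', 'age']
print(df_people.shape)                # (3, 3): the original is unchanged
```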

Series Operation¶

s1 = Series(range(1, 6), index=list("abced"))
s1

a    1
b    2
c    3
e    4
d    5
dtype: int64

s2 = Series(range(5, 11), index=list("bcedef"))
s2

b     5
c     6
e     7
d     8
e     9
f    10
dtype: int64

s1+s2

a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN
dtype: float64

Operations are aligned on the index. A label that does not appear in both Series yields NaN.
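If NaN is not the desired result for unmatched labels, `Series.add` accepts a `fill_value`. A minimal sketch with two small Series (the names `left` and `right` are illustrative):

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list("abc"))
right = pd.Series([10, 20, 30], index=list("bcd"))

plain = left + right                     # 'a' and 'd' have no partner -> NaN
filled = left.add(right, fill_value=0)   # the missing side is treated as 0

print(plain.isna().tolist())  # [True, False, False, True]
print(filled.tolist())        # [1.0, 12.0, 23.0, 30.0]
```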

DataFrame Operation¶

df1 = DataFrame(np.arange(9).reshape(3, 3), columns=list("abc"))
df1

df2 = DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))
df2

df1+df2

(df1+df2).fillna(0)

For DataFrames, both the columns and the index are taken into account when aligning.
The fillna() method is used to handle the resulting NaN values.
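Filling after the addition and filling during the addition give different results. The sketch below, using two small illustrative frames named `small` and `large`, contrasts `fillna(0)` with the `fill_value` argument of `DataFrame.add`.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("ab"))
large = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("abc"))

after = (small + large).fillna(0)        # cells missing on either side become 0
during = small.add(large, fill_value=0)  # a missing cell counts as 0 in the sum

print(after.loc[2].tolist())   # [0.0, 0.0, 0.0]
print(during.loc[2].tolist())  # [6.0, 7.0, 8.0]
```

Which one is appropriate depends on whether a value present in only one frame should survive in the result.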

Series+DataFrame¶

df = DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))
df

s = Series(np.arange(10, 14), index=list("abcd"))
s

a    10
b    11
c    12
d    13
dtype: int32

df+s

Broadcasting takes place: the Series index is matched against the DataFrame's columns, and the Series is added to every row.
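To broadcast down the rows instead, `DataFrame.add` takes an `axis` argument. This is a small sketch with made-up values; `grid`, `per_column`, and `per_row` are illustrative names.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("abc"))
per_column = pd.Series([10, 20, 30], index=list("abc"))
per_row = pd.Series([100, 200], index=[0, 1])

by_columns = grid + per_column        # Series index matched to column labels
by_rows = grid.add(per_row, axis=0)   # Series index matched to row labels

print(by_columns.values.tolist())  # [[10, 21, 32], [13, 24, 35]]
print(by_rows.values.tolist())     # [[100, 101, 102], [203, 204, 205]]
```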

lambda & map & apply¶

map¶

The map function can also be used on pandas Series data.
Instead of a function, a dict or another Series can be passed.

s1=Series(np.arange(10))
s1.head(5)

0    0
1    1
2    2
3    3
4    4
dtype: int32

s1.map(lambda x: x**2).head()

0     0
1     1
2     4
3     9
4    16
dtype: int64

z = {1: "A", 2: "B", 3: "C"}
s1.map(z)

0    NaN
1      A
2      B
3      C
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
dtype: object

Values are replaced according to the dict. Values with no matching key become NaN.
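When NaN is not acceptable for unmatched values, one option is to chain `fillna` after `map`. A minimal sketch, where the `"unknown"` placeholder is an arbitrary choice for illustration:

```python
import pandas as pd

codes = pd.Series([1, 2, 3, 4])
labels = {1: "A", 2: "B", 3: "C"}

mapped = codes.map(labels)                # 4 has no key -> NaN
with_default = mapped.fillna("unknown")   # replace the NaN with a placeholder

print(mapped.isna().tolist())   # [False, False, False, True]
print(with_default.tolist())    # ['A', 'B', 'C', 'unknown']
```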

s2 = Series(np.arange(10, 30))
s1.map(s2)

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

Each value of s1 is looked up as an index label in s2 and replaced by the s2 value stored under that label.
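The lookup behaviour is easier to see when the values are not in index order. In this sketch (`keys` and `lookup` are illustrative names), the value 5 has no matching label in the lookup Series and therefore maps to NaN.

```python
import pandas as pd

keys = pd.Series([2, 0, 5])
lookup = pd.Series(["zero", "one", "two"], index=[0, 1, 2])

# Each value of `keys` is used as a label into lookup's index.
result = keys.map(lookup)

print(result.tolist()[:2])     # ['two', 'zero']
print(result.isna().tolist())  # [False, False, True]
```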

Example¶

df=pd.read_csv('./data/wages.csv')
df.head()

df['sex'].unique()

array(['male', 'female'], dtype=object)

The .unique() method returns the distinct values with duplicates removed.

df['sex_code']=df['sex'].map({'male':0,'female':1})
df.head()

def change_sex(x):
    return 0 if x == "male" else 1

df.sex.map(change_sex).head()

0    0
1    1
2    1
3    1
4    1
Name: sex, dtype: int64

replace¶

replace covers only the value-conversion part of what map does.
It is used frequently when converting data.

df.sex.replace({"male": 0, "female": 1}).head()

0    0
1    1
2    1
3    1
4    1
Name: sex, dtype: int64

Applying a dict.

df.sex.replace(["male", "female"], [0, 1], inplace=True)
df.head()

A target list and a conversion list can be passed as arguments.
The inplace argument means the conversion result is applied directly to the data.
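One practical difference between the two is how unmatched values are handled. The sketch below uses a made-up Series with an extra `"other"` value that is absent from the dict; `dtype=object` is set explicitly so the behaviour does not depend on the default string dtype of the installed pandas.

```python
import pandas as pd

sex = pd.Series(["male", "female", "other"], dtype=object)
codes = {"male": 0, "female": 1}

via_map = sex.map(codes)          # unmatched "other" becomes NaN
via_replace = sex.replace(codes)  # unmatched "other" is left as-is

print(via_map.isna().tolist())  # [False, False, True]
print(via_replace.tolist())     # [0, 1, 'other']
```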

apply¶

apply gives the same result as calling the built-in aggregation methods directly.
mean, std, and others can be used the same way.

df_info = df[["earn", "height", "age"]]
df_info.sum()

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64

df_info.apply(sum)

earn      4.474344e+07
height    9.183125e+04
age       6.250800e+04
dtype: float64

def f(x):
    return Series(
        [x.min(), x.max(), x.mean(), sum(x.isnull())],
        index=["min", "max", "mean", "null"])

df_info.apply(f)

Besides a scalar, the function may also return a Series.
Think of apply as iterating over the columns one at a time.
The argument x receives each column of the DataFrame in turn.
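apply can also iterate over rows by passing `axis=1`. A minimal sketch with a two-row illustrative frame named `info`:

```python
import pandas as pd

info = pd.DataFrame({"height": [60.0, 70.0], "age": [20, 40]})

# Default axis=0: the lambda receives one column (a Series) at a time.
col_range = info.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row (a Series indexed by column name) at a time.
row_total = info.apply(lambda row: row["height"] + row["age"], axis=1)

print(col_range.to_dict())  # {'height': 10.0, 'age': 20.0}
print(row_total.tolist())   # [80.0, 110.0]
```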

applymap¶

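applymap applies the function to each element rather than to each Series.
The effect is the same as calling apply on a single Series.
(In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)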

f = lambda x: x // 2
df_info.applymap(f).head(5)

f = lambda x: x ** 2
df_info["earn"].apply(f)

0       6.331592e+09
1       9.292379e+09
2       2.372729e+09
3       6.476724e+09
4       6.738661e+09
            ...     
1374    9.104329e+08
1375    6.176974e+08
1376    1.879825e+08
1377    9.106124e+09
1378    9.168947e+07
Name: earn, Length: 1379, dtype: float64
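The equivalence between element-wise application on a whole frame and on a single Series can be sketched as follows. The data is made up, and the whole-frame version is written as a per-column `Series.map` so the sketch does not depend on which name (`applymap` or `DataFrame.map`) the installed pandas version provides.

```python
import pandas as pd

info = pd.DataFrame({"earn": [79571.3, 96397.0], "age": [49, 62]})
halve = lambda x: x // 2

# Element-wise over the whole frame: map every column with the same function.
whole_frame = info.apply(lambda col: col.map(halve))

# Series.apply with a plain Python function is already element-wise.
one_column = info["age"].apply(halve)

print(whole_frame["age"].tolist())  # [24, 31]
print(one_column.tolist())          # [24, 31]
```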

pandas built-in functions¶

describe¶

Shows summary statistics for the numeric columns.

df = pd.read_csv("data/wages.csv")
df.head()

df.describe()

unique¶

Returns the unique values of a Series as an array.

df.race.unique()

array(['white', 'other', 'hispanic', 'black'], dtype=object)

dict(enumerate(df["race"].unique()))

{0: 'white', 1: 'other', 2: 'hispanic', 3: 'black'}

enumerate pairs each unique value with an integer, so the result can be used as a dict keyed by that integer.

value = list(map(int, np.array(list(enumerate(df["race"].unique())))[:, 0].tolist()))
key = np.array(list(enumerate(df["race"].unique())), dtype=str)[:, 1].tolist()

value, key

([0, 1, 2, 3], ['white', 'other', 'hispanic', 'black'])

The .tolist() method converts an array to a list.

df["race"].replace(to_replace=key, value=value)

0       0
1       0
2       0
3       1
4       0
       ..
1374    0
1375    0
1376    0
1377    0
1378    0
Name: race, Length: 1379, dtype: int64

As shown above, these lists can be used to number the values in the race column and convert each label to its number.
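The same encoding can be written more compactly with a dict comprehension and `map`. This is an alternative sketch, not the approach used above, and it runs on a short made-up Series rather than the wages data.

```python
import pandas as pd

race = pd.Series(["white", "other", "white", "black"])

# Number the unique labels in order of first appearance.
label_to_code = {label: code for code, label in enumerate(race.unique())}
encoded = race.map(label_to_code)

print(label_to_code)     # {'white': 0, 'other': 1, 'black': 2}
print(encoded.tolist())  # [0, 1, 0, 2]
```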

sum¶

Supports basic operations over column or row values.
sub, mean, min, max, count, median, mad, var, and so on are also available (mad was removed in pandas 2.0).

# per column
df.sum(axis=0)

earn                                            4.47434e+07
height                                              91831.2
sex       malefemalefemalefemalefemalefemalefemalemalema...
race      whitewhitewhiteotherwhitewhitewhitewhitehispan...
ed                                                    18416
age                                                   62508
dtype: object

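# per row (numeric_only=True skips the string columns, which pandas 2.x no longer drops automatically)
df.sum(axis=1, numeric_only=True)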

0       79710.189011
1       96541.218643
2       48823.436947
3       80652.316153
4       82212.425498
            ...     
1374    30290.060363
1375    25018.829514
1376    13823.311312
1377    95563.664410
1378     9686.681857
Length: 1379, dtype: float64

isnull¶

Returns a Boolean mask that is True wherever a column or row value is NaN (null).

df.isnull().head()

df.isnull().sum()

earn      0
height    0
sex       0
race      0
ed        0
age       0
dtype: int64

The number of null values in each column.
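Since the wages data has no missing values, the counts above are all zero. The sketch below uses a small made-up frame with one NaN to show the mask, the per-column count, and how to pull out the affected rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"earn": [100.0, np.nan, 300.0], "age": [20, 30, 40]})

null_mask = frame.isnull()      # same shape as frame, True where a value is missing
per_column = null_mask.sum()    # True counts as 1
rows_with_null = frame[null_mask.any(axis=1)]

print(per_column.to_dict())           # {'earn': 1, 'age': 0}
print(rows_with_null.index.tolist())  # [1]
```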

sort_values¶

Sorts the data by column values.

df.sort_values(["age", "earn"], ascending=True).head()

ascending=True sorts in ascending order.
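`ascending` also accepts a list, one entry per sort column. A minimal sketch on made-up data, sorting by age ascending and breaking ties by earn descending:

```python
import pandas as pd

frame = pd.DataFrame({"age": [22, 30, 22], "earn": [500.0, 100.0, 50.0]})

# Sort by age ascending, then by earn descending within equal ages.
ordered = frame.sort_values(["age", "earn"], ascending=[True, False])

print(ordered.index.tolist())  # [0, 2, 1]
```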

Correlation & covariance¶

Functions for computing correlation coefficients and covariance:
corr, cov, corrwith

df.age.corr(df.earn)

0.07400349177836055

df.age.cov(df.earn)

36523.6992104089
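The two numbers are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on a small made-up frame (the values are illustrative, not taken from the wages data).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"age": [22, 35, 48, 60], "earn": [10.0, 40.0, 35.0, 80.0]})

cov = frame["age"].cov(frame["earn"])
corr = frame["age"].corr(frame["earn"])

# Pearson correlation = covariance / (std of x * std of y).
manual = cov / (frame["age"].std() * frame["earn"].std())

print(np.isclose(corr, manual))  # True
```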

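# numeric_only (pandas >= 1.5) skips the string columns; pandas 2.x raises on them otherwise
df.corrwith(df.earn, numeric_only=True)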

earn      1.000000
height    0.291600
ed        0.350374
age       0.074003
dtype: float64

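# numeric_only (pandas >= 1.5) skips the string columns; pandas 2.x raises on them otherwise
df.corr(numeric_only=True)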

Output of df_info.apply(f):

	earn	height	age
min	-98.580489	57.34000	22.000000
max	317949.127955	77.21000	95.000000
mean	32446.292622	66.59264	45.328499
null	0.000000	0.00000	0.000000

Output of df_info.applymap(f).head(5) with f = lambda x: x // 2:

	earn	height	age
0	39785.0	36.0	24
1	48198.0	33.0	31
2	24355.0	31.0	16
3	40239.0	31.0	47
4	41044.0	31.0	21

Output of df.describe():

	earn	height	ed	age
count	1379.000000	1379.000000	1379.000000	1379.000000
mean	32446.292622	66.592640	13.354605	45.328499
std	31257.070006	3.818108	2.438741	15.789715
min	-98.580489	57.340000	3.000000	22.000000
25%	10538.790721	63.720000	12.000000	33.000000
50%	26877.870178	66.050000	13.000000	42.000000
75%	44506.215336	69.315000	15.000000	55.000000
max	317949.127955	77.210000	18.000000	95.000000

Output of df.isnull().head():

	earn	height	sex	race	ed	age
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	False	False	False	False	False	False
4	False	False	False	False	False	False

Output of df (the DataFrame built from raw_data):

	first_name	last_name	age	city
0	Jason	Miller	42	San Francisco
1	Molly	Jacobson	52	Baltimore
2	Tina	Ali	36	Miami
3	Jake	Milner	24	Douglas
4	Amy	Cooze	73	Boston

Output of df.head(2).T:

	0	1
first_name	Jason	Molly
last_name	Miller	Jacobson
age	42	52
city	San Francisco	Baltimore

Output of df.head() after reading wages.csv:

	earn	height	sex	race	ed	age
0	79571.299011	73.89	male	white	16	49
1	96396.988643	66.23	female	white	16	62
2	48710.666947	63.77	female	white	16	33
3	80478.096153	63.22	female	other	16	95
4	82089.345498	63.08	female	white	17	43

Output of df.sort_values(["age", "earn"], ascending=True).head():

	earn	height	sex	race	ed	age
1038	-56.321979	67.81	male	hispanic	10	22
800	-27.876819	72.29	male	white	12	22
963	-25.655260	68.90	male	white	12	22
1105	988.565070	64.71	female	white	12	22
801	1000.221504	64.09	female	white	12	22

Output of df.corr():

	earn	height	ed	age
earn	1.000000	0.291600	0.350374	0.074003
height	0.291600	1.000000	0.114047	-0.133727
ed	0.350374	0.114047	1.000000	-0.129802
age	0.074003	-0.133727	-0.129802	1.000000

Output of (df1+df2).fillna(0), columns a to c:

	a	b	c
0	0.0	2.0	4.0
1	7.0	9.0	11.0
2	14.0	16.0	18.0
3	0.0	0.0	0.0