A General Introduction to Machine Learning

Essential Libraries and Tools

NumPy

import numpy as np
x=np.array([[1,2,3],[4,5,6]])
print(x)

[[1 2 3]
 [4 5 6]]

SciPy

from scipy import sparse

# Create a 2D NumPy array with ones on the diagonal and zeros elsewhere.
eye=np.eye(4)
print(eye)

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

# Convert the NumPy array to a SciPy sparse matrix in CSR format.
# Only the nonzero elements are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy CSR matrix:\n", sparse_matrix)

SciPy CSR matrix:
   (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0

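CSR (Compressed Sparse Row) is one of several data structures for storing sparse matrices.
CSR stores the nonzero values in row order, together with their column indices and an array of row pointers that marks where each row begins.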
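
To make the row-wise compression concrete, here is a minimal sketch that inspects the three arrays a SciPy CSR matrix keeps internally. The variable name eye_csr is chosen only for illustration; data, indices, and indptr are attributes of scipy.sparse.csr_matrix.

```python
import numpy as np
from scipy import sparse

# Same 4x4 identity matrix as in the example above.
eye_csr = sparse.csr_matrix(np.eye(4))

# Nonzero values, stored row by row.
print(eye_csr.data)     # [1. 1. 1. 1.]
# Column index of each stored value.
print(eye_csr.indices)  # [0 1 2 3]
# Row pointers: row i owns data[indptr[i]:indptr[i + 1]].
print(eye_csr.indptr)   # [0 1 2 3 4]
```

Each row of the identity matrix holds exactly one nonzero value, so the row pointers advance by one per row.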

# Example: build the same sparse matrix as above using the COO format.
data= np.ones(4)
row_indices=np.arange(4)
col_indices=np.arange(4)
eye_coo=sparse.coo_matrix((data,(row_indices,col_indices)))
print(eye_coo)

  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0

COO (coordinate format) stores a list of (row, column, value) tuples.
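
As a small illustration of that layout, the sketch below rebuilds the COO matrix and reads the (row, column, value) triples back from its row, col, and data attributes. The names idx and triples are chosen for illustration only.

```python
import numpy as np
from scipy import sparse

data = np.ones(4)
idx = np.arange(4)
eye_coo = sparse.coo_matrix((data, (idx, idx)))

# COO keeps three parallel arrays; zipping them gives the stored triples.
triples = list(zip(eye_coo.row.tolist(), eye_coo.col.tolist(), eye_coo.data.tolist()))
print(triples)  # [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0), (3, 3, 1.0)]

# Converting back to a dense array recovers the identity matrix.
print(np.array_equal(eye_coo.toarray(), np.eye(4)))  # True
```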

Classifying Iris Species

from sklearn.datasets import load_iris

# Load the iris dataset.
iris_dataset=load_iris()

print('Keys of iris_dataset:', iris_dataset.keys(), sep='\n')

Keys of iris_dataset:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

print(iris_dataset['DESCR'][:193]+'\n...')

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...

The DESCR key holds a brief description of the dataset.

print('Target names:\n', iris_dataset['target_names'])

Target names:
 ['setosa' 'versicolor' 'virginica']

print('Feature names:\n', iris_dataset['feature_names'])

Feature names:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print('Type of data:\n', type(iris_dataset['data']))

Type of data:
 <class 'numpy.ndarray'>

print('Shape of data:\n', iris_dataset['data'].shape)

Shape of data:
 (150, 4)

Splitting the Data

from sklearn.model_selection import train_test_split

# Split the data into training and test sets.
X_train,X_test,y_train,y_test=train_test_split(
    iris_dataset['data'],iris_dataset['target'],random_state=0)

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)

print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)

By default, train_test_split shuffles the data and puts 75% of the samples in the training set and the remaining 25% in the test set.
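
The 75/25 proportion can also be written out explicitly. The sketch below passes test_size=0.25, the documented default of train_test_split, and should give the same shapes as above; the shortened variable names (X_tr, X_te, and so on) are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.25 spells out the default split proportion.
# 150 * 0.25 = 37.5, which is rounded up to 38 test samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris['data'], iris['target'], test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (112, 4) (38, 4)
```

Fixing random_state makes the shuffle reproducible, so repeated runs produce the same split.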

Exploring the Data

import pandas as pd
import mglearn

# Build a DataFrame from X_train.
# Column names come from the strings in iris_dataset.feature_names.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# Draw a scatter matrix from the DataFrame, with points colored by y_train.
# alpha is the point opacity and must lie between 0 and 1.
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

[Figure: 4x4 scatter matrix of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm) for the training set, with points colored by class label]

Plots like this can reveal abnormal or unusual values.
During data exploration, such suspicious values are inspected and removed where appropriate.
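
One common way to flag unusual values numerically is the interquartile range (IQR) rule. The post does not say which method it uses, so the following is only a sketch under that assumption: the function name iqr_outlier_mask, the multiplier k=1.5, and the sample values are all hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]. The choice k=1.5 is a convention, not a rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical measurements with one obviously suspicious entry.
sample = np.array([5.1, 4.9, 5.0, 5.2, 4.8, 9.9])
mask = iqr_outlier_mask(sample)

print(mask)           # [False False False False False  True]
print(sample[~mask])  # values kept after dropping the flagged entry
```

A flagged value is only a candidate for removal; whether it is a measurement error or a genuine observation still has to be judged from the data.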

A First Machine Learning Model: k-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

KNeighborsClassifier(n_neighbors=1)

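In the k-nearest neighbors (KNN) algorithm, k is the number of closest training points (neighbors) looked up for each new data point.
The most frequent class among those k neighbors is used as the prediction. With n_neighbors=1, as here, the prediction is simply the class of the single closest training point.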
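
The voting procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy, not scikit-learn's implementation: the function knn_predict and the toy arrays are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest
    training points (Euclidean distance). Ties go to the smallest label."""
    # Distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count how often each class label occurs among the neighbors.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data: two points of class 0 near (1, 1), three of class 1 near (5, 5).
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_far = knn_predict(X_toy, y_toy, np.array([4.5, 4.5]), k=3)
pred_near = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=1)
print(pred_far, pred_near)  # 1 0
```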

Making Predictions

# A new data point: one sample with four features.
X_new=np.array([[5,2.9,1,0.2]])
print('X_new.shape:',X_new.shape)

X_new.shape: (1, 4)

prediction = knn.predict(X_new)
print('Prediction:', prediction)
print('Predicted target name:', iris_dataset['target_names'][prediction])

Prediction: [0]
Predicted target name: ['setosa']

Evaluating the Model

y_pred = knn.predict(X_test)
print('Test set predictions:\n', y_pred)

# Test set accuracy, method 1: fraction of predictions that match the labels
print('Test set accuracy: {:.2f}'.format(np.mean(y_pred == y_test)))

# Test set accuracy, method 2: the classifier's score method
print('Test set accuracy: {:.2f}'.format(knn.score(X_test, y_test)))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
Test set accuracy: 0.97
Test set accuracy: 0.97

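An accuracy of 0.97 means the model correctly predicted the species of 97% of the irises in the test set (37 of the 38 samples).
The score method computes this accuracy directly, so it can be used to evaluate the model.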
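
The accuracy arithmetic is easy to check by hand on a small case. In the sketch below, the label arrays y_true and y_hat are hypothetical: four of five predictions match, so the accuracy is 4/5.

```python
import numpy as np

# Hypothetical true labels and predictions for five samples.
y_true = np.array([0, 1, 2, 1, 0])
y_hat = np.array([0, 1, 2, 2, 0])

# Comparing element-wise gives booleans; their mean is the fraction correct.
accuracy = np.mean(y_true == y_hat)
print('Accuracy: {:.2f}'.format(accuracy))  # Accuracy: 0.80
```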

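References

Introduction to Machine Learning with Python (Korean edition: 파이썬 라이브러리를 활용한 머신러닝), Andreas Müller and Sarah Guido, translated by 박해선 (Haesun Park), 한빛미디어 (Hanbit Media)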

개발자CuCu
