Python data handling¶

CSV¶

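CSV is a text file format in which fields are separated by commas (,).
It is easiest to think of it as a data format for using
Excel-style tabular data independently of any particular program.
Fields can also be separated by tabs (TSV) or spaces (SSV).
These variants are collectively called character-separated values (CSV).
In Excel, a CSV file can be produced with the "Save As" feature.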

Reading a CSV file¶

line_counter = 0    # counts the total number of lines in the file
data_header = []    # list holding the field names
customer_list = []  # list holding one list per customer

with open("./data/customers.csv") as customer_data:  # open customers.csv as the file object customer_data

    while True:
        data = customer_data.readline()  # read customers.csv one line at a time into data
        if not data:
            break  # stop the loop when no data is left
        if line_counter == 0:  # the first line holds the field names
            data_header = data.split(",")  # store the field names in data_header, split on ","
        else:
            customer_list.append(data.split(","))  # store each data row in customer_list, split on ","
        line_counter += 1

print("Header :\t", data_header)  # print the field names
for i in range(0, 10):  # print the data (first 10 rows only)
    print("Data", i, ":\t\t", customer_list[i])
print(len(customer_list))  # print the total number of data rows

Header :	 ['customerNumber', 'customerName', 'contactLastName', 'contactFirstName', 'phone', 'addressLine1', 'addressLine2', 'city', 'state', 'postalCode', 'country', 'salesRepEmployeeNumber', 'creditLimit\n']
Data 0 :		 ['103', '"Atelier graphique"', 'Schmitt', '"Carine "', '40.32.2555', '"54', ' rue Royale"', 'NULL', 'Nantes', 'NULL', '44000', 'France', '1370', '21000\n']
Data 1 :		 ['112', '"Signal Gift Stores"', 'King', 'Jean', '7025551838', '"8489 Strong St."', 'NULL', '"Las Vegas"', 'NV', '83030', 'USA', '1166', '71800\n']
Data 2 :		 ['114', '"Australian Collectors', ' Co."', 'Ferguson', 'Peter', '"03 9520 4555"', '"636 St Kilda Road"', '"Level 3"', 'Melbourne', 'Victoria', '3004', 'Australia', '1611', '117300\n']
Data 3 :		 ['119', '"La Rochelle Gifts"', 'Labrune', '"Janine "', '40.67.8555', '"67', ' rue des Cinquante Otages"', 'NULL', 'Nantes', 'NULL', '44000', 'France', '1370', '118200\n']
Data 4 :		 ['121', '"Baane Mini Imports"', 'Bergulfsen', '"Jonas "', '"07-98 9555"', '"Erling Skakkes gate 78"', 'NULL', 'Stavern', 'NULL', '4110', 'Norway', '1504', '81700\n']
Data 5 :		 ['124', '"Mini Gifts Distributors Ltd."', 'Nelson', 'Susan', '4155551450', '"5677 Strong St."', 'NULL', '"San Rafael"', 'CA', '97562', 'USA', '1165', '210500\n']
Data 6 :		 ['125', '"Havel & Zbyszek Co"', 'Piestrzeniewicz', '"Zbyszek "', '"(26) 642-7555"', '"ul. Filtrowa 68"', 'NULL', 'Warszawa', 'NULL', '01-012', 'Poland', 'NULL', '0\n']
Data 7 :		 ['128', '"Blauer See Auto', ' Co."', 'Keitel', 'Roland', '"+49 69 66 90 2555"', '"Lyonerstr. 34"', 'NULL', 'Frankfurt', 'NULL', '60528', 'Germany', '1504', '59700\n']
Data 8 :		 ['129', '"Mini Wheels Co."', 'Murphy', 'Julie', '6505555787', '"5557 North Pendale Street"', 'NULL', '"San Francisco"', 'CA', '94217', 'USA', '1165', '64600\n']
Data 9 :		 ['131', '"Land of Toys Inc."', 'Lee', 'Kwai', '2125557818', '"897 Long Airport Avenue"', 'NULL', 'NYC', 'NY', '10022', 'USA', '1323', '114900\n']
122

Writing a CSV file¶

line_counter = 0
data_header = []
employee = []
customer_USA_only_list = []
customer = None

with open ("./data/customers.csv", "r") as customer_data:
    while 1:
        data = customer_data.readline()
        if not data:
            break
        if line_counter==0:
            data_header = data.split(",")
        else:
            customer = data.split(",")
            if customer[10].upper() == "USA":  # value at offset 10 of the customer row
                customer_USA_only_list.append(customer)  # i.e. keep only rows whose country field is "USA" in customer_USA_only_list
        line_counter += 1

print ("Header :\t", data_header)
for i in range(0,10):
    print ("Data :\t\t",customer_USA_only_list[i])
print (len(customer_USA_only_list))

with open ("./data/customers_USA_only.csv", "w") as customer_USA_only_csv:
    for customer in customer_USA_only_list:
        customer_USA_only_csv.write(",".join(customer).strip('\n')+"\n")
        # write each row in customer_USA_only_list to customers_USA_only.csv

Header :	 ['customerNumber', 'customerName', 'contactLastName', 'contactFirstName', 'phone', 'addressLine1', 'addressLine2', 'city', 'state', 'postalCode', 'country', 'salesRepEmployeeNumber', 'creditLimit\n']
Data :		 ['112', '"Signal Gift Stores"', 'King', 'Jean', '7025551838', '"8489 Strong St."', 'NULL', '"Las Vegas"', 'NV', '83030', 'USA', '1166', '71800\n']
Data :		 ['124', '"Mini Gifts Distributors Ltd."', 'Nelson', 'Susan', '4155551450', '"5677 Strong St."', 'NULL', '"San Rafael"', 'CA', '97562', 'USA', '1165', '210500\n']
Data :		 ['129', '"Mini Wheels Co."', 'Murphy', 'Julie', '6505555787', '"5557 North Pendale Street"', 'NULL', '"San Francisco"', 'CA', '94217', 'USA', '1165', '64600\n']
Data :		 ['131', '"Land of Toys Inc."', 'Lee', 'Kwai', '2125557818', '"897 Long Airport Avenue"', 'NULL', 'NYC', 'NY', '10022', 'USA', '1323', '114900\n']
Data :		 ['151', '"Muscle Machine Inc"', 'Young', 'Jeff', '2125557413', '"4092 Furth Circle"', '"Suite 400"', 'NYC', 'NY', '10022', 'USA', '1286', '138500\n']
Data :		 ['157', '"Diecast Classics Inc."', 'Leong', 'Kelvin', '2155551555', '"7586 Pompton St."', 'NULL', 'Allentown', 'PA', '70267', 'USA', '1216', '100600\n']
Data :		 ['161', '"Technics Stores Inc."', 'Hashimoto', 'Juri', '6505556809', '"9408 Furth Circle"', 'NULL', 'Burlingame', 'CA', '94217', 'USA', '1165', '84600\n']
Data :		 ['168', '"American Souvenirs Inc"', 'Franco', 'Keith', '2035557845', '"149 Spinnaker Dr."', '"Suite 101"', '"New Haven"', 'CT', '97823', 'USA', '1286', '0\n']
Data :		 ['173', '"Cambridge Collectables Co."', 'Tseng', 'Jerry', '6175555555', '"4658 Baden Av."', 'NULL', 'Cambridge', 'MA', '51247', 'USA', '1188', '43400\n']
Data :		 ['175', '"Gift Depot Inc."', 'King', 'Julie', '2035552570', '"25593 South Bay Ln."', 'NULL', 'Bridgewater', 'CT', '97562', 'USA', '1323', '84300\n']
34

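There is a problem here: splitting each line on "," also splits commas that appear inside a quoted field (as in '"54', ' rue Royale"' in Data 0 above), which shifts every later field by one position. Handling the file this way therefore requires preprocessing for commas inside field values.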
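The effect can be reproduced without the data file. The following is a minimal sketch, assuming a shortened sample line modeled on the first customer row (the line is illustrative, not read from customers.csv). It compares a plain split with the standard library's csv.reader, which understands quoting.

```python
import csv
import io

# Shortened sample line modeled on the first customer row (illustrative only).
line = '103,"Atelier graphique","54, rue Royale",Nantes,France\n'

# Naive splitting breaks the quoted address into two pieces.
naive_fields = line.split(",")

# csv.reader understands quoting, so the address stays in one field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

With the plain split the country ends up one position later than expected, which is why an index-based filter such as customer[10] can silently miss rows.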

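Using the csv module¶

Python provides the csv module for handling CSV files simply.
Example data: korea_floating_population_data.csv (from http://www.data.go.kr)
The example data describes the floating population of major commercial districts in Korea.
Because the data is in Korean, the file encoding must be handled.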

import csv
seoung_nam_data = []
header = []
rownum = 0

with open("./data/korea_floating_population_data.csv", "r", encoding="cp949", newline="") as p_file:
    csv_data = csv.reader(p_file)  # read the file through a csv reader object
    for row in csv_data:  # process the data one row at a time
        if rownum == 0:
            header = row  # keep the first row separately as the field names
        location = row[7]
        # extract the "행정구역" (administrative district) field; the file is decoded from cp949 when opened
        if location.find(u"성남시") != -1:
            seoung_nam_data.append(row)
        # if the district contains 성남시 (Seongnam-si), add the row to seoung_nam_data
        rownum += 1

with open("./data/seoung_nam_floating_population_data.csv", "w", encoding="utf8", newline="") as s_p_file:
    writer = csv.writer(s_p_file, delimiter='\t', quotechar="'", quoting=csv.QUOTE_ALL)
    # create the csv file with csv.writer; delimiter is the field separator,
    # quotechar is the character wrapped around each field, quoting sets which fields are quoted
    writer.writerow(header)  # write the header row to the file
    for row in seoung_nam_data:
        writer.writerow(row)  # write each row of seoung_nam_data to the file
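To see what delimiter, quotechar, and quoting do to the written text, here is a small sketch that writes to an in-memory buffer instead of a file. The sample row and the render helper are made up for illustration.

```python
import csv
import io

row = ["성남시", "분당구 정자동", "1200"]  # made-up sample row

def render(**writer_options):
    """Write the sample row to an in-memory buffer and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **writer_options).writerow(row)
    return buffer.getvalue()

# Same settings as the writer above: tab-separated, every field wrapped in '.
quote_all = render(delimiter="\t", quotechar="'", quoting=csv.QUOTE_ALL)

# Default quoting (csv.QUOTE_MINIMAL) only quotes fields that need it.
quote_minimal = render(delimiter="\t", quotechar="'")

print(repr(quote_all))
print(repr(quote_minimal))
```

QUOTE_ALL makes the output uniform at the cost of extra characters; the default only adds quotes when a field contains the delimiter, the quote character, or a line break.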

Web¶

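Tags are used to mark up elements such as titles, paragraphs, and links.
Every element is enclosed in angle brackets.
<title> Hello, World </title>  # a title element whose value is "Hello, World"
Every HTML document has a tree-shaped containment structure.
Typically the computer downloads a web page's HTML source file, and the web browser then parses and displays it.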
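The tree-shaped containment can be made visible with the standard library's html.parser. This is a minimal sketch on a made-up page; the TreePrinter class is a hypothetical helper, and it assumes every opened tag is also closed.

```python
from html.parser import HTMLParser

# Made-up page used only for illustration.
SAMPLE_HTML = "<html><head><title>Hello, World</title></head><body><p>Hi</p></body></html>"

class TreePrinter(HTMLParser):
    """Collects each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed(SAMPLE_HTML)
parser.close()

# Indent each tag by its depth to show which element contains which.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```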

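Regular expressions¶

Also called regexp or regex.
A formula written in characters that defines a complex string pattern.
Used to extract the set of strings that follow a particular rule.

References

Regex practice site (https://regexr.com/)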

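Regular expression practice¶

Import the re module to use it: import re
Functions: search finds only the first match, findall finds all matches
findall returns a list of strings when the pattern has one group, and a list of tuples when it has several groups
Exercise: extract only the IDs from a given page https://bit.ly/3rxQFS4
ID pattern: one or more English letters (upper or lower case) or digits, ending in three asterisks
Regex: "([A-Za-z0-9]+\*\*\*)"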

import re
import urllib.request

url = "http://goo.gl/U7mSQl"
html = urllib.request.urlopen(url)
html_contents = str(html.read())      # HTML source code
id_results = re.findall(r"([A-Za-z0-9]+\*\*\*)", html_contents)
# findall returns every match of the pattern

# Look at only 5 samples.
for result in id_results[:5]:
    print (result)

codo***
outb7***
dubba4***
multicuspi***
crownm***
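The difference between search and findall can be tried offline as well. A minimal sketch, assuming a made-up string in the same shape as the masked IDs above:

```python
import re

# Made-up text in the same shape as the masked IDs in the exercise.
text = "writer: abc12*** reply: Xyz*** guest: q9***"
pattern = r"([A-Za-z0-9]+\*\*\*)"

first_match = re.search(pattern, text)   # match object for the first hit, or None
all_matches = re.findall(pattern, text)  # list of every hit

print(first_match.group(1))
print(all_matches)
```

search returns a match object (or None when nothing matches), so the text has to be pulled out with group(); findall returns the matched strings directly.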

import urllib.request  # load the urllib module
import re

url = "http://www.google.com/googlebooks/uspto-patents-grants-text.html"  # target URL
html = urllib.request.urlopen(url)  # open the URL
html_contents = str(html.read().decode("utf8"))  # read the HTML and decode it to a string


# Look at only 5 samples
url_list = re.findall(r"(http)(.+)(zip)", html_contents)
for url in url_list[:5]:
    print("".join(url))  # each match is a tuple of groups; join it into one str

http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip
http://storage.googleapis.com/patents/grant_full_text/2015/ipg150113.zip
http://storage.googleapis.com/patents/grant_full_text/2015/ipg150120.zip
http://storage.googleapis.com/patents/grant_full_text/2015/ipg150127.zip
http://storage.googleapis.com/patents/grant_full_text/2015/ipg150203.zip

This retrieves the URLs that start with http and end with zip.
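One caveat with a pattern like (http)(.+)(zip): the .+ part is greedy, so it works cleanly only when each link sits on its own line. The sketch below uses a made-up line (not taken from the real page) with two links on it and compares the greedy form with the lazy .+? form.

```python
import re

# Made-up single line containing two links (illustrative only).
line = "a http://example.com/a.zip b http://example.com/b.zip"

# Greedy: .+ runs to the LAST "zip" on the line, merging both links.
greedy = ["".join(groups) for groups in re.findall(r"(http)(.+)(zip)", line)]

# Lazy: .+? stops at the FIRST "zip" after each "http".
lazy = ["".join(groups) for groups in re.findall(r"(http)(.+?)(zip)", line)]

print(greedy)
print(lazy)
```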

import urllib.request
import re

url = "http://finance.naver.com/item/main.nhn?code=005930"
html = urllib.request.urlopen(url)
html_contents = str(html.read().decode("ms949"))

# The pattern has 3 groups, so each match is a tuple (group1, group2, group3).
stock_results = re.findall(r'(<dl class="blind">)([\s\S]+?)(</dl>)', html_contents)
samsung_stock = stock_results[0]  # the first match in the list
samsung_index = samsung_stock[1]  # the second of the three tuple values
                                  # each pair of parentheses becomes one tuple index
index_list = re.findall(r"(<dd>)([\s\S]+?)(</dd>)", samsung_index)

for index in index_list:
    print(index[1])  # the second of the three tuple values

2021년 01월 22일 16시 10분 기준 장마감
종목명 삼성전자
종목코드 005930 코스피
현재가 86,800 전일대비 하락 1,300 마이너스 1.48 퍼센트
전일가 88,100
시가 89,000
고가 89,700
상한가 114,500
저가 86,800
하한가 61,700
거래량 30,430,330
거래대금 2,679,425백만

XML¶

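A language that uses tags (markup) to describe
the structure and meaning of data.
Values are written between an opening tag and a closing tag,
so structured information can be expressed.
Its syntax is similar to HTML, and it is a representative data storage format.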
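As a small illustration of values sitting between tags in a nested structure, here is a sketch using the standard library's xml.etree.ElementTree rather than Beautiful Soup. The sample document, including the titles, is made up; it is not the books.xml file.

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration; not the books.xml file.
SAMPLE_XML = """
<books>
    <book>
        <author>Carson</author>
        <title>XML Basics</title>
    </book>
    <book>
        <author>Sungchul</author>
        <title>Data Handling</title>
    </book>
</books>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# Each <book> is a child of <books>; its fields are nested one level deeper.
books = [
    {"author": book.findtext("author"), "title": book.findtext("title")}
    for book in root.findall("book")
]
print(books)
```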

Example 1¶

# import the module
from bs4 import BeautifulSoup

with open("./data/books.xml", "r", encoding="utf8") as books_file:
    books_xml = books_file.read()  # read the file into a string

# create the soup object
soup = BeautifulSoup(books_xml, "lxml")
# extract every author element
for book_info in soup.find_all("author"):
    print(book_info)
    print(book_info.get_text())

<author>Carson</author>
Carson
<author>Sungchul</author>
Sungchul

get_text(): returns the value of the matched element (the text between the opening and closing tags)
https://www.crummy.com/software/BeautifulSoup/

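Example 2¶

The United States Patent and Trademark Office (USPTO) provides patent data as XML.
From that data, we analyze the patent with registration number "08621662",
"Adjustable shoulder device for hard upper torso suit".
The data is extracted from the XML with Beautiful Soup.
Reference: http://www.google.com/patents/US20120260387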

from bs4 import BeautifulSoup

with open("./data/US08621662-20140107.XML", "r", encoding="utf8") as patent_xml:
    xml = patent_xml.read()  # read the file into a string

soup = BeautifulSoup(xml, "xml")  # use the XML parser

invention_title_tag = soup.find("invention-title")
print(invention_title_tag.get_text())

Adjustable shoulder device for hard upper torso suit

This finds the invention title.

publication_reference_tag = soup.find("publication-reference")
p_document_id_tag = publication_reference_tag.find("document-id")
p_country = p_document_id_tag.find("country").get_text()
p_doc_number = p_document_id_tag.find("doc-number").get_text()
p_kind = p_document_id_tag.find("kind").get_text()
p_date = p_document_id_tag.find("date").get_text()

print(p_doc_number, p_kind, p_date)

application_reference_tag = soup.find("application-reference")
a_document_id_tag = application_reference_tag.find("document-id")
a_country = a_document_id_tag.find("country").get_text()
a_doc_number = a_document_id_tag.find("doc-number").get_text()
a_date = a_document_id_tag.find("date").get_text()

print(a_country, a_doc_number, a_date)

08621662 B2 20140107

The second print shows the country, document number, and date read from the application-reference element.

JSON¶

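JavaScript Object Notation
Originally the data object notation of JavaScript, the language of the web
Concise, so both machines and humans can read it easily
Small in size and easy to convert to and from code
For these reasons it is widely used as an alternative to XML
The json module makes parsing and saving easy
Saved and loaded data maps directly to and from the dict type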
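The dict compatibility can be checked with a quick round trip through a string. A minimal sketch with a made-up record:

```python
import json

# Made-up record for illustration.
record = {"name": "Zara", "age": 7, "tags": ["a", "b"], "active": True}

text = json.dumps(record)    # dict -> JSON string
restored = json.loads(text)  # JSON string -> dict

print(text)
print(type(restored), restored == record)
```

Note that Python's True is written as true in the JSON text and converted back on load; dumps/loads work on strings, while dump/load work on file objects.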

Read¶

import json

with open("./data/json_example.json", "r", encoding="utf8") as f:
    contents = f.read()
    json_data =  json.loads(contents)
print(type(json_data))
for employee in json_data["employees"]:
    print(employee)

<class 'dict'>
{'firstName': 'John', 'lastName': 'Doe'}
{'firstName': 'Anna', 'lastName': 'Smith'}
{'firstName': 'Peter', 'lastName': 'Jones'}

Loading the JSON file gives a dict, as the output shows.

Write¶

import json

dict_data = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

with open("./data/data.json", "w") as f:
    json.dump(dict_data, f)

Opening the file shows that the dictionary above has been saved.
