Python으로 UTF8 CSV 파일 읽기

code

Python으로 UTF8 CSV 파일 읽기

codestyles 2020. 9. 9. 08:02

Python으로 UTF8 CSV 파일 읽기

Python (프랑스어 및 / 또는 스페인어 문자 만 해당)으로 악센트 부호가있는 문자가있는 CSV 파일을 읽으려고합니다. csvreader ( http://docs.python.org/library/csv.html ) 에 대한 Python 2.5 문서를 기반으로 csvreader가 ASCII 만 지원하므로 CSV 파일을 읽기 위해 다음 코드를 작성했습니다.

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'output.csv'
reader = unicode_csv_reader(open(filename))
try:
    products = []
    for field1, field2, field3 in reader:
        ...

아래는 내가 읽으려고하는 CSV 파일의 발췌 본입니다.

0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
...

UTF-8로 인코딩 / 디코딩을 시도해도 여전히 다음 예외가 발생합니다.

Traceback (most recent call last):
  File ".\Test.py", line 53, in <module>
    for field1, field2, field3 in reader:
  File ".\Test.py", line 40, in unicode_csv_reader
    for row in csv_reader:
  File ".\Test.py", line 46, in utf_8_encoder
    yield line.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)

이 문제를 어떻게 해결합니까?

이 .encode메서드는 유니 코드 문자열에 적용되어 바이트 문자열을 만듭니다. 하지만 대신 바이트 문자열로 호출하고 있습니다. codecs표준 라이브러리 의 모듈을 살펴보고 codecs.open특히 UTF-8로 인코딩 된 텍스트 파일을 읽기위한 더 나은 일반 솔루션을 찾으십시오 . 그러나 csv특히 모듈의 경우 utf-8 데이터를 전달해야하며 이것이 이미 얻은 것이므로 코드가 훨씬 간단해질 수 있습니다.

import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
  print field1, field2, field3

PS: if it turns out that your input data is NOT in utf-8, but e.g. in ISO-8859-1, then you do need a "transcoding" (if you're keen on using utf-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8') -- but probably you can just use the name of your existing encoding in the yield line in my code above, instead of 'utf-8', as csv is actually going to be just fine with ISO-8859-* encoded bytestrings.

Python 2.X

There is a unicode-csv library which should solve your problems, with added benefit of not naving to write any new csv-related code.

Here is a example from their readme:

>>> import unicodecsv
>>> from cStringIO import StringIO
>>> f = StringIO()
>>> w = unicodecsv.writer(f, encoding='utf-8')
>>> w.writerow((u'é', u'ñ'))
>>> f.seek(0)
>>> r = unicodecsv.reader(f, encoding='utf-8')
>>> row = r.next()
>>> print row[0], row[1]
é ñ

Python 3.X

In python 3 this is supported out of the box by the build-in csv module. See this example:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

If you want to read a CSV File with encoding utf-8, a minimalistic approach that I recommend you is to use something like this:

        with open(file_name, encoding="utf8") as csv_file:

With that statement, you can use later a CSV reader to work with.

Also checkout the answer in this post: https://stackoverflow.com/a/9347871/1338557

It suggests use of library called ucsv.py. Short and simple replacement for CSV written to address the encoding problem(utf-8) for Python 2.7. Also provides support for csv.DictReader

Edit: Adding sample code that I used:

import ucsv as csv

#Read CSV file containing the right tags to produce
fileObj = open('awol_title_strings.csv', 'rb')
dictReader = csv.DictReader(fileObj, fieldnames = ['titles', 'tags'], delimiter = ',', quotechar = '"')
#Build a dictionary from the CSV file-> {<string>:<tags to produce>}
titleStringsDict = dict()
for row in dictReader:
    titleStringsDict.update({unicode(row['titles']):unicode(row['tags'])})

Using codecs.open as Alex Martelli suggested proved to be useful to me.

import codecs

delimiter = ';'
reader = codecs.open("your_filename.csv", 'r', encoding='utf-8')
for line in reader:
    row = line.split(delimiter)
    # do something with your row ...

The link to the help page is the same for python 2.6 and as far as I know there was no change in the csv module since 2.5 (besides bug fixes). Here is the code that just works without any encoding/decoding (file da.csv contains the same data as the variable data). I assume that your file should be read correctly without any conversions.

test.py:

## -*- coding: utf-8 -*-
#
# NOTE: this first line is important for the version b) read from a string(unicode) variable
#

import csv

data = \
"""0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert"""

# a) read from a file
print 'reading from a file:'
for (f1, f2, f3) in csv.reader(open('da.csv'), dialect=csv.excel):
    print (f1, f2, f3)

# b) read from a string(unicode) variable
print 'reading from a list of strings:'
reader = csv.reader(data.split('\n'), dialect=csv.excel)
for (f1, f2, f3) in reader:
    print (f1, f2, f3)

da.csv:

0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert

Looking at the Latin-1 unicode table, I see the character code 00E9 "LATIN SMALL LETTER E WITH ACUTE". This is the accented character in your sample data. A simple test in Python shows that UTF-8 encoding for this character is different from the unicode (almost UTF-16) encoding.

>>> u'\u00e9'
u'\xe9'
>>> u'\u00e9'.encode('utf-8')
'\xc3\xa9'
>>>

I suggest you try to encode("UTF-8") the unicode data before calling the special unicode_csv_reader(). Simply reading the data from a file might hide the encoding, so check the actual character values.

Had the same problem on another server, but realized that locales are messed.

export LC_ALL="en_US.UTF-8"

fixed the problem

참고URL : https://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python

'code' 카테고리의 다른 글

django 1.5-정적 태그 내에서 변수를 사용하는 방법 (0)	2020.09.09
$ scope와 $ rootScope의 차이점 (0)	2020.09.09
jQuery : 선택한 파일 이름 가져 오기 (0)	2020.09.09
상수 목록의 인라인 인스턴스화 (0)	2020.09.09
.NET 콘솔에서 JSON WebService를 호출하는 가장 좋은 방법 (0)	2020.09.09

현재글Python으로 UTF8 CSV 파일 읽기

codestyle

Python으로 UTF8 CSV 파일 읽기

Python으로 UTF8 CSV 파일 읽기

Python 2.X

Python 3.X

'code' 카테고리의 다른 글

'code'의 다른글

티스토리툴바

Python으로 UTF8 CSV 파일 읽기

Python으로 UTF8 CSV 파일 읽기

Python 2.X

Python 3.X

'code' 카테고리의 다른 글

'code'의 다른글

관련글

티스토리툴바