Pandas read_csv and UTF-16
Question
I have a CSV text file encoded in UTF-16 (so as to preserve Unicode characters when others use Excel) but when doing a read_csv with Pandas 0.9.0, I get this cryptic error:
    df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
    df.head()

    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    <ipython-input-18-85da1383cd9e> in <module>()
    ----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header=0)
          2 df.head()

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze, **kwds)
        248         kdict['delimiter'] = sep
        249
    --> 250     return _read(TextParser, filepath_or_buffer, kdict)
        251
        252 @Appender(_read_table_doc)

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
        198         return parser
        199
    --> 200     return parser.get_chunk()
        201
        202 @Appender(_read_csv_doc)

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
        853         elif not self._has_complex_date_col:
        854             index = self._get_simple_index(alldata, columns)
    --> 855             index = self._agg_index(index)
        856
        857         elif self._has_complex_date_col:

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _agg_index(self, index, try_parse_dates)
        980             arr, _ = _convert_types(arr, col_na_values)
        981             arrays.append(arr)
    --> 982         index = MultiIndex.from_arrays(arrays, names=self.index_name)
        983         return index
        984

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
       1570
       1571         return MultiIndex(levels=levels, labels=labels,
    -> 1572                           sortorder=sortorder, names=names)
       1573
       1574     @classmethod

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
       1254         assert(len(levels) == len(labels))
       1255         if len(levels) == 0:
    -> 1256             raise Exception('Must pass non-zero number of levels/labels')
       1257
       1258         if len(levels) == 1:

    Exception: Must pass non-zero number of levels/labels
Reading the data in line-by-line with csv.reader based on this example implies that my data is not incorrectly formatted:
    from io import BytesIO
    import csv

    with open('data.txt','rb') as f:
        r = f.read().decode('utf-16').encode('utf-8')
        for l in csv.reader(BytesIO(r),delimiter='\t'):
            print l

    ['Country', 'State/City', 'Title', 'Date', 'Catalogue', 'Wikipedia Election Page', 'Wikipedia Individual Page', 'Electoral Institution in Country', 'Twitter', 'CANDIDATE NAME 1', 'CANDIDATE NAME 2']
    ['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
    ['Venezuela', 'N/A', 'President', '10/7/12', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles R.', 'Henrique Capriles', '']
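For readers on Python 3, the same line-by-line check can be sketched as below. This is only an illustration under stated assumptions: the helper name `read_utf16_rows` and the two-row sample are hypothetical stand-ins for the real `data.txt`, and the input is assumed to start with a byte order mark, which the `utf-16` codec consumes while decoding.

```python
import csv
import io


def read_utf16_rows(raw_bytes, delimiter="\t"):
    """Decode UTF-16 bytes (BOM expected) and parse them as delimited text.

    Hypothetical helper for illustration; not part of pandas or the csv module.
    """
    # The "utf-16" codec reads the byte order mark and strips it from the result.
    text = raw_bytes.decode("utf-16")
    # newline="" leaves line endings untouched so the csv module can handle them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Hypothetical sample standing in for the contents of data.txt.
sample = "Country\tTitle\nVenezuela\tPresident\n".encode("utf-16")
rows = read_utf16_rows(sample)
for row in rows:
    print(row)
```

In Python 3 the csv module works on text rather than bytes, so there is no need to re-encode to UTF-8 after decoding.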
Is there some pre-processing, an additional option in read_csv, or something else that needs to be done before pandas.read_csv can read a UTF-16 file? Thanks!
Solution
This is a bug, I think because the csv reader was passing back an extra empty line at the beginning. It worked for me on Python 2.7.3 and pandas 0.9.1 when I did:
    In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
    Out[36]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 50 entries, 0 to 49
    Data columns:
    Country                             43  non-null values
    State/City                          43  non-null values
    Title                               43  non-null values
    Date                                43  non-null values
    Catalogue                           43  non-null values
    Wikipedia Election Page             43  non-null values
    Wikipedia Individual Page           43  non-null values
    Electoral Institution in Country    43  non-null values
    Twitter                             43  non-null values
    CANDIDATE NAME 1                    43  non-null values
    CANDIDATE NAME 2                    16  non-null values
    dtypes: object(11)
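The same decode-first idea can be sketched for Python 3, where the decoded text can be handed to pandas through `io.StringIO` without re-encoding. This is a minimal sketch, not the fix that went into pandas: the helper name `read_utf16_tsv` and the sample rows are hypothetical, and it assumes the whole file fits in memory.

```python
import io

import pandas as pd


def read_utf16_tsv(raw_bytes):
    """Decode UTF-16 bytes up front, then let pandas parse plain text.

    Hypothetical helper for illustration only.
    """
    text = raw_bytes.decode("utf-16")  # the BOM is consumed by the codec
    return pd.read_csv(io.StringIO(text), sep="\t", header=0)


# Hypothetical sample standing in for the real file contents.
sample = (
    "Country\tCANDIDATE NAME 1\n"
    "Venezuela\tHugo Chávez\n"
    "Venezuela\tHenrique Capriles\n"
).encode("utf-16")

df = read_utf16_tsv(sample)
print(df.shape)
```

Because the parser only ever sees already-decoded text, this route does not depend on how a given pandas version handles the `encoding` argument.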
I reported the bug here: https://github.com/pydata/pandas/issues/2418
On GitHub master it unfortunately causes a segfault in the C parser. We'll fix it.
Now, interestingly: http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)