Pandas read_csv and UTF-16

Views: 1119 | Published: 2017/2/24 18:29:22 | Tags: csv, python-2.7, pandas, utf-16


Problem description

I have a CSV text file encoded in UTF-16 (so as to preserve Unicode characters when others use Excel) but when doing a read_csv with Pandas 0.9.0, I get this cryptic error:
df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
df.head()

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-18-85da1383cd9e> in <module>()
----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header=0)
  2 df.head()

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze, **kwds)
248         kdict['delimiter'] = sep
249 
--> 250     return _read(TextParser, filepath_or_buffer, kdict)
251 
252 @Appender(_read_table_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
198         return parser
199 
--> 200     return parser.get_chunk()
201 
202 @Appender(_read_csv_doc)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
853         elif not self._has_complex_date_col:
854             index = self._get_simple_index(alldata, columns)
--> 855             index = self._agg_index(index)
856 
857         elif self._has_complex_date_col:

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _agg_index(self, index, try_parse_dates)
980                 arr, _ = _convert_types(arr, col_na_values)
981                 arrays.append(arr)
--> 982             index = MultiIndex.from_arrays(arrays, names=self.index_name)
983         return index
984 

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1570 
1571         return MultiIndex(levels=levels, labels=labels,
-> 1572                           sortorder=sortorder, names=names)
1573 
1574     @classmethod

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
1254         assert(len(levels) == len(labels))
1255         if len(levels) == 0:
-> 1256             raise Exception('Must pass non-zero number of levels/labels')
1257 
1258         if len(levels) == 1:

Exception: Must pass non-zero number of levels/labels
Reading the data line by line with csv.reader, based on this example, suggests that my data is correctly formatted:
from io import BytesIO
import csv

with open('data.txt','rb') as f:
    r = f.read().decode('utf-16').encode('utf-8')
    for l in csv.reader(BytesIO(r),delimiter='\t'):
        print l

['Country', 'State/City', 'Title', 'Date', 'Catalogue', 'Wikipedia Election Page', 'Wikipedia Individual Page', 'Electoral Institution in Country', 'Twitter', 'CANDIDATE NAME 1', 'CANDIDATE NAME 2']
['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
['Venezuela', 'N/A', 'President', '10/7/12', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles Radonski', 'Henrique Capriles R.', 'Henrique Capriles', '']
Is there some pre-processing, an additional option in read_csv, or something else that needs to be done before pandas.read_csv can read a UTF-16 file? Thanks!
Solution
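This is a bug; I think it happens because the csv reader passes back an extra empty line at the beginning. It worked for me on Python 2.7.3 and pandas 0.9.1 when I decode the file contents and re-encode them as UTF-8 first (fh being the open file handle):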
In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
Out[36]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns:
Country                             43  non-null values
State/City                          43  non-null values
Title                               43  non-null values
Date                                43  non-null values
Catalogue                           43  non-null values
Wikipedia Election Page             43  non-null values
Wikipedia Individual Page           43  non-null values
Electoral Institution in Country    43  non-null values
Twitter                             43  non-null values
CANDIDATE NAME 1                    43  non-null values
CANDIDATE NAME 2                    16  non-null values
dtypes: object(11)
I reported the bug here: https://github.com/pydata/pandas/issues/2418
On GitHub master it unfortunately causes a segfault in the C parser. We'll fix it.
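
For readers on Python 3, the transcoding workaround above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name utf16_to_utf8_buffer and the two-column sample data are hypothetical stand-ins, not part of the original thread, and the check uses the standard csv module rather than pandas.

```python
import csv
import io


def utf16_to_utf8_buffer(raw_bytes):
    """Decode UTF-16 bytes and return a BytesIO holding the UTF-8 equivalent.

    Hypothetical helper for illustration; not a pandas API.
    """
    # The 'utf-16' codec uses the byte order mark (BOM) to pick the
    # endianness and drops it from the decoded text.
    text = raw_bytes.decode("utf-16")
    return io.BytesIO(text.encode("utf-8"))


# Hypothetical sample standing in for the contents of data.txt.
sample = "Country\tTitle\nVenezuela\tPresident\n".encode("utf-16")

utf8_buffer = utf16_to_utf8_buffer(sample)

# Sanity-check the transcoded bytes without moving the buffer position.
check = io.StringIO(utf8_buffer.getvalue().decode("utf-8"), newline="")
rows = list(csv.reader(check, delimiter="\t"))
print(rows)
```

With pandas installed, utf8_buffer could then be handed to pd.read_csv(utf8_buffer, sep='\t', header=0), mirroring the call shown above; whether that sidesteps the bug depends on the pandas version in use.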

Now, interestingly: http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)
