UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte, while reading csv file in pandas
Problem description
I know similar questions have already been asked; I have read all of them and tried their suggestions, but they were of little help. I am using OS X 10.11 El Capitan and Python 3.6 in a virtual environment (I also tried without one). I am using Jupyter Notebook and Spyder 3.
I am new to Python but know basic ML, and I am following a post to learn how to solve Kaggle challenges: Link to Blog, Link to Data Set
I am stuck at the first few lines of code:
import pandas as pd
destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
This gives me the error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-19-a928a98eb1ff> in <module>()
1 import pandas as pd
----> 2 df = pd.read_csv('destinations.csv', compression='infer',date_parser=True, usecols=([0,1,3]))
3 df.head()
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
403
404 # Create the parser.
--> 405 parser = TextFileReader(filepath_or_buffer, **kwds)
406
407 if chunksize or iterator:
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
762 self.options['has_index_names'] = kwds['has_index_names']
763
--> 764 self._make_engine(self.engine)
765
766 def close(self):
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
983 def _make_engine(self, engine='c'):
984 if engine == 'c':
--> 985 self._engine = CParserWrapper(self.f, **self.options)
986 else:
987 if engine == 'python':
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1603 kwds['allow_leading_cols'] = self.index_col is not False
1604
-> 1605 self._reader = parsers.TextReader(src, **kwds)
1606
1607 # XXX
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:6175)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header (pandas/_libs/parsers.c:9691)()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Some answers on Stack Overflow suggested that this happens because the file is gzipped, but Chrome downloaded the file as .csv; there was no .csv.gz anywhere, and trying to read that name returned a file-not-found error.
I then read somewhere to use encoding='latin1', but after doing this I am getting a parser error:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-21-f9c451f864a2> in <module>()
1 import pandas as pd
2
----> 3 destinations = pd.read_csv("destinations.csv",encoding='latin1')
4 test = pd.read_csv("test.csv")
5 train = pd.read_csv("train.csv")
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
653 skip_blank_lines=skip_blank_lines)
654
--> 655 return _read(filepath_or_buffer, kwds)
656
657 parser_f.__name__ = name
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
409
410 try:
--> 411 data = parser.read(nrows)
412 finally:
413 parser.close()
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
1003 raise ValueError('skipfooter not supported for iteration')
1004
-> 1005 ret = self._engine.read(nrows)
1006
1007 if self.options.get('as_recarray'):
/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
1746 def read(self, nrows=None):
1747 try:
-> 1748 data = self._reader.read(nrows)
1749 except StopIteration:
1750 if self._first_chunk:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
I have spent hours debugging this. I tried opening the CSV files in Atom (no other app could open them) and in online web apps (some of which crashed), but nothing helped. I have also tried using the kernels of other people who have solved the problem, to no avail.
Recommended answer
It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.
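The match is quite specific: 0x1f at position 0 is a valid one-byte UTF-8 character, so decoding only fails at position 1, where 0x8b has the bit pattern of a continuation byte and cannot start a character. One way to confirm the diagnosis before changing any pandas code is to look at the first two bytes of the file. The sketch below is only an illustration: `looks_gzipped` is a hypothetical helper name, and the two small files it checks are made-up stand-ins for the real download.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fd:
        return fd.read(2) == GZIP_MAGIC


# Made-up demo files: one gzipped but named .csv, one genuinely plain text.
demo_dir = tempfile.mkdtemp()
gz_path = os.path.join(demo_dir, "gzipped_but_named.csv")
plain_path = os.path.join(demo_dir, "plain.csv")

with gzip.open(gz_path, "wb") as f:
    f.write(b"a,b\n1,2\n")
with open(plain_path, "wb") as f:
    f.write(b"a,b\n1,2\n")

is_gz = looks_gzipped(gz_path)
is_plain_gz = looks_gzipped(plain_path)
print(is_gz, is_plain_gz)
```

Running `looks_gzipped('destinations.csv')` on the downloaded file should tell you whether the .csv extension is misleading.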
You could try decompressing the data on the fly:
import gzip
import pandas as pd
# Open the raw bytes and let gzip decompress them as pandas reads.
with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)
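An alternative that may be simpler is to tell pandas about the compression directly. With the default `compression='infer'`, `read_csv` decides from the file extension, so a gzipped file named .csv is read as plain text; passing `compression='gzip'` overrides that. The following is a minimal sketch, assuming the file really is gzipped. It builds a small made-up file rather than using the Kaggle data, so the column names and values are placeholders.

```python
import gzip
import os
import tempfile

import pandas as pd

# Create a made-up gzipped file that carries a plain .csv extension.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "destinations.csv")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("a,b\n1,2\n3,4\n")

# Name the compression explicitly, because the extension alone
# would not lead pandas to decompress the file.
df = pd.read_csv(path, compression="gzip")
print(df)
```

For the files in the question, the equivalent call would be `pd.read_csv("destinations.csv", compression="gzip")`, and likewise for test.csv and train.csv if they turn out to be gzipped as well.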