Error reading a CSV file in pandas [CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.]
Problem description
I tried reading all the CSV files in a folder and concatenating them into one large CSV (all the files have the same structure), then saving it and reading it back. All of this was done with pandas. The error occurs on the final read. The code and the error are below.
import pandas as pd
import numpy as np
import glob
path = r'somePath'  # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv("C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv("C:\work\DATA\Raw_data\store.csv", sep=',')
Error:
CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259
--> 260     return parser.read()
    261
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I also tried reading the file with the csv module's reader:
import csv
with open("C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)
Error:
Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Recommended answer
Not an answer, but too long for a comment (to say nothing of code formatting).
Since the file also breaks when you read it with the csv module, you can at least locate the line where the error occurs:
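import csv
# Python 2 style: the csv module reads files opened in binary mode ('rb')
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        # linenumber is the index of the record that failed to parse
        print("Error line %d: %s %s" % (linenumber, str(type(e)), e))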
Then look in store.csv to see what is on that line.
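If you are on Python 3, the same idea can be sketched as below. This is only a rough diagnostic, not a fix: the helper name find_bad_rows is hypothetical, and it assumes the first row of the file is the header and that "malformed" means either a row with a different number of fields than the header or a row that makes the csv module raise csv.Error.

```python
import csv


def find_bad_rows(path, delimiter=","):
    """Return a list of (line_number, detail) for suspicious rows.

    detail is the field count when a row's width differs from the header,
    or the error message when the csv module itself fails to parse.
    find_bad_rows is a hypothetical helper, not part of pandas or csv.
    """
    bad = []
    # newline="" lets the csv module handle line endings itself (Python 3)
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        try:
            header = next(reader)
        except StopIteration:
            return bad  # empty file, nothing to check
        expected = len(header)
        try:
            for row in reader:
                if len(row) != expected:
                    # reader.line_num is the physical line just read
                    bad.append((reader.line_num, len(row)))
        except csv.Error as exc:
            # parsing stops here; report where it stopped and why
            bad.append((reader.line_num, str(exc)))
    return bad
```

Calling find_bad_rows on the concatenated file (store.csv in the question) gives you the line numbers to inspect by hand. An empty list means every row matched the header width and parsed without a csv.Error, so the cause would have to be looked for elsewhere.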