Python Pandas: "Error tokenizing data. C error: EOF inside string" when reading a 1 GB CSV file

Question

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1,106,012 rows and 171 columns. Other, smaller files finish successfully without any error, but when I read this 1 GB file it fails every time at exactly line 1106011, which is the second-to-last line of the file. I can remove that line manually, but that is not a solution: I have hundreds of other files of the same size and cannot fix them all by hand. Can anyone help me with this?

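import pandas as pd


def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    # Read one chunk: apply skiprows=eachRow, then parse at most size_of_chunk rows.
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,  # replaced by on_bad_lines='skip' in pandas 1.3+
                     sep=',')
    # Also tried, without success:
    #   engine='python'
    #   quoting=csv.QUOTE_NONE
    #   encoding='utf-8'

    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    # Convert every column to lower-case strings.
    df = df.apply(lambda x: x.astype(str).str.lower())

    return df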

I then call this function inside a loop, and it works fine:

huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
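As a side note on the chunking itself, here is a minimal sketch of reading the file in a single pass with the `chunksize` parameter of `read_csv` instead of repeated `skiprows`/`nrows` calls. The function name `iter_clean_chunks`, the use of `dtype=str`, and the sample data are assumptions made for illustration, not part of the original code, and this does not by itself fix the EOF error.

```python
import io

import pandas as pd


def iter_clean_chunks(source, header_row, chunk_size):
    """Yield cleaned DataFrames of at most chunk_size rows from one pass over source."""
    # chunksize makes read_csv return an iterator of DataFrames.
    reader = pd.read_csv(source, header=None, names=header_row,
                         chunksize=chunk_size, dtype=str, sep=',')
    for chunk in reader:
        chunk = chunk.drop_duplicates(keep='first')
        # Lower-case every column, as the original function does.
        yield chunk.apply(lambda col: col.astype(str).str.lower())


# Tiny in-memory stand-in for the real file (hypothetical data).
sample = io.StringIO("A,x\nB,y\nB,y\nC,z\n")
chunks = list(iter_clean_chunks(sample, ["name", "code"], chunk_size=2))
```

Because the reader keeps its position between chunks, the file is tokenized only once rather than being re-scanned from the top on every call.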

I have read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654, and many more, and I tried adding read_csv parameters such as:

engine='python'

quoting=csv.QUOTE_NONE  # hangs, and even freezes the Python shell; I don't know why

encoding='utf-8'

None of them worked; it still throws the following error.

Error:

Traceback (most recent call last):
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
    huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
    sep=',')
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
  File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
  File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
  File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>> 
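The message means the parser saw an opening quote character on the reported line and reached the end of the file without finding the closing quote. One way to locate such lines before parsing is to count quote characters per line. The sketch below uses only the standard library; the helper name `find_odd_quote_lines` and the demonstration data are hypothetical. It is a heuristic: a legitimately quoted field that spans several lines would also be flagged.

```python
import os
import tempfile


def find_odd_quote_lines(path, quotechar=b'"'):
    """Return 1-based numbers of lines that contain an odd number of quote characters."""
    suspects = []
    with open(path, 'rb') as handle:
        # Iterating a binary file yields one line at a time, so memory use stays small.
        for line_number, line in enumerate(handle, start=1):
            if line.count(quotechar) % 2 == 1:
                suspects.append(line_number)
    return suspects


# Small demonstration file (hypothetical data): line 3 opens a quote and never closes it.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'id,comment\n1,"fine"\n2,"broken\n3,ok\n')
    demo_path = tmp.name

suspect_lines = find_odd_quote_lines(demo_path)
os.remove(demo_path)
```

Running the same function over the real file should show whether line 1106011 holds a stray quote, and the same check can be applied to the other files in a loop.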

Recommended answer

If you are on Linux, try removing all non-printable characters from the file, then load it again.

tr -dc '[:print:]\n' < file > newfile
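The traceback in the question shows Windows paths, where `tr` may not be available. As a rough Python equivalent, the sketch below keeps only printable ASCII bytes and newlines, which is what `[:print:]` covers in the C locale. The function name `strip_non_printable` and the file names are assumptions for illustration. Note that this also drops tabs, carriage returns, and any non-ASCII bytes, so it is only suitable when the data is plain ASCII.

```python
import os
import tempfile

# Bytes to keep: printable ASCII (0x20-0x7E) plus the newline byte.
KEEP = set(range(0x20, 0x7F)) | {0x0A}
DELETE = bytes(b for b in range(256) if b not in KEEP)


def strip_non_printable(src_path, dst_path, block_size=1 << 20):
    """Copy src_path to dst_path, dropping every byte that is not in KEEP."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            # bytes.translate with table=None only deletes the listed bytes.
            dst.write(block.translate(None, DELETE))


# Demonstration on a small temporary file (hypothetical data).
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, 'raw.csv')
clean_path = os.path.join(workdir, 'clean.csv')
with open(raw_path, 'wb') as f:
    f.write(b'a,b\x00\n1,\x1a2\r\n')
strip_non_printable(raw_path, clean_path)
with open(clean_path, 'rb') as f:
    cleaned = f.read()
```

The file is processed in 1 MiB blocks, so a 1 GB input does not need to fit in memory.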
