Pandas ParserError EOF character when reading multiple csv files to HDF5


Problem description

Using Python3, Pandas 0.12

I'm trying to write multiple csv files (7.9 GB in total) to an HDF5 store for later processing. Each csv file contains around a million rows and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the csv files I get the following error:

Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done 

Edit:

I managed to find a file that produces this problem. I think the parser is reading an EOF character, but I have no idea how to get around it. Given the size of the combined files, checking every character in every string seems too cumbersome. (Even then, I would not be sure what to do.) As far as I can tell, the csv files contain no strange characters that could raise the error. I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.

My code is as follows:

# -*- coding: utf-8 -*-

import pandas as pd
import os
from glob import glob


def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files


def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)

    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()

Edit:

If I open the CSV file that raises the CParserError EOF... and manually delete all rows after the line that causes the problem, the csv file is read properly. However, all I'm deleting are blank rows anyway. The weird thing is that when I manually correct the erroneous csv files, they load fine into the store individually. But when I again use a list of multiple files, the 'false' files still return errors.
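Checking every character by hand is impractical at this file size, but a short script can narrow the search. The sketch below flags lines whose quote count is odd, which is one common cause of an "EOF inside string" error. The helper name lines_with_odd_quotes and the sample rows are made up for illustration, and the check is only a heuristic: a quoted field that legitimately spans several lines would also be flagged.

```python
def lines_with_odd_quotes(lines, quotechar='"'):
    """Return (line_number, text) pairs for lines with an odd quote count.

    `lines` can be any iterable of strings, including an open file object.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        # An odd count means a quote was opened (or closed) without a partner
        # on the same line.
        if line.count(quotechar) % 2 == 1:
            suspects.append((number, line.rstrip("\n")))
    return suspects


# Made-up sample rows; the third line opens a quote that is never closed.
sample = ['a,b\n', '1,"ok"\n', '2,"broken\n', '3,fine\n']
suspects = lines_with_odd_quotes(sample)
print(suspects)
```

For a real file, passing the open file object (for example inside a with open(...) block) avoids loading the whole file into memory, since the function iterates line by line.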

Recommended answer

I had a similar problem. The line reported in the 'EOF inside string' error had a string that contained a single quote mark. Adding the option quoting=csv.QUOTE_NONE fixed my problem.

For example:

import csv
import pandas as pd
# csvfile is the path to the tab-separated file being read
df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
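To see the effect in isolation, here is a minimal sketch that reproduces the error on a small in-memory sample and then reads the same data with quoting=csv.QUOTE_NONE. The sample data is invented, and the sketch assumes a recent pandas where the exception is exposed as pd.errors.ParserError (in the 0.12 traceback above it appears as pandas.parser.CParserError).

```python
import csv
import io

import pandas as pd

# Invented sample: the second data row opens a double quote that never closes.
data = 'id,comment\n1,ok\n2,"unterminated\n3,fine\n'

error_message = None
try:
    # With default quoting the parser treats everything after the opening
    # quote as part of one field and reaches the end of the input.
    pd.read_csv(io.StringIO(data))
except pd.errors.ParserError as exc:
    error_message = str(exc)

# With QUOTE_NONE the quote character is treated as ordinary text.
df = pd.read_csv(io.StringIO(data), quoting=csv.QUOTE_NONE)
print(error_message)
print(df)
```

Note that with QUOTE_NONE the quote characters stay in the values, and a delimiter inside a quoted field is no longer protected, so this option fits files where quotes are plain data and not field wrappers.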
