pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file


Problem description

I have large CSV files, more than 10 MB each, and about 50+ such files. These inputs have more than 25 columns and more than 50K rows.

All of them have the same headers, and I am trying to merge them into one CSV with the header mentioned only one time.

Option One: code working for small-sized CSVs -- 25+ columns, but file size in KBs.

import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)

full_df.to_csv('output.csv')

But the above code does not work for the larger files and gives an error.

Error:

Traceback (most recent call last):
  File "merge_large.py", line 6, in <module>
    all_files = glob.glob("*.csv", encoding='utf8', engine='python')     
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py 
Traceback (most recent call last):
  File "merge_large.py", line 10, in <module>
    df = pd.read_csv(file_,index_col=None, header=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Code: 25+ columns, but file size more than 10 MB (Options Two, Three, and Four):

import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)

full_df.to_csv('output.csv')

Error:

Traceback (most recent call last):
  File "merge_large.py", line 6, in <module>
    allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'

I have searched extensively, but I am not able to find a solution for concatenating large CSV files with the same headers into one file.

Code:

import dask.dataframe as dd  

ddf = dd.read_csv('*.csv')

ddf.to_csv('master.csv',index=False)

Error:

Traceback (most recent call last):
  File "merge_csv_dask.py", line 5, in <module>
    ddf.to_csv('master.csv',index=False)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
    compute(*values)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
    raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}

Traceback
---------
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
    coerce_dtypes(df, dtypes)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
    df[c] = df[c].astype(dtypes[c])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
    raise_on_error=raise_on_error, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
    values=values, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
    values = com._astype_nansafe(values.ravel(), dtype, copy=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
    return arr.astype(dtype)

Recommended answer

If I understand your problem, you have large CSV files with the same structure that you want to merge into one big CSV file.

My suggestion is to use dask from Continuum Analytics to handle this job. You can merge your files, but you can also perform out-of-core computations and analysis of the data, just like pandas.

### make sure you include the [complete] tag
pip install dask[complete]

Solution Using Your Sample Data from DropBox

First, check the versions of dask and pandas. For me, dask = 0.11.0 and pandas = 0.18.1.

import dask
import pandas as pd
print (dask.__version__)
print (pd.__version__)

Here's the code to read in ALL your CSVs. I had no errors using your DropBox example data.

import dask.dataframe as dd
from dask.delayed import delayed
import dask.bag as db
import glob

filenames = glob.glob('/Users/linwood/Downloads/stack_bundle/rio*.csv')

'''
The key to getting around the CParserError was using sep=None.
It came from this post:
http://stackoverflow.com/questions/37505577/cparsererror-error-tokenizing-data
'''

# custom reader function; sep=None lets pandas sniff the delimiter
def reader(filename):
    return pd.read_csv(filename, sep=None)

# build list of delayed pandas csv reads; then read in as dask dataframe

dfs = [delayed(reader)(fn) for fn in filenames]
df = dd.from_delayed(dfs)



Everything below this point is extra.

# print the count of values in each column; perfect data would have the same count
# you have dirty data as the counts will show

print (df.count().compute())

The next step is doing some pandas-like analysis. Here is some code where I first "clean" your data in the 'tweetFavoriteCt' column. Not all of the data is an integer, so I replace strings with "0" and convert everything else to an integer. Once I have the integer conversion, I show a simple analytic where I filter the entire dataframe to include only the rows where favoriteCt is greater than 3.

# function to convert numbers to integer and replace string with 0; sample analytics in dask dataframe
# you can come up with your own..this is just for an example
def conversion(value):
    try:
        return int(value)
    except:
        return int(0)

# apply the function to the column, create a new column of cleaned data
clean = df['tweetFavoriteCt'].apply(lambda x: (conversion(x)),meta=('stuff',str))

# set new column equal to our cleaning code above; your data is dirty :-(
df['cleanedFavoriteCt'] = clean

The last bit of code shows a dask analysis, how to load this merged data into pandas, and how to write the merged file to disk. Be warned: if you have tons of CSVs, when you use the .compute() code below, it will load the merged CSV into memory.

# retrieve the 50 tweets with the highest favorite count
print(df.nlargest(50,['cleanedFavoriteCt']).compute())

# only show me the tweets that have been favorited at least 3 times
# TweetID 763525237166268416, is VERRRRY popular....7000+ favorites
print((df[df.cleanedFavoriteCt.apply(lambda x: x>3,meta=('stuff',str))]).compute())

'''
This is the final step.  The .compute() code below turns the 
dask dataframe into a single pandas dataframe with all your
files merged. If you don't need to write the merged file to
disk, I'd skip this step and do all the analysis in 
dask. Get a subset of the data you want and save that.  
'''
df = df.reset_index().compute()
df.to_csv('./test.csv')

Now, if you want to switch to pandas for the merged CSV file:

import pandas as pd
dff = pd.read_csv('./test.csv')

Let me know if this works.

Stop here.

The first step is making sure you have dask installed. There are install instructions for dask on the documentation page, but this should work:
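
### make sure you include the [complete] tag
pip install dask[complete]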

With dask installed, it's easy to read in the files.

Some housekeeping first. Assume we have a directory of CSVs where the filenames are my18.csv, my19.csv, my20.csv, etc. Name standardization and a single directory location are key. This works if you put your CSV files in one directory and serialize the names in some way.

Step by step:

  1. Import dask and read in all the CSV files using a wildcard. This merges all the CSVs into one single dask.dataframe object. You can do pandas-like operations immediately after this step if you want.

import dask.dataframe as dd  
ddf = dd.read_csv('./daskTest/my*.csv')
ddf.describe().compute()

  2. Write the merged dataframe to disk in the same directory as the original files and name it master.csv.

ddf.to_csv('./daskTest/master.csv',index=False)

  3. Optional: read master.csv, which is much bigger in size, into a dask.dataframe object for computations. This can also be done after step one above; dask can perform pandas-like operations on the staged files... this is one way to do "big data" in Python.

# reads in the merged file as one BIG out-of-core dataframe; supports pandas-like operations
newddf = dd.read_csv('./daskTest/master.csv')

# check the length; this is now the total length of all merged files. In this example, 50,000 rows times 11 files = 550,000 rows.
len(newddf)

# perform pandas-like summary stats on entire dataframe
newddf.describe().compute()

Hopefully this helps answer your question. In three steps, you read in all the files, merge them into a single dataframe, and write that massive dataframe to disk with only one header and all your rows.
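
For reference, a minimal recap sketch of those three steps in one place, assuming the same ./daskTest/my*.csv layout used above:

import dask.dataframe as dd

# step 1: read every CSV matching the wildcard into a single dask dataframe
ddf = dd.read_csv('./daskTest/my*.csv')

# step 2: write the merged data to disk with a single header row
ddf.to_csv('./daskTest/master.csv', index=False)

# step 3 (optional): read the merged file back for out-of-core, pandas-like analysis
newddf = dd.read_csv('./daskTest/master.csv')
print(newddf.describe().compute())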
