Using pandas read_csv with missing data


Question

I am attempting to read a csv file where some rows may be missing chunks of data.

This seems to cause a problem with the pandas read_csv function when you specify the dtype: to convert from str to whatever the dtype specifies, pandas just tries to cast the value directly. Therefore, if something is missing, things break down.

A MWE follows (this MWE uses StringIO in place of a true file; however, the issue also occurs when a real file is used):

import pandas as pd
import numpy as np
import io

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")

names = ['id', 'flag', 'number', 'data', 'data2']
dtypes = [str, str, int, float, float]  # plain built-ins; the old np.str/np.int aliases are removed in newer NumPy

dform = {name: dtypes[ind] for ind, name in enumerate(names)}

colconverters = {0: lambda s: s.strip(), 1: lambda s: s.strip()}

df = pd.read_table(datfile, sep='|', dtype=dform, converters=colconverters, header=None,
                   index_col=0, names=names, na_values=' ')

The error that occurs when this is run is:

Traceback (most recent call last):
  File "pandas/parser.pyx", line 1084, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12580)
TypeError: Cannot cast array from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aliounis/Repos/stellarpy/source/mwe.py", line 15, in <module>
    index_col=0, names=names, na_values=' ')
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
  File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
  File "pandas/parser.pyx", line 1090, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12656)
ValueError: invalid literal for int() with base 10: '   '

Is there some way I can fix this? I looked through the documentation but didn't see anything that looked like it would directly address this problem. Is this just a bug that needs to be reported to pandas?

Answer

So, as Merlin pointed out, the main problem is that NaNs can't be ints, which is probably why pandas acts this way to begin with. Unfortunately I didn't have a choice, so I had to make some changes to the pandas source code myself. I ended up having to change lines 1087-1096 of the file parser.pyx to

        na_count_old = na_count
        for ind, row in enumerate(col_res):
            # strip so that runs of spaces match entries in the na list
            k = kh_get_str(na_hashset, row.strip().encode())
            if k != na_hashset.n_buckets:
                # the value is in the na list: store it as a double NaN
                col_res[ind] = np.nan
                na_count += 1
            else:
                # otherwise cast it to the dtype requested for this column
                col_res[ind] = np.array(col_res[ind]).astype(col_dtype).item(0)

        if na_count_old == na_count:
            # float -> int conversions can fail the above
            # even with no nans
            col_res_orig = col_res
            col_res = col_res.astype(col_dtype)
            if (col_res != col_res_orig).any():
                raise ValueError("cannot safely convert passed user dtype of "
                                 "{col_dtype} for {col_res} dtyped data in "
                                 "column {column}".format(col_dtype=col_dtype,
                                                          col_res=col_res_orig.dtype.name,
                                                          column=i))

This essentially goes through each element of the column and checks whether it is contained in the na list (note that we have to strip the value so that multi-space fields show up as being in the na list). If it is, that element is set to a double np.nan. If it is not in the na list, it is cast to the original dtype specified for that column (which means the column will end up holding multiple dtypes).
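The same per-element logic can be approximated without patching pandas, by reading the troublesome columns as strings and converting them afterwards. A sketch of that workaround (not the author's patch; the `convert` helper and `na_strings` set are names introduced here for illustration):

```python
import io
import numpy as np
import pandas as pd

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']

# Read everything as strings first so the parser never attempts an unsafe cast.
df = pd.read_csv(datfile, sep='|', header=None, index_col=0, names=names, dtype=str)

na_strings = {''}  # values that should become NaN after stripping

def convert(cell, dtype):
    # Mirror the patched loop: strip, check the na set, otherwise cast.
    if not isinstance(cell, str) or cell.strip() in na_strings:
        return np.nan
    return dtype(cell)

for col, dtype in [('number', int), ('data', float), ('data2', float)]:
    df[col] = df[col].map(lambda s: convert(s, dtype))
```

As with the source patch, the converted columns end up as object dtype holding a mix of the requested type and float NaNs.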

While this isn't a perfect fix (and is likely slow), it works for my needs, and maybe someone else with a similar problem will find it useful.
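As an aside not in the original answer: newer pandas versions make the source patch unnecessary, because the nullable Int64 extension dtype (introduced in pandas 0.24, usable directly in read_csv since 1.0) can hold missing values in an integer column. A sketch:

```python
import io
import pandas as pd

datfile = io.StringIO("12 23 43| | 37| 12.23| 71.3\n12 23 55|X|   |      | 72.3")
names = ['id', 'flag', 'number', 'data', 'data2']

# skipinitialspace turns the all-space fields into empty strings, which
# pandas treats as missing; 'Int64' (capital I) is the nullable integer dtype.
df = pd.read_csv(datfile, sep='|', header=None, index_col=0, names=names,
                 skipinitialspace=True, dtype={'number': 'Int64'})
print(df['number'])
```

Here the 'number' column stays a true integer column with a proper missing value in the second row, instead of an object column of mixed types.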

