Can I access read_csv()'s dtype inference engine when creating a DataFrame from a nested list?


Problem description


This follows from a discussion with piRSquared here, where I found that read_csv seems to have its own type inference methods that appear to be broader in their ability to obtain the correct type. It also appears to be more fault-tolerant in the case of missing data, electing for NaN instead of throwing ValueError as its default behaviour.


There are a lot of cases where the inferred datatypes are perfectly acceptable for my work, but this functionality doesn't seem to be exposed when instantiating a DataFrame, or anywhere else in the API that I can find, meaning that I have to deal with dtypes manually and unnecessarily. This can be tedious if you have hundreds of columns. The closest I can find is convert_objects(), but it doesn't handle the bools in this case. The alternative I could use is to dump to disk and read it back in, which is grossly inefficient.


The example below illustrates the default behaviour of read_csv vs. the default behaviour of the conventional methods for setting dtype (correct as of v0.20.3). Is there a way to access the type inference of read_csv without dumping to disk? More generally, is there a reason why read_csv behaves like this?

Example:

import numpy as np
import pandas as pd
import csv

data = [['string_boolean', 'numeric', 'numeric_missing'], 
        ['FALSE', 23, 50], 
        ['TRUE', 19, 12], 
        ['FALSE', 4.8, '']]

with open('my_csv.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)

# Reading from CSV
df = pd.read_csv('my_csv.csv')
print(df.string_boolean.dtype) # Automatically converted to bool
print(df.numeric.dtype) # Float, as expected
print(df.numeric_missing.dtype) # Float, doesn't care about empty string

# Creating directly from list without supplying datatypes
df2 = pd.DataFrame(data[1:], columns=data[0])
df2.string_boolean = df2.string_boolean.astype(bool) # Doesn't work - every non-empty string becomes True
df2.numeric_missing = df2.numeric_missing.astype(np.float64) # Doesn't work - ValueError on the empty string

# Creating while forcing dtype doesn't work either:
# dtype takes a single type for the whole frame, not a list per column
df3 = pd.DataFrame(data[1:], columns=data[0], 
                   dtype=[bool, np.float64, np.float64])

# The working method
df4 = pd.DataFrame(data[1:], columns=data[0])
df4.string_boolean = df4.string_boolean.map({'TRUE': True, 'FALSE': False})
df4.numeric_missing = pd.to_numeric(df4.numeric_missing)
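For reference, a self-contained sketch of this per-column approach, checking the resulting dtypes. I pass errors='coerce' to pd.to_numeric as an assumption on my part, so the unparseable empty string becomes NaN rather than raising:

```python
import numpy as np
import pandas as pd

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

df = pd.DataFrame(data[1:], columns=data[0])
# Explicit per-column conversions:
df['string_boolean'] = df['string_boolean'].map({'TRUE': True, 'FALSE': False})
# errors='coerce' turns anything unparseable (here the empty string) into NaN
df['numeric_missing'] = pd.to_numeric(df['numeric_missing'], errors='coerce')

assert df['string_boolean'].dtype == bool
assert df['numeric_missing'].dtype == np.float64
```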

Answer


One solution is to use a StringIO object. The only difference is that it keeps all the data in memory, instead of writing to disk and reading back in.


Code is as follows (note: Python 3!):

import numpy as np
import pandas as pd
import csv
from io import StringIO

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

with StringIO() as fobj:
    writer = csv.writer(fobj)
    writer.writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)

print(df.head(3))
print(df.string_boolean.dtype) # Automatically converted to bool
print(df.numeric.dtype) # Float, as expected
print(df.numeric_missing.dtype) # Float, doesn't care about empty string


The with StringIO() as fobj isn't really necessary: fobj = StringIO() works just as well. But since the context manager closes the StringIO() object when the block exits, the df = pd.read_csv(fobj) has to be inside it.
Note also the fobj.seek(0), which is another necessity: your disk-based solution closes and then reopens the file, which automatically puts the file pointer back at the start, whereas an in-memory buffer stays positioned at the end after writing and has to be rewound explicitly.
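The pointer behaviour can be seen in a couple of lines (a minimal sketch using only io.StringIO):

```python
from io import StringIO

buf = StringIO()
buf.write('a,b\n1,2\n')
# The stream position is now at the end, so reading yields nothing:
assert buf.read() == ''
# seek(0) rewinds to the start, so a reader sees the full contents:
buf.seek(0)
assert buf.read() == 'a,b\n1,2\n'
```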


I actually tried to make the above code Python 2/3 compatible. That became a mess, because of the following: Python 2 has an io module, just like Python 3, whose StringIO class makes everything unicode (also in Python 2; in Python 3 it is, of course, the default).
That is great, except that the csv writer module in Python 2 is not unicode compatible.
Thus, the alternative is to use the (older) Python 2 (c)StringIO module, for example as follows:

try:
    from cStringIO import StringIO  # Python 2
except ImportError:  # Python 3: cStringIO no longer exists
    from io import StringIO


and things will be plain text in Python 2, and unicode in Python 3.
Except that now, cStringIO.StringIO does not have a context manager, and the with statement will fail. As I mentioned, it is not really necessary, but I was keeping things as close as possible to your original code.
In other words, I could not find a nice way to stay close to the original code without ridiculous hacks.


I've also looked at avoiding the CSV writer completely, which leads to:

text = '\n'.join(','.join(str(item).strip("'") for item in items) 
                 for items in data)

with StringIO(text) as fobj:
    df = pd.read_csv(fobj)


which is perhaps neater (though a bit less clear), and Python 2/3 compatible. (I don't expect it to work for everything that the csv module can handle, but here it works fine.)
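One case where the naive join does differ from the csv module is a field that itself contains a comma; a small sketch (my own example, not from the original post):

```python
import csv
from io import StringIO

row = ['a,b', 1]  # a field containing a comma
naive = ','.join(str(item) for item in row)
assert naive == 'a,b,1'  # looks like three columns instead of two

buf = StringIO()
csv.writer(buf).writerow(row)
# The csv module quotes the field (default line terminator is \r\n):
assert buf.getvalue() == '"a,b",1\r\n'
```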

Here, I can only speculate.


I would think the reasoning is that when the input consists of Python objects (dicts, lists), the input is known and in the hands of the programmer. Therefore, it is unlikely, perhaps even illogical, that this input would contain strings such as 'FALSE' or ''. Instead, it would normally contain the objects False and np.nan (or math.nan), since the programmer would already have taken care of the (string) translation.
Whereas for a file (CSV or other), the input can be anything: a colleague might send an Excel CSV file, or someone else might send you a Gnumeric CSV file. I don't know how standardised CSV files are, but you'd probably need some code to allow for exceptions, and overall for the conversion of the strings to Python (NumPy) format.
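To illustrate the point: if the programmer has already done that translation, the plain constructor infers the dtypes fine (a minimal sketch):

```python
import numpy as np
import pandas as pd

# Input as proper Python objects, not CSV-style strings:
df = pd.DataFrame({'string_boolean': [False, True, False],
                   'numeric_missing': [50.0, 12.0, np.nan]})
assert df['string_boolean'].dtype == bool
assert df['numeric_missing'].dtype == np.float64
```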


So in that sense, it is actually illogical to expect pd.DataFrame(...) to accept just anything: instead, it should accept something that is properly formatted.


You might argue for a convenience method that takes a list like yours, but a list is not a CSV file (which is just a bunch of characters, including newlines). Plus, I expect pd.read_csv has the option to read files in chunks (perhaps even line by line), which becomes harder if you feed it a string with newlines instead: you can't really read that line by line, since you would have to split it on the newlines and keep all the lines in memory, and you already have the full string in memory somewhere rather than on disk. But I digress.


Besides, the StringIO trick performs exactly this conversion in just a few lines.

