pandas unable to read from large StringIO object
Problem description
I'm using pandas to manage a large array of 8-bit integers. These integers are included as space-delimited elements of a column in a comma-delimited CSV file, and the array size is about 10000x10000.
Pandas is able to quickly read the comma-delimited data from the first few columns as a DataFrame, and also quickly store the space-delimited strings in another DataFrame with minimal hassle. The trouble comes when I try to transform the table from a single column of space-delimited strings to a DataFrame of 8-bit integers.
I have tried the following:
intdata = pd.DataFrame(strdata.columnname.str.split().tolist(), dtype='uint8')
But the memory usage is unbearable - 10MB worth of integers consumes 2GB of memory. I'm told that it's a limitation of the language and there's nothing I can do about it in this case.
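One way to keep the conversion in memory without the blowup is to skip the giant list-of-lists entirely. The sketch below is an assumption about the setup (strdata and columnname stand in for the question's actual objects): converting each row straight to a uint8 NumPy array avoids materializing millions of small Python string objects at once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for strdata: one column of space-delimited integer strings
strdata = pd.DataFrame({'columnname': ['1 2 3', '4 5 6', '250 251 252']})

# Convert each row directly to a compact uint8 array; only one row's worth
# of Python string objects is alive at a time
rows = np.vstack([np.array(s.split(), dtype=np.uint8)
                  for s in strdata['columnname']])
intdata = pd.DataFrame(rows)
```

Whether this is fast enough at 10000x10000 depends on the Python-level loop cost, which is why the file-round-trip workaround below is still worth having.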
As a possible workaround, I was advised to save the string data to a CSV file and then reload the CSV file as a DataFrame of space-delimited integers. This works well, but to avoid the slowdown that comes from writing to disk, I tried writing to a StringIO object.
Here's a minimal non-working example:
import numpy as np
import pandas as pd
from cStringIO import StringIO
a = np.random.randint(0,256,(10000,10000)).astype('uint8')
b = pd.DataFrame(a)
c = StringIO()
b.to_csv(c, delimiter=' ', header=False, index=False)
d = pd.io.parsers.read_csv(c, delimiter=' ', header=None, dtype='uint8')
Which yields the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 443, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 228, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 533, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 670, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1032, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 486, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4494)
ValueError: No columns to parse from file
Which is puzzling, because if I run the exact same code with 'c.csv' instead of c, the code works perfectly. Also, if I use the following snippet:
with open('c.csv', 'w') as f:
    f.write(c.getvalue())
The CSV file gets saved without any problems, so writing to the StringIO object is not the issue.
It is possible that I need to replace c with c.getvalue() in the read_csv line, but when I do that, the interpreter tries to print the entire contents of c in the terminal! Surely there is a way to work around this.
Thanks in advance for your help.
Answer
There are two issues here, one fundamental and one you simply haven't come across yet. :^)
First, after you write to c, you're at the end of the (virtual) file. You need to seek back to the start. We'll use a smaller grid as an example:
>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, delimiter=' ', header=False, index=False)
>>> next(c)
Traceback (most recent call last):
File "<ipython-input-57-73b012f9653f>", line 1, in <module>
next(c)
StopIteration
which generates the "no columns" error. If we seek first, though:
>>> c.seek(0)
>>> next(c)
'103,3,171,239,150,35,224,190,225,57\n'
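The position behavior can be seen with a plain StringIO on its own (shown here with Python 3's io.StringIO as a sketch; cStringIO behaves the same way for this purpose):

```python
import io

s = io.StringIO()
s.write('1,2,3\n')
print(s.tell())   # 6: the cursor sits after everything just written
print(s.read())   # '': reading from here finds nothing, hence "no columns"
s.seek(0)         # rewind to the start
print(s.read())   # '1,2,3\n': now the data is visible again
```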
But now you'll notice the second issue: commas? I thought we requested space delimiters? But to_csv only accepts sep, not delimiter. Seems to me it should either accept it or object that it doesn't, but silently ignoring it feels like a bug. Anyway, if we use sep (or delim_whitespace=True):
>>> a = np.random.randint(0,256,(10,10)).astype('uint8')
>>> b = pd.DataFrame(a)
>>> c = StringIO()
>>> b.to_csv(c, sep=' ', header=False, index=False)
>>> c.seek(0)
>>> d = pd.read_csv(c, sep=' ', header=None, dtype='uint8')
>>> d
0 1 2 3 4 5 6 7 8 9
0 209 65 218 242 178 213 187 63 137 145
1 161 222 50 92 157 31 49 62 218 30
2 182 255 146 249 115 91 160 53 200 252
3 192 116 87 85 164 46 192 228 104 113
4 89 137 142 188 183 199 106 128 110 1
5 208 140 116 50 66 208 116 72 158 169
6 50 221 82 235 16 31 222 9 95 111
7 88 36 204 96 186 205 210 223 22 235
8 136 221 98 191 31 174 83 208 226 150
9 62 93 168 181 26 128 116 92 68 153
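Putting both fixes together, a sketch of the full round trip looks like this (written for Python 3, where io.StringIO replaces the Python 2 cStringIO used in the question; a smaller grid is used so it runs quickly):

```python
import io
import numpy as np
import pandas as pd

a = np.random.randint(0, 256, (100, 100)).astype('uint8')
b = pd.DataFrame(a)

c = io.StringIO()
b.to_csv(c, sep=' ', header=False, index=False)  # sep, not delimiter
c.seek(0)                                        # rewind before reading

d = pd.read_csv(c, sep=' ', header=None, dtype='uint8')
print(d.shape)  # (100, 100)
```

The two lines that matter are the sep= keyword on to_csv and the c.seek(0) between writing and reading; drop either one and you are back to the "No columns to parse from file" error or a comma-delimited buffer.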