How can I filter lines on load in the Pandas read_csv function?
Problem description
How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option that one should find in read_csv. Am I missing something?
Example: we have a CSV with a timestamp column and we'd like to load just the lines with a timestamp greater than a given constant.
Recommended answer
There isn't an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter to each chunk as you concatenate the chunks of your file, e.g.:
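import pandas as pd
# With chunksize set, read_csv returns an iterator of DataFrames of up to 1000 rows each
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
# Keep only the matching rows of each chunk, then combine them into one DataFrame
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])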
You can vary the chunksize to suit your available memory. See here for more details.
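To tie the chunked approach back to the timestamp example in the question, here is a minimal, self-contained sketch. The column names (timestamp, value), the sample data, and the helper name load_filtered are assumptions made for illustration; an in-memory string stands in for a large file on disk, and the tiny chunksize is chosen only so that several chunks are produced.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file on disk.
csv_text = """timestamp,value
2021-01-01 00:00:00,1
2021-01-02 00:00:00,2
2021-01-03 00:00:00,3
2021-01-04 00:00:00,4
2021-01-05 00:00:00,5
"""


def load_filtered(source, cutoff, chunksize=2):
    """Read `source` in chunks, keeping only rows with timestamp > cutoff."""
    cutoff = pd.Timestamp(cutoff)
    # chunksize makes read_csv yield DataFrames of at most `chunksize` rows.
    reader = pd.read_csv(source, parse_dates=["timestamp"], chunksize=chunksize)
    # Filter each chunk before concatenating, so rejected rows are not kept.
    kept = [chunk[chunk["timestamp"] > cutoff] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


df = load_filtered(io.StringIO(csv_text), "2021-01-02")
print(df)
```

Only one chunk plus the rows that have already passed the filter are held in memory at a time, so peak memory depends on chunksize and on how selective the filter is rather than on the total file size.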