Efficient way to extract a few lines of data from a large CSV data file in Python
Question
I have a large number of CSV data files, and each data file contains several days' worth of tick data for one ticker, in the following form:
ticker DD/MM/YYYY time bid ask
XXX, 19122014, 08:00:08.325, 9929.00,9933.00
XXX, 19122014, 08:00:08.523, 9924.00,9931.00
XXX, 19122014, 08:00:08.722, 9925.00,9930.50
XXX, 19122014, 08:00:08.921, 9924.00,9928.00
XXX, 19122014, 08:00:09.125, 9924.00,9928.00
…
XXX, 30122014, 21:56:25.181, 9795.50,9796.50
XXX, 30122014, 21:56:26.398, 9795.50,9796.50
XXX, 30122014, 21:56:26.598, 9795.50,9796.50
XXX, 30122014, 21:56:26.798, 9795.50,9796.50
XXX, 30122014, 21:56:28.896, 9795.50,9796.00
XXX, 30122014, 21:56:29.096, 9795.50,9796.50
XXX, 30122014, 21:56:29.296, 9795.50,9796.00
…
I need to extract any lines of data whose time is within a certain range, say 09:00:00 to 09:15:00. My current solution is simply to read each data file into a data frame, sort it by time, and then use searchsorted to find 09:00:00 and 09:15:00. It works fine, but performance is an issue when I have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks in advance for the help!
Answer
Short answer: put your data in an SQL database and give the "time" column an index. You can't beat that with CSV files, whether you use Pandas or not.
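A minimal sketch of that approach, assuming SQLite and an illustrative table name (`ticks`): each CSV file is loaded once, and every later range query then hits the index instead of re-reading files. The in-memory database and sample rows below stand in for a real database file and the actual data.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for one of the CSV files (illustrative data).
sample = io.StringIO(
    "ticker,date,time,bid,ask\n"
    "XXX, 19122014, 08:59:59.900, 9929.00,9933.00\n"
    "XXX, 19122014, 09:00:08.325, 9929.00,9933.00\n"
    "XXX, 19122014, 09:14:59.999, 9924.00,9931.00\n"
    "XXX, 19122014, 09:15:00.100, 9925.00,9930.50\n")

conn = sqlite3.connect(":memory:")  # a file path in real use
conn.execute("CREATE TABLE ticks (ticker TEXT, day TEXT, time TEXT, "
             "bid REAL, ask REAL)")

reader = csv.reader(sample)
next(reader)  # skip the header line
conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)",
                 ([c.strip() for c in row] for row in reader))

# The index on "time" is what lets range queries avoid a full scan.
conn.execute("CREATE INDEX idx_time ON ticks(time)")

rows = conn.execute(
    "SELECT time FROM ticks WHERE time BETWEEN ? AND ? ORDER BY time",
    ("09:00:00.000", "09:15:00.000")).fetchall()
print(rows)  # only the rows inside 09:00:00-09:15:00
```

The one-time load cost is amortized over every subsequent query, which is where this pulls ahead once many time ranges or many reruns are involved.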
Without changing your CSV files, one thing that is a little faster, but not by much, is to filter the rows as you read them, keeping in memory only the ones you are interested in.
So instead of reading the whole CSV into memory, a function like this could do the job:
import csv

def filter_time(filename, mintime, maxtime):
    timecol = 2  # ticker, date, time, bid, ask -> time is the third column
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        return [line for line in reader
                if mintime <= line[timecol].strip() <= maxtime]
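Note that the filter relies on zero-padded HH:MM:SS.mmm strings comparing lexicographically in chronological order, so no time parsing is needed; a quick check with made-up times from the sample:

```python
# Zero-padded time strings sort lexicographically in chronological order,
# which is why plain string comparison suffices in the filter above.
times = ["21:56:25.181", "08:00:08.325", "09:14:59.999", "09:00:00.000"]
assert sorted(times) == ["08:00:08.325", "09:00:00.000",
                         "09:14:59.999", "21:56:25.181"]

# The same comparison used as a range filter (input order is preserved):
in_range = [t for t in times if "09:00:00" <= t <= "09:15:00"]
print(in_range)  # -> ['09:14:59.999', '09:00:00.000']
```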
This task can be easily parallelized - you could have a few instances of this running concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be the lelo Python package - it simply provides a @paralel decorator that makes the given function run in another process when called, and returns a lazy proxy for the results.
But that still has to read everything in - I think the SQL solution should be at least an order of magnitude faster.