Efficient way to extract a few lines of data from a large CSV data file in Python


Problem description

I have a large number of CSV data files, and each data file contains several days' worth of tick data for one ticker in the following form:

 ticker  DD/MM/YYYY    time         bid      ask
  XXX,   19122014,  08:00:08.325,  9929.00,9933.00
  XXX,   19122014,  08:00:08.523,  9924.00,9931.00
  XXX,   19122014,  08:00:08.722,  9925.00,9930.50
  XXX,   19122014,  08:00:08.921,  9924.00,9928.00
  XXX,   19122014,  08:00:09.125,  9924.00,9928.00
  …
  XXX,   30122014,  21:56:25.181,  9795.50,9796.50
  XXX,   30122014,  21:56:26.398,  9795.50,9796.50
  XXX,   30122014,  21:56:26.598,  9795.50,9796.50
  XXX,   30122014,  21:56:26.798,  9795.50,9796.50
  XXX,   30122014,  21:56:28.896,  9795.50,9796.00
  XXX,   30122014,  21:56:29.096,  9795.50,9796.50
  XXX,   30122014,  21:56:29.296,  9795.50,9796.00
  …

I need to extract any lines of data whose time falls within a certain range, say 09:00:00 to 09:15:00. My current solution is simply to read each data file into a data frame, sort it by time, and then use searchsorted to find 09:00:00 to 09:15:00. It works fine if performance isn't an issue and I don't have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks for the help in advance!

Recommended answer

Short answer: put your data in an SQL database, and give the "time" column an index. You can't beat that with CSV files, whether you use Pandas or not.
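As a rough sketch of that idea using only the standard library's sqlite3 module (the database file, table, and column names below are just assumptions for the example):

import csv
import sqlite3

conn = sqlite3.connect("ticks.db")  # hypothetical database file
conn.execute("""CREATE TABLE IF NOT EXISTS ticks
                (ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)""")

def load_csv(filename):
    # Bulk-insert one CSV file into the ticks table, stripping the padding spaces
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)",
                         ([cell.strip() for cell in row] for row in reader))
    conn.commit()

# The index on the time column is what makes the range query fast
conn.execute("CREATE INDEX IF NOT EXISTS idx_time ON ticks(time)")

rows = conn.execute("SELECT * FROM ticks WHERE time BETWEEN ? AND ?",
                    ("09:00:00", "09:15:00")).fetchall()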

Without changing your CSV files, one thing that would be a little faster, but not by much, is to filter the rows as you read them, keeping in memory only the ones that are interesting for you.

So instead of reading the whole CSV into memory, a function like this could do the job:

import csv

def filter_time(filename, mintime, maxtime):
    # Column layout from the sample data: ticker, date, time, bid, ask -> time is index 2
    timecol = 2
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        # Zero-padded HH:MM:SS times compare correctly as strings; strip() drops the padding spaces
        return [line for line in reader if mintime <= line[timecol].strip() <= maxtime]
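For example, filter_time("XXX_20141219.csv", "09:00:00", "09:15:00") (the filename is just illustrative) returns only the rows from that window.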



This task can easily be parallelized: you could run several instances of this concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be to use the lelo Python package; it provides a @parallel decorator that makes the given function run in another process when called and returns a lazy proxy for the result.
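If you'd rather stick to the standard library, a minimal sketch of the same idea with concurrent.futures (the helper name and file names here are assumptions, not part of the original answer) could look like this:

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def filter_many(filenames, mintime, maxtime, workers=4):
    # Run filter_time (defined above) on each file in a separate worker process;
    # results come back in the same order as the input filenames
    job = partial(filter_time, mintime=mintime, maxtime=maxtime)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, filenames))

# Example with hypothetical file names:
# rows_per_file = filter_many(["ticks_day1.csv", "ticks_day2.csv"], "09:00:00", "09:15:00")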

But that will still have to read everything in. I think the SQL solution should be at least an order of magnitude faster.
