Seek the one billionth line in a file containing 3 billion lines


Question

I have a huge log file which contains 3,453,299,000 lines of
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
    for i in range(1000000000):
        f.readline()
is very slow....

Thank you so much for your help.

Recommended answer

Sullivan WxPyQtKinter <su***********@gmail.com> writes:

This program:
    for i in range(1000000000):
        f.readline()
is very slow....



There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
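
A minimal sketch of the "read fixed size units" idea from point 2, written for Python 3 (where range is already lazy, so the xrange advice in point 1 applies to Python 2 only). The function name offset_of_line and the 1 MiB default chunk size are illustrative assumptions, not something from the thread. It still scans the file from the start, but it counts newlines in large binary chunks instead of calling f.readline() a billion times.

```python
def offset_of_line(path, line_number, chunk_size=1 << 20):
    """Return the byte offset where the 1-based line `line_number` starts.

    Returns None if the file has fewer than `line_number` lines.
    Hypothetical helper: the name and defaults are illustrative.
    """
    if line_number < 1:
        raise ValueError("line_number must be >= 1")
    remaining = line_number - 1  # newlines that precede the target line
    if remaining == 0:
        return 0
    offset = 0  # bytes consumed by the chunks already skipped
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return None  # ran out of file before reaching the line
            count = chunk.count(b"\n")
            if count < remaining:
                # Target is not in this chunk: skip it wholesale.
                remaining -= count
                offset += len(chunk)
                continue
            # The newline that ends the previous line is in this chunk.
            pos = -1
            for _ in range(remaining):
                pos = chunk.find(b"\n", pos + 1)
            start = offset + pos + 1
            # If that newline was the last byte read, make sure the
            # file actually continues before reporting a line start.
            if pos + 1 < len(chunk) or f.read(1):
                return start
            return None
```

Once the offset is known, f.seek(offset) on a file opened in binary mode lands at the start of that line. The scan remains bounded by disk throughput, so this reduces per-line overhead rather than avoiding the read.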


On 8/7/07, Sullivan WxPyQtKinter <su***********@gmail.com> wrote:

I have a huge log file which contains 3,453,299,000 lines of
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:
    for i in range(1000000000):
        f.readline()
is very slow....

Thank you so much for your help.




There is no fast way to do this, unless the lines are of fixed length
(in which case you can use f.seek() to move to the correct spot). The
reason is that there is no way to find the position of the billionth
line without scanning the whole file. You should split your logs into
smaller files in the future.

You might be able to do this a very tiny bit faster by using the split
utility and having it split the log file into smaller chunks (split can
split by line count), but since that still has to scan the file, it
will be I/O bound.

--
Evan Klitzke <ev**@yelp.com>
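
For the fixed-length case mentioned above, the offset is plain arithmetic, so f.seek() can jump straight to the line. A small sketch, assuming every line (including its trailing newline) occupies exactly record_size bytes; the function name and parameters are hypothetical:

```python
def read_fixed_length_line(path, line_number, record_size):
    """Return the 1-based line `line_number` from a file whose lines
    are all exactly `record_size` bytes long, newline included.

    Returns b"" if the requested line lies past the end of the file.
    """
    if line_number < 1:
        raise ValueError("line_number must be >= 1")
    with open(path, "rb") as f:
        # Line n starts (n - 1) records into the file; no scan needed.
        f.seek((line_number - 1) * record_size)
        return f.read(record_size)
```

This does not apply to the log file in the question, whose lines have different lengths; it only illustrates why fixed-length records make the seek cheap.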


On Aug 8, 2:35 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Sullivan WxPyQtKinter <sullivanz....@gmail.com> writes:

This program:
    for i in range(1000000000):
        f.readline()
is very slow....



There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.



Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

