Read the last N lines of a CSV file in Python with numpy / pandas
Question
Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?
I cannot use skip_header in numpy or skiprows in pandas, because the length of the file varies and I always need the last N rows.
I know I can use pure Python to read line by line from the end of the file, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which essentially use C) would be really appreciated.
Answer
With a small 10-line test file I tried two approaches: parse the whole file and select the last N rows, versus load all lines but parse only the last N:
In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop
In [1026]: %%timeit
   ...: with open('stack38704949.txt','rb') as f:
   ...:     lines = f.readlines()
   ...: np.genfromtxt(lines[-5:],delimiter=',')
1000 loops, best of 3: 378 µs per loop
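Outside of %%timeit, the second approach can be sketched as a self-contained script. The file path, its contents, and N=5 below are made-up stand-ins for the 10-line test file used in the timings:

```python
import os
import tempfile

import numpy as np

# Create a stand-in for the 10-line test file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "stack38704949.txt")
with open(path, "w") as f:
    for i in range(10):
        f.write(",".join(str(i * 10 + j) for j in range(3)) + "\n")

# Load all lines, but parse only the last N with genfromtxt.
N = 5
with open(path) as f:
    lines = f.readlines()
arr = np.genfromtxt(lines[-N:], delimiter=",")

print(arr.shape)  # (5, 3)
print(arr[-1])    # last row of the file: 90, 91, 92
```

genfromtxt accepts any iterable of lines here, so no temporary file or StringIO wrapper is needed for the slice.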
This question was tagged as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there used

from collections import deque

and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication: genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.
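For pandas itself the deque trick still applies, and there StringIO does earn its keep, since pd.read_csv wants a file-like object rather than a list of lines. A minimal sketch, with a hypothetical stand-in file and N:

```python
import io
import os
import tempfile
from collections import deque

import pandas as pd

# Stand-in for the 10-line test file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "stack38704949.txt")
with open(path, "w") as f:
    for i in range(10):
        f.write(",".join(str(i * 10 + j) for j in range(3)) + "\n")

# Stream through the file once, keeping only the last N lines.
N = 5
with open(path) as f:
    last_lines = deque(f, N)

# read_csv needs a file-like object, hence the StringIO wrapper here.
df = pd.read_csv(io.StringIO("".join(last_lines)), header=None)
print(df.shape)  # (5, 3)
```

The deque with maxlen=N discards older lines as it goes, so at most N lines are held in memory at a time.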
In [1031]: %%timeit
   ...: with open('stack38704949.txt','rb') as f:
   ...:     lines = deque(f,5)
   ...: np.genfromtxt(lines,delimiter=',')
1000 loops, best of 3: 382 µs per loop
Basically the same time as readlines and slicing.
deque may have an advantage when the file is very large and it gets costly to hold on to all the lines. I don't think it saves any file-reading time; lines still have to be read one by one.
Timings for the row_count followed by skip_header approach are slower; it requires reading the file twice, and skip_header still has to read the lines:
In [1046]: %%timeit
   ...: with open('stack38704949.txt',"r") as f:
   ...:     reader = csv.reader(f,delimiter = ",")
   ...:     data = list(reader)
   ...:     row_count = len(data)
   ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop
For the purpose of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time:
In [1048]: %%timeit
   ...: with open('stack38704949.txt',"r") as f:
   ...:     lines = f.readlines()
   ...:     row_count = len(lines)
   ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
1000 loops, best of 3: 736 µs per loop
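Put together, the two-pass idea can be sketched without building a list at all: count rows cheaply on the first pass, then let genfromtxt skip ahead on the second. The file path and N are again hypothetical stand-ins:

```python
import os
import tempfile

import numpy as np

# Stand-in for the 10-line test file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "stack38704949.txt")
with open(path, "w") as f:
    for i in range(10):
        f.write(",".join(str(i * 10 + j) for j in range(3)) + "\n")

# Pass 1: count rows without keeping any of them.
N = 5
with open(path) as f:
    row_count = sum(1 for _ in f)

# Pass 2: skip everything except the last N rows.
arr = np.genfromtxt(path, skip_header=row_count - N, delimiter=",")
print(arr.shape)  # (5, 3)
```

This stays memory-light, but as the timings above show, reading the file twice makes it the slowest of the three approaches.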