Read the last N lines of a CSV file in Python with numpy / pandas


Question

Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?

  1. I cannot do skip_header in numpy or skiprows in pandas because the length of the file varies, and I would always need the last N rows.

I know I can use pure Python to read line by line from the last row of the file, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which is essentially using C) would be really appreciated.

Answer

With a small 10-line test file I tried two approaches: parse the whole thing and select the last N rows, versus load all the lines but parse only the last N:

In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop

In [1026]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')

1000 loops, best of 3: 378 µs per loop

This was tagged as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there used

from collections import deque

and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.

In [1031]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',') 

1000 loops, best of 3: 382 µs per loop

Basically the same time as readlines and slice.

deque may have an advantage when the file is very large, and it gets costly to hang onto all the lines. I don't think it saves any file reading time. Lines still have to be read one by one.
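
Since the question also mentions pandas, here is a minimal sketch of the same deque idea producing a DataFrame; unlike genfromtxt, pd.read_csv wants a file-like object, so this is the one place a StringIO wrapper genuinely helps (the file name and header handling are assumptions):

from collections import deque
from io import StringIO

import pandas as pd

# Stream through the file once, keeping only the final 5 lines.
with open('stack38704949.txt', 'r') as f:
    last5 = deque(f, 5)

# read_csv needs a file-like object, so wrap the joined lines in StringIO.
df = pd.read_csv(StringIO(''.join(last5)), header=None)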

Timings for the row_count followed by skip_header approach are slower; it requires reading the file twice, and skip_header still has to read the lines it skips.

In [1046]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f,delimiter = ",")
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop

For purposes of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time.

In [1048]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:    lines = f.readlines()
      ...:    row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

1000 loops, best of 3: 736 µs per loop
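
For completeness, here is a minimal sketch of the pure-Python "read from the end" idea the question alludes to; it seeks backwards in fixed-size blocks so a large file never has to be read in full. The block size and file name are illustrative assumptions, and the returned lines can be fed straight to genfromtxt:

import os

def tail_lines(path, n, block=4096):
    """Return the last n lines of a file by seeking backwards from the end."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        data = b''
        # Read blocks backwards until we have at least n newlines
        # (or the whole file), then keep only the final n lines.
        while end > 0 and data.count(b'\n') <= n:
            start = max(0, end - block)
            f.seek(start)
            data = f.read(end - start) + data
            end = start
        return data.splitlines()[-n:]

# np.genfromtxt(tail_lines('stack38704949.txt', 5), delimiter=',')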
