Select every nth row as a Pandas DataFrame without reading the entire file


Question

I am reading a large file that contains ~9.5 million rows x 16 columns.

I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.

I am able to load the data, and then select every 500th row.
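
For reference, the load-then-filter baseline described above might look like the following minimal sketch. The in-memory `demo` data and the `step` value are stand-ins chosen for illustration, not taken from the original post; the real file would use a step of 500.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a header and 20 data rows (0..19).
demo = io.StringIO("x\n" + "\n".join(str(v) for v in range(20)))

# Step 1: load everything into memory.
df = pd.read_csv(demo)

# Step 2: keep every step-th row (500 in the question, 5 here for brevity).
step = 5
sampled = df.iloc[::step]
print(sampled["x"].tolist())
```

This works, but the whole file has to fit in memory before any rows are dropped, which is what the question is trying to avoid.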

My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

Here is a snippet of what the data looks like (first five rows). The first four rows are out of order, but the rest of the dataset looks ordered (by time):

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   RatecodeID  store_and_fwd_flag  PULocationID    DOLocationID    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   1   2017-01-09 11:13:28 2017-01-09 11:25:45 1   3.30    1   N   263 161 1   12.5    0.0 0.5 2.00    0.00    0.3 15.30
1   1   2017-01-09 11:32:27 2017-01-09 11:36:01 1   0.90    1   N   186 234 1   5.0 0.0 0.5 1.45    0.00    0.3 7.25
2   1   2017-01-09 11:38:20 2017-01-09 11:42:05 1   1.10    1   N   164 161 1   5.5 0.0 0.5 1.00    0.00    0.3 7.30
3   1   2017-01-09 11:52:13 2017-01-09 11:57:36 1   1.10    1   N   236 75  1   6.0 0.0 0.5 1.70    0.00    0.3 8.50
4   2   2017-01-01 00:00:00 2017-01-01 00:00:00 1   0.02    2   N   249 234 2   52.0    0.0 0.5 0.00    0.00    0.3 52.80

Recommended answer

Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

One option is the skiprows parameter of read_csv, which accepts a list-like of line numbers to skip (and so, indirectly, determines which lines are kept). You can create an np.arange with a length equal to the number of rows in the file, remove every 500th element from it using np.delete, and pass the result as skiprows. That way only every 500th row is loaded into the DataFrame:

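import numpy as np
import pandas as pd

n_rows = 9_500_000  # approximate number of lines in the file; must be an int, not 9.5e6
skip = np.arange(n_rows)
# Drop 0, 500, 1000, ... from the skip list so the header (line 0) and every 500th line are kept
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)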
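
As a variation on the same idea, skiprows also accepts a callable, which avoids building a multi-million-element array and does not require knowing the row count in advance. The sketch below is an alternative to the np.delete approach, not part of the original answer; the helper name `read_every_nth` and the in-memory `demo` data are hypothetical and used only for illustration.

```python
import io
import pandas as pd

def read_every_nth(source, n):
    """Load the header plus every n-th line of a CSV into a DataFrame."""
    # pandas calls the function with each 0-based line index and skips
    # the line when it returns True. Line 0 (the header) is always kept.
    return pd.read_csv(source, skiprows=lambda i: i % n != 0)

# Hypothetical stand-in for the real file: a header and 10 data rows (1..10).
demo = io.StringIO("x\n" + "\n".join(str(v) for v in range(1, 11)))
sample = read_every_nth(demo, 5)
print(sample["x"].tolist())
```

For the file in the question, the call would be something like read_every_nth('my_file.csv', 500). In both variants pandas still has to scan through the file to find the line breaks; the saving is that skipped rows are never parsed or held in memory.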
