Is it a good idea to read/write files in parallel?

Question

I have a large number of data files that describe the weather at a large number of weather stations. The data are hourly and are stored in one file per date.

For example:

20100101.csv
20100102.csv
20100103.csv
.
.
20140228.csv

I need to aggregate the data by station and then write it to disk. That is, for each of those weather files, I need to extract the data for station i and then write it to disk.

Output:

station_001.csv
station_002.csv
.
.
station_999.csv

To speed things up, I decided to read in the daily files in parallel using the foreach and doMC packages, and I also write the station files to disk after aggregating by stations in parallel.

More specifically, I used foreach to read in the files and combined them using .combine="rbind" (I have enough memory to make one huge data set in memory). Afterwards, I have another foreach loop where I subset the data by station and then write to disk. I found that I experienced a very good speed boost by doing the read/write in parallel.
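The workflow described above might look roughly like the following sketch. It is not the asker's actual code: the function name `aggregate_by_station`, the directory arguments, the integer `station` column, and the core count are all assumptions made for illustration.

```r
library(foreach)
library(doMC)

registerDoMC(cores = 2)  # assumed core count; adjust to the machine

# Hypothetical helper: in_dir holds the daily files (YYYYMMDD.csv),
# out_dir receives one file per station. Assumes every daily file has
# an integer column named "station".
aggregate_by_station <- function(in_dir, out_dir) {
  daily_files <- list.files(in_dir, pattern = "^[0-9]{8}\\.csv$",
                            full.names = TRUE)

  # Read each daily file in its own worker and stack the results
  # into one large data frame.
  all_data <- foreach(f = daily_files, .combine = "rbind") %dopar% {
    read.csv(f, stringsAsFactors = FALSE)
  }

  # Subset by station and write one file per station. Each worker
  # writes a different file, so no two workers share an output file.
  stations <- sort(unique(all_data$station))
  written <- foreach(s = stations, .combine = "c") %dopar% {
    out_file <- file.path(out_dir, sprintf("station_%03d.csv", s))
    write.csv(all_data[all_data$station == s, ], out_file,
              row.names = FALSE)
    out_file
  }
  written  # paths of the station files that were written
}
```

Calling `aggregate_by_station("daily", "stations")` would then read every daily file under `daily` and return the paths of the station files it wrote.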

My question is: Is it a good idea to read/write in parallel? I made sure that different threads do not read the same data file or write to the same station file, but after some googling, it seems that it may not be a good idea to parallelize i/o tasks? (Example I found saying no to parallel Input/Output, and a post on R-bloggers showing parallel reads)

Recommended answer

Performance Pro

  • Using multiple threads can improve performance on multi-core machines.

Performance Con

  • When reading from disk, CPU performance is typically not your bottleneck. Files on disk are, more often than not, written in as many sequential blocks as possible. This means that the read/write head on a spinning disk does not have to move far to read the next segment. If you perform the task in parallel, the head has to move back and forth repeatedly to pick up each file where it left off. This means that your disk read/write speed will ultimately be slower*.

*Solid-state drives may not have this problem (I don't know much about SSDs, but I imagine they aren't impacted at all by context switching).
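Since the trade-off depends on the hardware, one way to decide is simply to time both approaches on the machine in question. The sketch below is only a rough measurement under stated assumptions: `time_reads` is a hypothetical helper, the core count is a guess, and the operating system may cache files after the first pass, which would flatter whichever variant runs second. Repeating the runs, or swapping their order, gives a fairer picture.

```r
library(foreach)
library(doMC)

registerDoMC(cores = 2)  # assumed core count; adjust to the machine

# Hypothetical helper: time reading the same CSV files one after
# another and then in parallel. Only row counts are returned from each
# read, so the timing is not dominated by copying data between workers.
time_reads <- function(files) {
  sequential <- system.time(
    vapply(files, function(f) nrow(read.csv(f)), integer(1))
  )[["elapsed"]]

  parallel <- system.time(
    foreach(f = files, .combine = "c") %dopar% nrow(read.csv(f))
  )[["elapsed"]]

  # Elapsed wall-clock seconds for each approach.
  c(sequential = sequential, parallel = parallel)
}
```

If the parallel time is not clearly lower on a spinning disk, that would be consistent with the seek overhead described in the answer.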
