Best way to work with large amounts of CSV data quickly


Problem Description

I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output; they contain data that amplifies what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.

Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.

A colleague has a functional program written in Java to do this but it is very slow. The reason is that the CSV file with the millions of lines has to be walked through many, many, many times apparently.

My question is - yes, I am getting to it - how should I approach this in Ruby? The goal is for it to be faster (right now it takes 18+ hours, with very little CPU activity).

Can I load this many records into memory? If so, how should I do it?

I know this is a little vague. Just looking for ideas as this is a little new to me.
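On the load-into-memory question: if the two reference files are far smaller than the main file - which the customer/billing framing suggests - one common pattern is to load just those two into Ruby hashes keyed on the join column and stream the large CSV through once. A minimal sketch under that assumption; every file and column name below is invented for illustration:

```ruby
require 'csv'

# Load the two small reference files into hashes keyed on the join column.
# (All file and column names here are hypothetical.)
customers = {}
CSV.foreach('customers.csv', headers: true) do |row|
  customers[row['customer_id']] = row
end

accounts = {}
CSV.foreach('accounts.csv', headers: true) do |row|
  accounts[row['customer_id']] = row
end

# Stream the 10M-line file once, writing each merged record as we go;
# only one row of the big file is in memory at a time.
CSV.open('merged.csv', 'w') do |out|
  out << %w[transaction_id customer_name account_id amount]
  CSV.foreach('transactions.csv', headers: true) do |row|
    customer = customers[row['customer_id']]
    account  = accounts[row['customer_id']]
    next unless customer && account # no match in the reference files
    out << [row['transaction_id'], customer['name'],
            account['account_id'], row['amount']]
  end
end
```

This replaces the repeated scans with a single pass plus O(1) hash lookups per row, so runtime becomes roughly proportional to the size of the big file.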

Solution

How about using a database?

Jam the records into tables, and then query them out using joins.

The import might take a while, but the DB engine will be optimized for the join and retrieval part...
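As a concrete sketch of that suggestion, here is one way it might look in Ruby with the sqlite3 gem; the answer doesn't prescribe any particular database, and the file, table, and column names below are invented:

```ruby
require 'csv'
require 'sqlite3' # gem install sqlite3

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (transaction_id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, account_id TEXT)')

# One big transaction around the bulk insert; per-row autocommit is what
# makes naive SQLite imports crawl.
db.transaction do
  CSV.foreach('transactions.csv', headers: true) do |row|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)',
               [row['transaction_id'], row['customer_id'], row['amount']])
  end
  CSV.foreach('customers.csv', headers: true) do |row|
    db.execute('INSERT INTO customers VALUES (?, ?)',
               [row['customer_id'], row['account_id']])
  end
end

# Index the join column so the lookup side of the join is fast.
db.execute('CREATE INDEX IF NOT EXISTS idx_customers ON customers(customer_id)')

join_sql = <<~SQL
  SELECT t.transaction_id, t.customer_id, c.account_id, t.amount
  FROM transactions t
  JOIN customers c ON c.customer_id = t.customer_id
SQL

CSV.open('merged.csv', 'w') do |out|
  out << %w[transaction_id customer_id account_id amount]
  db.execute(join_sql) { |row| out << row }
end
```

The single wrapping transaction and the index on the join column matter most here; without them the import and the join both degrade toward the slow behavior the question describes. Prepared statements (Database#prepare) would speed the insert loop further.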

