导入大型CSV文件的最佳做法 [英] Best practices for importing large CSV files

查看:314
本文介绍了导入大型CSV文件的最佳做法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的公司每个月都会获得一组包含银行帐户信息的CSV文件,我需要将其导入数据库。其中一些文件可能非常大。例如,一个是大约33MB和大约65,000行。

My company gets a set of CSV files full of bank account info each month that I need to import into a database. Some of these files can be pretty big. For example, one is about 33MB and about 65,000 lines.

现在我有一个symfony / Doctrine应用程序(PHP)读取这些CSV文件并将它们导入数据库。我的数据库有大约35个不同的表,在导入过程中,我将这些行拆分为组成对象并将它们插入到数据库中。这一切都很漂亮,除了(每行大约需要四分之一秒)并且它使用了大量的内存。

Right now I have a symfony/Doctrine app (PHP) that reads these CSV files and imports them into a database. My database has about 35 different tables and on the process of importing, I take these rows, split them up into their constituent objects and insert them into the database. It all works beautifully, except it's slow (each row takes about a quarter second) and it uses a lot of memory.

内存使用太糟糕了,我必须拆分我的CSV文件。一个20,000行的文件几乎没有进入。当它接近结束时,我的内存使用率达95%。导入65,000行文件根本不可能。

The memory use is so bad that I have to split up my CSV files. A 20,000-line file barely makes it in. By the time it's near the end, I'm at like 95% memory usage. Importing that 65,000 line file is simply not possible.

我发现symfony是构建应用程序的特殊框架,我通常不会考虑使用其他任何东西,但是在这种情况下,我愿意以性能的名义将所有的先入之见抛到窗外。我没有承诺任何特定的语言,DBMS或其他任何东西。

I've found symfony to be an exceptional framework for building applications and I normally wouldn't consider using anything else, but in this case I'm willing to throw all my preconceptions out the window in the name of performance. I'm not committed to any specific language, DBMS, or anything.

Stack Overflow不喜欢主观问题,所以我将尝试将其作为un - 尽可能的主观:对于那些不仅仅有意见但经验导入大型CSV文件的人,您过去使用过哪些成功的工具/做法?

Stack Overflow doesn't like subjective questions so I'm going to try to make this as un-subjective as possible: for those of you have not just an opinion but experience importing large CSV files, what tools/practices have you used in the past that have been successful?

例如,您是否只使用Django的ORM / OOP并且您没有遇到任何问题?或者你是否将整个CSV文件读入内存并准备了一些巨大的 INSERT 语句?

For example, do you just use Django's ORM/OOP and you haven't had any problems? Or do you read the entire CSV file into memory and prepare a few humongous INSERT statements?

再次,我想要不仅仅是一种意见,而是过去实际上对你有用的东西。

Again, I want not just an opinion, but something that's actually worked for you in the past.

编辑:我不只是将85列CSV电子表格导入一个85列数据库表。我正在将数据规范化并将其放入几十个不同的表中。出于这个原因,我不能只使用 LOAD DATA INFILE (我正在使用MySQL)或任何其他只读取CSV文件的DBMS功能。

I'm not just importing an 85-column CSV spreadsheet into one 85-column database table. I'm normalizing the data and putting it into dozens of different tables. For this reason, I can't just use LOAD DATA INFILE (I'm using MySQL) or any other DBMS's feature that just reads in CSV files.

此外,我不能使用任何特定于Microsoft的解决方案。

Also, I can't use any Microsoft-specific solutions.

推荐答案

我有这个完全相同的问题大约2个星期前。我写了一些.NET来做ROW BY ROW插入,根据我的计算得到的数据量,用这种方式需要一周左右。

I had this exact same problem about 2 weeks ago. I wrote some .NET to do ROW BY ROW inserts and by my calculations with the amount of data I had, it would take around a week to this it this way.

因此,我使用字符串构建器创建一个巨大的查询并将其一次性发送到我的关系系统。它从用了一个星期到用了5分钟。现在我不知道你正在使用什么样的关系系统,但是如果有大量的查询,你可能需要调整你的max_allowed_pa​​cket参数或者类似的。

So instead I used a string builder to create one HUGE query and sent it to my relational system all at once. It went from taking a week to taking 5 minutes. Now I don't know what relational system you are using, but with enormous queries you'll probably have to tweak your max_allowed_packet param or similar.

这篇关于导入大型CSV文件的最佳做法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆