What is the best way to process large CSV files?

Problem Description


I have a third-party system that generates a large amount of data each day (CSV files stored on an FTP server). Three types of files are generated:

  • every 15 minutes (2 files). These files are pretty small (~2 MB)
  • every day at 5 PM (~200-300 MB)
  • every midnight (this CSV file is about 1 GB)

Overall, the size of the 4 CSVs is about 1.5 GB. But we should take into account that some of the files are generated every 15 minutes. This data also needs to be aggregated (not a hard process, but it will definitely take time). I need fast responses. I am thinking about how to store this data and about the overall implementation.

We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard will not handle such a load alongside the other applications. What comes to mind:

  • Upgrading to MS SQL Enterprise on a separate server.
  • Using PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.

What would you recommend here? Perhaps there are better alternatives.

Edit #1

Those large files are new data for each day.

Solution

Okay. After spending some time with this problem (reading, consulting, experimenting, doing several PoCs), I came up with the following solution.

Tl;dr

Database: PostgreSQL, as it handles CSV well and is free and open source.

Tool: Apache Spark is a good fit for this type of task, and its performance is good.

DB

Regarding the database, it is an important decision: what to pick, and how it will work in the future with such an amount of data. It should definitely be a separate server instance, in order not to generate additional load on the main database instance and not to block other applications.

NoSQL

I thought about using Cassandra here, but this solution would be too complex right now. Cassandra does not support ad-hoc queries. Cassandra's data storage layer is basically a key-value storage system, which means that you must "model" your data around the queries you need rather than around the structure of the data itself.

RDBMS

I didn't want to overengineer here, so I settled on an RDBMS.

MS SQL Server

It is a viable way to go, but the big downside here is pricing: it is pretty expensive. The Enterprise edition costs a lot of money, taking into account our hardware. Regarding pricing, you can read this policy document.

Another drawback here is the support for CSV files, which will be our main data source. MS SQL Server can neither import nor export CSV:

  • MS SQL Server silently truncates text fields.

  • MS SQL Server's text encoding handling goes wrong.

  • MS SQL Server throws an error message because it doesn't understand quoting or escaping.

More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature product and well battle-tested. I have heard a lot of positive feedback on it from others (of course, there are some trade-offs too). It has a more classic SQL syntax and good CSV support; moreover, it is open source.
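
As an illustration of that CSV support, here is a minimal sketch of bulk-loading one of the daily files through the COPY protocol of the PostgreSQL JDBC driver. The connection settings, the file path and the raw_events table are hypothetical placeholders, not the real schema:

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvBulkLoad {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; adjust host, database and credentials.
        String url = "jdbc:postgresql://db-host:5432/reports";

        try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
             Reader csv = new FileReader("/data/ftp/daily_midnight.csv")) {

            // COPY streams the whole file in one command instead of row-by-row INSERTs.
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copyManager.copyIn(
                    "COPY raw_events FROM STDIN WITH (FORMAT csv, HEADER true)", csv);

            System.out.println("Loaded " + rows + " rows");
        }
    }
}
```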

It is worth mentioning that SSMS is way better than PGAdmin: SSMS has autocomplete and shows multiple result sets (when you run several queries you get all the results at once, whereas in PGAdmin you only get the last one).

Anyway, right now I'm using DataGrip from JetBrains.

Processing Tool

I've looked through Spring Batch and Apache Spark. Spring Batch is a bit too low-level to use for this task, and Apache Spark makes it easier to scale later if that is needed. Anyway, Spring Batch could do this work too.

Regarding an Apache Spark example, the code can be found in the learning-spark project. My choice is Apache Spark for now.
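
For reference, here is a minimal sketch of the kind of job this ends up being: read one of the CSV drops, aggregate it, and write the result into PostgreSQL over JDBC. The file path, the device_id/value columns and the daily_aggregates table are hypothetical placeholders rather than the actual schema:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.sum;

public class DailyCsvAggregation {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-aggregation")
                .master("local[*]")               // or the URL of a real cluster
                .getOrCreate();

        // Read the raw CSV with a header row and let Spark infer the column types.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/ftp/daily_midnight.csv");

        // Hypothetical aggregation: one row per device with the total and average value.
        Dataset<Row> aggregated = raw.groupBy("device_id")
                .agg(sum("value").alias("total_value"),
                     avg("value").alias("avg_value"));

        // Write the result into PostgreSQL over JDBC (connection details are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "loader");
        props.setProperty("password", "secret");
        props.setProperty("driver", "org.postgresql.Driver");

        aggregated.write()
                .mode(SaveMode.Overwrite)
                .jdbc("jdbc:postgresql://db-host:5432/reports", "daily_aggregates", props);

        spark.stop();
    }
}
```

Running it locally only needs the spark-sql and PostgreSQL JDBC dependencies on the classpath; on a cluster, the master URL would point at the cluster manager instead of local[*].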
