Processing a large number of CSV files


Problem Description

I'll try to expand on the title of my question. I work on a Ruby project. I have to process a large amount of data (around 120,000 records) stored in CSV files. I have to read this data, process it, and put it into the DB. Right now that takes a couple of days, and I need to make it much faster. The problem is that sometimes during processing I hit some anomalies and have to repeat the whole import process. I decided that improving performance is more important than hunting for the bug with a small amount of data, so for now I am sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading data from CSV. I see the following steps:


  1. Benchmark and fix the most problematic bottlenecks (a rough sketch of how I time the stages is below this list).

  2. Split up CSV loading and processing, e.g. create separate tables and load the data into them.

  3. Introduce threads for loading data from CSV.
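
This is roughly how I time the individual stages with the standard Benchmark module; the file name and RecordModel are just placeholders for my real file and model:

require 'csv'
require 'benchmark'

rows = nil
records = nil
Benchmark.bm(16) do |x|
  # parse the whole file into memory
  x.report('parse CSV:')      { rows = CSV.read('data.csv', headers: true) }
  # build ActiveRecord objects without saving them
  x.report('build records:')  { records = rows.map { |r| RecordModel.new(r.to_h) } }
  # one INSERT per record - currently the slow part
  x.report('insert into DB:') { records.each(&:save!) }
end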

For now I use the standard Ruby CSV library. Do you recommend a better gem?
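
For reference, this is roughly how I read the files right now with the standard library (the path and the process call are just placeholders); CSV.foreach at least streams the rows instead of loading the whole file:

require 'csv'

CSV.foreach('import/data.csv', headers: true) do |row|
  # each row is a CSV::Row; row.to_h gives a plain hash keyed by header
  process(row.to_h)   # process stands in for my own logic
end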

If some of you are familiar with a similar problem, I would be happy to hear your opinion.

Edit:

Database: PostgreSQL

System: Linux

Recommended Answer

I haven't had the opportunity to test it myself, but recently I came across this article, which seems to do the job:

https://infinum.co/the-capsized-eight/articles/how-to-efficiently-process-large-excel-files-using-ruby

You'll have to adapt it to use CSV instead of XLSX. For future reference, in case the site goes down, here is the code. It works by writing BATCH_IMPORT_SIZE records to the database at the same time, which should give a huge performance gain.

# uses the creek gem to stream the spreadsheet and activerecord-import for bulk inserts
class ExcelDataParser
  def initialize(file_path)
    @file_path = file_path
    @records = []
    # start at 0 so the batch-size and end-of-file checks below line up
    # with the 1-based row count
    @counter = 0
  end

  BATCH_IMPORT_SIZE = 1000

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using activerecord-import gem
    RecordModel.import(records)

    # clear records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
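
If it helps, here is a minimal sketch of the same batching idea adapted to CSV with the standard library; CsvDataParser, the header handling, and RecordModel are my assumptions, not tested code:

require 'csv'

class CsvDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
  end

  def call
    # CSV.foreach streams the file instead of reading it all into memory
    CSV.foreach(@file_path, headers: true) do |row|
      @records << RecordModel.new(row.to_h)   # columns must match the model attributes
      flush if @records.size >= BATCH_IMPORT_SIZE
    end
    flush # import the last, partial batch
  end

  private

  def flush
    return if @records.empty?
    RecordModel.import(@records)   # bulk insert via activerecord-import
    @records.clear
  end
end

# CsvDataParser.new('path/to/file.csv').call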


