Processing a large number of CSV files
Question
I'll try to expand on the title of my question. I work on a Ruby project in which I have to process a large amount of data (around 120,000 records) stored in CSV files. I have to read this data, process it, and put it in the DB. Right now that takes a couple of days, and I need to make it much faster. The problem is that sometimes I get anomalies during processing and have to repeat the whole import. I decided it is more important to improve performance than to hunt for the bug using a small data set, so for now I'm sticking with CSV files. I decided to benchmark the processing script to find the bottlenecks and improve loading data from CSV. I see the following steps:
- Benchmark and fix the most problematic bottlenecks
- Split CSV loading from processing, e.g. create a separate table and load the data into it
- Introduce threads for loading data from CSV
For now I use the standard Ruby CSV library. Do you recommend a better gem?
If any of you are familiar with a similar problem, I would be happy to hear your opinion.
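As a starting point for the benchmarking step, streaming the file and collecting rows into fixed-size batches can be sketched with Ruby's standard CSV library. This is only an illustrative sketch: the file path, header layout, and batch size are made up, not taken from the question.

```ruby
require "csv"

BATCH_SIZE = 1000 # hypothetical batch size for illustration

# Streams the CSV row by row (so the whole file is never held in
# memory) and yields arrays of at most batch_size row hashes.
def each_batch(path, batch_size: BATCH_SIZE)
  batch = []
  CSV.foreach(path, headers: true) do |row|
    batch << row.to_h
    if batch.size >= batch_size
      yield batch
      batch = []
    end
  end
  yield batch unless batch.empty? # flush the final partial batch
end
```

Each yielded batch could then be handed to a bulk insert instead of saving records one by one, which is usually where most of the time goes.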
Edit:
Database: PostgreSQL
System: Linux
Answer
I haven't had the opportunity to test it myself, but recently I came across this article, which seems to do the job.
You'll have to adapt it to use CSV instead of XLSX. For future reference, in case the site goes down, here is the code. It works by writing BATCH_IMPORT_SIZE records to the database at once, which should give a huge performance gain.
class ExcelDataParser
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
    # start at 0 so the counter equals the number of rows processed
    # so far; otherwise the final partial batch is never imported
    @counter = 0
  end

  def call
    rows.each do |row|
      increment_counter
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  # in this example, we assume that the
  # content is in the first Excel sheet
  def rows
    @rows ||= book.sheets.first.rows
  end

  def increment_counter
    self.counter += 1
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
    RecordModel.new(...)
  end

  def import_records
    # save multiple records using activerecord-import gem
    RecordModel.import(records)
    # clear records array
    records.clear
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
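A possible CSV adaptation of the class above might look like the sketch below. Since RecordModel and the database are not available here, the importer is injected as a callable; in the real app it would wrap RecordModel.import from the activerecord-import gem. Class and parameter names are illustrative.

```ruby
require "csv"

# CSV variant of the batching parser above. Rows are streamed,
# buffered, and flushed to the importer in groups.
class CsvDataParser
  BATCH_IMPORT_SIZE = 1000

  # importer: anything responding to #call(records),
  # e.g. ->(recs) { RecordModel.import(recs) }
  def initialize(file_path, importer:)
    @file_path = file_path
    @importer = importer
    @records = []
  end

  def call
    CSV.foreach(@file_path, headers: true) do |row|
      @records << row.to_h # build, don't save, each record
      flush if @records.size >= BATCH_IMPORT_SIZE
    end
    flush # import the final partial batch
  end

  private

  def flush
    return if @records.empty?
    @importer.call(@records)
    @records = []
  end
end
```

Flushing on buffer size rather than on a row counter avoids needing to know the row count up front, so the end-of-file check from the Excel version becomes unnecessary.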