Importing chunks of CSV rows with Sidekiq, Resque, etc


Question

I'm writing an importer that imports data from a CSV file into a DB table. To avoid loading the whole file into memory, I'm using Smarter CSV to parse the file into chunks of 100 rows and load one chunk at a time.

I'll be passing each chunk of 100 to a background job processor such as Resque or Sidekiq to import those rows in bulk.

1. Passing 100 rows as a job argument results in a string that's about 5,000 characters long. Does this cause any problems in general, or particularly with the back-end store (e.g. Sidekiq uses Redis - does Redis allow storing a key of that length)? I don't want to import one row at a time because that would create 50,000 jobs for a 50,000-row file.

2. I want to know the progress of the overall import, so I planned to have each job (chunk of 100) update a DB field and increase the count by 1 when it's done (not sure if there's a better approach?). Since these jobs process in parallel, is there any danger of two jobs trying to update the same field by 1 and overwriting each other? Or do DB writes lock the table so only one can write at a time?

Thanks!

Answer

Passing 100 rows as a job argument results in a string that's about 5,000 characters long.

Redis can handle that without problems.
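For scale: a chunk of 100 row-hashes serializes to a few kilobytes of JSON, and a single Redis string value can hold up to 512 MB, so the payload is nowhere near any limit. As a rough sketch of the worker side (assuming Sidekiq and a hypothetical Rails model named Product; the worker name and column handling are illustrative, not taken from the question):

    # Minimal sketch of a chunk-import worker; ImportChunkWorker and the
    # Product model are hypothetical names.
    class ImportChunkWorker
      include Sidekiq::Worker

      # rows arrives as an array of up to 100 hashes. Sidekiq serializes
      # job arguments to JSON, so symbol keys from the CSV parser come
      # back as string keys here.
      def perform(rows)
        # One multi-row INSERT instead of 100 single inserts (Rails 6+);
        # the activerecord-import gem offers the same on older Rails.
        Product.insert_all(rows)
      end
    end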

Since these jobs process in parallel, is there any danger of two jobs trying to update the same field by 1 and overwriting each other?

If you do read + set, then yes, it's subject to race conditions. You can leverage Redis for the task and use its atomic INCR.
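Inside the worker above, that could look like the sketch below; the import_id, the Redis key name, and the Import model are hypothetical. The race only exists if the counter is read into Ruby, incremented, and written back, so pushing the increment into Redis (or into SQL) sidesteps it:

    # Option 1: atomic counter in Redis, using Sidekiq's own connection
    # pool; concurrent INCRs never lose updates.
    Sidekiq.redis { |conn| conn.incr("import:#{import_id}:chunks_done") }

    # Option 2: an atomic increment in the database, expressed in SQL
    # rather than as read-modify-write in Ruby.
    Import.where(id: import_id).update_all("processed_chunks = processed_chunks + 1")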

To avoid loading the whole file into memory, I'm using Smarter CSV to parse the file into chunks of 100

Depends on what you're doing with those rows, but 50k rows by themselves are not a great strain on memory, I'd say.
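For completeness, the enqueueing side could look like this sketch, assuming the smarter_csv gem and the hypothetical ImportChunkWorker above; the file name and chunk size are illustrative. Only one chunk is held in memory at a time, and each chunk becomes one job:

    require 'smarter_csv'

    total_chunks = 0
    SmarterCSV.process('products.csv', chunk_size: 100) do |chunk|
      # chunk is an array of up to 100 row hashes with symbol keys;
      # stringify them so the job payload is made of plain JSON types.
      ImportChunkWorker.perform_async(chunk.map { |row| row.transform_keys(&:to_s) })
      total_chunks += 1
    end

    # total_chunks is the denominator for the progress counter above.
    puts "enqueued #{total_chunks} jobs"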

