Profile Millions of Text Files In Parallel Using An Sqlite Counter?

Question

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.

I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
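
Roughly, that's one row per property with a count column per file type. A minimal sketch of such a table (everything beyond the tally_board and property names is an assumption inferred from the UPDATE statements below):

import sqlite3

conn = sqlite3.connect("tally.db")  # hypothetical filename
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS tally_board (
        property TEXT,               -- attribute name as it appears in a file
        typeA    INTEGER DEFAULT 0,  -- times seen in typeA files
        typeB    INTEGER DEFAULT 0,
        typeC    INTEGER DEFAULT 0
    )""")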

In the special event that I see an unfamiliar property, I add it to the tally. On a typeA file that looks like:
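
Something along these lines; the lookup-then-insert shape here is an assumption:

# Sketch: give each never-before-seen property a zero-count row first,
# then the UPDATE below bumps typeA for familiar and new properties alike.
for prop in properties:
    seen = cursor.execute(
        "SELECT 1 FROM tally_board WHERE property = ?", (prop,)
    ).fetchone()
    if seen is None:
        cursor.execute("INSERT INTO tally_board (property) VALUES (?)", (prop,))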

I've got this system down! But it's slow: roughly 3M files per 36 hours in a single process. Originally I was using this trick to pass sqlite a list of properties needing an increment.

placeholder = '?'  # for SQLite; see DBAPI paramstyle
placeholders = ', '.join(placeholder for _ in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)

I learned that's a bad idea because

  1. sqlite string search is much slower than indexed search
  2. several hundred properties (some 160 characters long) make for really long SQL queries
  3. using %s instead of ? is bad security practice... (not a concern ATM)

A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:

  1. Read file for new_properties
  2. Read tally_board for rowid, property
  3. Generate script-side client_hash from step 2's read
  4. Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with the new properties
  5. Look up the rowid for every row in new_properties using the client_hash (steps 2-5 are sketched below)
  6. Write an increment to every rowid (now a proxy for property) in tally_board
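
A sketch of steps 2-5, treating client_hash as a plain dict (names not mentioned in the steps are assumptions):

# Steps 2-3: read the tally and build the script-side hash.
client_hash = dict(cur.execute("SELECT property, rowid FROM tally_board"))

# Step 4: insert rows for properties the tally has never seen.
unseen = [p for p in new_properties if p not in client_hash]
for p in unseen:
    cur.execute("INSERT INTO tally_board (property) VALUES (?)", (p,))
    client_hash[p] = cur.lastrowid  # keep the hash current

# Step 5: resolve every property in this file to its rowid.
target_rows = [client_hash[p] for p in new_properties]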

Step 6 looks like

sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute

The problems are

  • It's still slow!
  • It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.

A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire those locks.
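
In Python's sqlite3 module one way to attempt that is to disable the implicit transaction handling and issue BEGIN EXCLUSIVE by hand; a sketch, not a verified fix:

import sqlite3

# isolation_level=None stops the sqlite3 module from opening transactions
# on its own, so the exclusive lock spans exactly the code we choose.
conn = sqlite3.connect("tally.db", isolation_level=None, timeout=60)
cur = conn.cursor()
try:
    cur.execute("BEGIN EXCLUSIVE")  # no other connection may read or write
    # ... steps 2-6 go here ...
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise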

Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.

There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
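
For what it's worth, SQLite 3.24+ has a native UPSERT (INSERT ... ON CONFLICT ... DO UPDATE). Rewritten to increment the tally it might look like this; it needs a UNIQUE index on property, and the typeA column name is an assumption:

# One statement per property: start at 1 on first sight, else bump the tally.
# Requires e.g.: CREATE UNIQUE INDEX IF NOT EXISTS idx_prop ON tally_board(property)
cursor.executemany(
    """INSERT INTO tally_board (property, typeA) VALUES (?, 1)
       ON CONFLICT(property) DO UPDATE SET typeA = typeA + 1""",
    [(p,) for p in properties],
)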

Answer

The following sped things up immensely:

  • Wrote to SQLite less often. Holding most of my intermediate results in memory, then updating the DB with them every 50k files, brought execution time to about a third (35 hours to 11.5 hours); a sketch follows this list.
  • Moved the data onto my PC (for some reason my USB3.0 port was transferring data well below USB2.0 rates). This brought execution time to about a fifth (11.5 hours to 2.5 hours).
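
A sketch of the first point, buffering tallies in a collections.Counter and flushing every 50k files (the UPSERT needs SQLite 3.24+ and a unique index on property; typeA_files and read_properties are assumed helpers):

from collections import Counter

FLUSH_EVERY = 50_000   # files between DB writes, per the timings above
pending = Counter()    # in-memory tallies since the last flush

def flush():
    # One transaction per batch instead of one per file.
    with conn:
        conn.executemany(
            """INSERT INTO tally_board (property, typeA) VALUES (?, ?)
               ON CONFLICT(property) DO UPDATE
               SET typeA = typeA + excluded.typeA""",
            list(pending.items()),
        )
    pending.clear()

for n, path in enumerate(typeA_files, 1):
    pending.update(read_properties(path))  # parser yields property names
    if n % FLUSH_EVERY == 0:
        flush()
flush()  # write out the final partial batch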
