Profile Millions of Text Files In Parallel Using An Sqlite Counter?

Question

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.

I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
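
Roughly, that's one row per property with a count column per file type. A minimal sketch of such a table (everything beyond the tally_board and property names is an assumption inferred from the UPDATE statements below):

import sqlite3

conn = sqlite3.connect("tally.db")  # hypothetical filename
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS tally_board (
        property TEXT,               -- attribute name as it appears in a file
        typeA    INTEGER DEFAULT 0,  -- times seen in typeA files
        typeB    INTEGER DEFAULT 0,
        typeC    INTEGER DEFAULT 0
    )""")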

In the special event that I see an unfamiliar property, I add it to the tally. On a typeA file that looks like:
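
Something along these lines; the lookup-then-insert shape here is an assumption:

# Sketch: give each never-before-seen property a zero-count row first,
# then the UPDATE below bumps typeA for familiar and new properties alike.
for prop in properties:
    seen = cursor.execute(
        "SELECT 1 FROM tally_board WHERE property = ?", (prop,)
    ).fetchone()
    if seen is None:
        cursor.execute("INSERT INTO tally_board (property) VALUES (?)", (prop,))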

I've got this system down! But it's slow: roughly 3M files per 36 hours in a single process. Originally I was using this trick to pass sqlite a list of properties needing an increment.

placeholder = '?'  # for SQLite; see DBAPI paramstyle
placeholders = ', '.join(placeholder for _ in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)

I learned that's a bad idea because

  1. sqlite string search is much slower than indexed search
  2. several hundred properties (some 160 characters long) make for really long SQL queries
  3. using %s instead of ? is bad security practice... (not a concern ATM)

A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:

  1. Read file for new_properties
  2. Read tally_board for rowid, property
  3. Generate script-side client_hash from step 2's read
  4. Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with the new properties
  5. Look up the rowid for every row in new_properties using the client_hash (steps 2-5 are sketched below)
  6. Write an increment to every rowid (now a proxy for property) in tally_board
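
A sketch of steps 2-5, treating client_hash as a plain dict (names not mentioned in the steps are assumptions):

# Steps 2-3: read the tally and build the script-side hash.
client_hash = dict(cur.execute("SELECT property, rowid FROM tally_board"))

# Step 4: insert rows for properties the tally has never seen.
unseen = [p for p in new_properties if p not in client_hash]
for p in unseen:
    cur.execute("INSERT INTO tally_board (property) VALUES (?)", (p,))
    client_hash[p] = cur.lastrowid  # keep the hash current

# Step 5: resolve every property in this file to its rowid.
target_rows = [client_hash[p] for p in new_properties]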

Step 6 looks like

sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute

The problems are

  • It's still slow!
  • It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.

A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire those locks.
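
In Python's sqlite3 module one way to attempt that is to disable the implicit transaction handling and issue BEGIN EXCLUSIVE by hand; a sketch, not a verified fix:

import sqlite3

# isolation_level=None stops the sqlite3 module from opening transactions
# on its own, so the exclusive lock spans exactly the code we choose.
conn = sqlite3.connect("tally.db", isolation_level=None, timeout=60)
cur = conn.cursor()
try:
    cur.execute("BEGIN EXCLUSIVE")  # no other connection may read or write
    # ... steps 2-6 go here ...
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise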

Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.

There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
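
For what it's worth, SQLite 3.24+ has a native UPSERT (INSERT ... ON CONFLICT ... DO UPDATE). Rewritten to increment the tally it might look like this; it needs a UNIQUE index on property, and the typeA column name is an assumption:

# One statement per property: start at 1 on first sight, else bump the tally.
# Requires e.g.: CREATE UNIQUE INDEX IF NOT EXISTS idx_prop ON tally_board(property)
cursor.executemany(
    """INSERT INTO tally_board (property, typeA) VALUES (?, 1)
       ON CONFLICT(property) DO UPDATE SET typeA = typeA + 1""",
    [(p,) for p in properties],
)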

Answer

The following sped things up immensely:

  • Wrote to SQLite less often. Holding most of my intermediate results in memory, then updating the DB with them every 50k files, brought execution time to about a third (35 hours to 11.5 hours); a sketch follows this list.
  • Moved the data onto my PC (for some reason my USB3.0 port was transferring data well below USB2.0 rates). This brought execution time to about a fifth (11.5 hours to 2.5 hours).
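
A sketch of the first point, buffering tallies in a collections.Counter and flushing every 50k files (the UPSERT needs SQLite 3.24+ and a unique index on property; typeA_files and read_properties are assumed helpers):

from collections import Counter

FLUSH_EVERY = 50_000   # files between DB writes, per the timings above
pending = Counter()    # in-memory tallies since the last flush

def flush():
    # One transaction per batch instead of one per file.
    with conn:
        conn.executemany(
            """INSERT INTO tally_board (property, typeA) VALUES (?, ?)
               ON CONFLICT(property) DO UPDATE
               SET typeA = typeA + excluded.typeA""",
            list(pending.items()),
        )
    pending.clear()

for n, path in enumerate(typeA_files, 1):
    pending.update(read_properties(path))  # parser yields property names
    if n % FLUSH_EVERY == 0:
        flush()
flush()  # write out the final partial batch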
