SQLAlchemy: Scan huge tables using ORM?
Question
I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing I created a huge table containing my picture archive, indexed by SHA-1 hashes (to remove duplicates :-)). Which was impressively fast...
For fun I did the equivalent of a select * over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
    print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be collected after its hash is written out!?
Answer
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)
loads only 5 pictures at a time. It seems that the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten. In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per is actually the Right Way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
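As a minimal, self-contained sketch of the pattern above (the Picture model here is a reduced stand-in for the original picture-archive table, with assumed column names), yield_per streams the result in small batches instead of materializing every row up front:

```python
# Sketch: streaming an ORM query in batches of 5 with Query.yield_per().
# The Picture model and its columns are assumptions, not the asker's schema.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Picture(Base):
    __tablename__ = "pictures"
    id = Column(Integer, primary_key=True)
    sha1 = Column(String(40), unique=True)  # hex SHA-1 digest of the file

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
# Insert a few fake rows; real code would hash the picture files.
session.add_all(Picture(sha1=f"{i:040x}") for i in range(20))
session.commit()

count = 0
# Rows are fetched and turned into Picture objects 5 at a time,
# so memory stays bounded regardless of the table size.
for p in session.query(Picture).yield_per(5):
    count += 1
print(count)
```

This only stays safe as long as each batch is treated read-only and no eagerly loaded collections are involved, per the warning quoted above.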