Better understand SQLAlchemy's `yield_per()` problems


Problem description

Quoting the SQLAlchemy documentation:



The Query.yield_per() method is not compatible with most eager loading schemes, including subqueryload and joinedload with collections.

Warning

Use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.

In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=’joined’ or ‘subquery’) since those collections will be cleared for a new load when encountered in a subsequent result batch. In the case of ‘subquery’ loading, the full result for all rows is fetched which generally defeats the purpose of yield_per().

Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available. The memory use of raw database rows is much less than that of an ORM-mapped object, but should still be taken into consideration when benchmarking.

I have real trouble understanding how yield_per() works and what exactly the problem with using this method is. Also, what is the right way to work around these problems and keep using this function to iterate over a huge number of rows?

I'm interested in all constructive information you have, but here are some hint questions:

  • How can there be multiple instances of the same row? Only through relationships (if two rows of the table being iterated have an FK to the same row in another table)? Is there a problem if you don't know this happens, or if you only read attributes on the relationships?
  • lazy=’joined’ or ‘subquery’ are not possible, but why exactly? Both of them are simply parts of your query on which you call yield_per().
    • If they are cleared in a subsequent result batch, then simply load them again. So where is the problem? Or is the only problem that you lose the changes to your relationships, if you have made any?
    • In the case of ‘subquery’ loading, why are all rows fetched? The SQL server may have to hold a big result set, but why not simply return the results in batches, one after the other, for the entire query?
    • In an example in the yield_per() doc, q = sess.query(Object).yield_per(100).options(lazyload('*'), joinedload(Object.some_related)), they deactivate eager loading with lazyload('*') but keep a single joined load (see the sketch after this list). Is there a way to still use yield_per() with eager loading? What are the conditions?
  • They say psycopg2 is the only DBAPI which supports streaming results. So is that the only DBAPI you can use with yield_per()? As far as I understand, yield_per uses the DBAPI's cursor.fetchmany() (example), which many DBAPIs support. And as far as I understand, cursor.fetchmany() supports fetching only part of the result instead of everything (if it fetched everything anyway, why would the function exist?).
  • I have the feeling that yield_per() is entirely safe (even with eagerload) if you only do read access (for example for statistics). Is that correct?
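
For concreteness, here is a fleshed-out version of the docs pattern quoted in that sub-question. It is only a minimal sketch under assumptions: the 1.x-style Query API against an in-memory SQLite database, with every name other than the docs' Object and some_related invented for illustration.

    from sqlalchemy import Column, ForeignKey, Integer, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import Session, joinedload, lazyload, relationship

    Base = declarative_base()

    class Related(Base):
        __tablename__ = "related"
        id = Column(Integer, primary_key=True)

    class Object(Base):
        __tablename__ = "object"
        id = Column(Integer, primary_key=True)
        related_id = Column(Integer, ForeignKey("related.id"))
        # Many-to-one: each Object row joins to at most one Related row,
        # so a joined eager load cannot split one Object across batches.
        some_related = relationship(Related)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    sess = Session(engine)
    sess.add_all([Object(some_related=Related()) for _ in range(250)])
    sess.commit()

    # The docs pattern: turn off all eager loading with lazyload('*'),
    # then re-enable a single many-to-one joined load.
    q = sess.query(Object).yield_per(100).options(
        lazyload("*"), joinedload(Object.some_related)
    )
    for obj in q:
        obj.some_related  # already loaded; no per-row lazy query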

Solution

Both of the problematic loading strategies raise exceptions if you try to use them with yield_per, so you don't really have to worry too much.

I believe the only problem with subqueryload is that batched loading of the second query isn't implemented (yet). Nothing would go wrong semantically, but if you're using yield_per, you probably have a really good reason not to want to load all the results at once. So SQLAlchemy politely refuses to go against your wishes.
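
A minimal sketch of that refusal, under assumptions: invented A/B models on in-memory SQLite, and on the versions I have seen the exception raised is sqlalchemy.exc.InvalidRequestError (the exact message varies by release).

    from sqlalchemy import Column, ForeignKey, Integer, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import Session, relationship, subqueryload

    Base = declarative_base()

    class A(Base):
        __tablename__ = "a"
        id = Column(Integer, primary_key=True)
        bs = relationship("B")  # one-to-many collection

    class B(Base):
        __tablename__ = "b"
        id = Column(Integer, primary_key=True)
        a_id = Column(Integer, ForeignKey("a.id"))

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    sess = Session(engine)

    q = sess.query(A).options(subqueryload(A.bs)).yield_per(100)
    try:
        list(q)  # the check fires when results are iterated
    except Exception as exc:
        # e.g. InvalidRequestError about eager loaders that require
        # row buffering; the exact message differs between versions
        print(type(exc).__name__, exc)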

joinedload is a little more subtle. It's only forbidden in the case of a collection, where a primary row might have multiple associated rows. Say your query produces raw results like this, where A and B are primary keys from different tables:

 A | B 
---+---
 1 | 1 
 1 | 2 
 1 | 3 
 1 | 4 
 2 | 5 
 2 | 6 

Now you fetch these with yield_per(3). The problem is that SQLAlchemy can only limit how much it fetches by rows, but it has to return objects. Here, SQLAlchemy only sees the first three rows, so it creates an A object with key 1 and three B children: 1, 2, and 3.

When it loads the next batch, it wants to create a new A object with key 1... ah, but it already has one of those, so no need to create it again. The extra B, 4, is lost. (So, no, even reading joined collections with yield_per is unsafe — chunks of your data might go missing.)

You might say "well, just keep reading rows until you have a full object" — but what if that A has a hundred children? Or a million? SQLAlchemy can't reasonably guarantee that it can do what you asked and produce correct results, so it refuses to try.
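
The lost-row mechanics are easy to model without a database. The following is a toy simulation of the behavior just described, not SQLAlchemy's actual loader code: the raw joined rows from the table above, consumed in batches of three with an identity-map-style "already seen" check.

    rows = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5), (2, 6)]  # (A key, B key)

    children = {}  # A key -> B children collected so far
    done = set()   # A objects already handed back to the caller

    for start in range(0, len(rows), 3):   # yield_per(3): three rows per batch
        for a_key, b_key in rows[start:start + 3]:
            if a_key in done:
                continue                   # "no need to create it again"
            children.setdefault(a_key, []).append(b_key)
        done.update(children)              # this batch is yielded to the caller

    print(children)  # {1: [1, 2, 3], 2: [5, 6]} -- the extra B, 4, is lost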


Remember that the DBAPI is designed so that any database can be used with the same API, even if that database doesn't support all the DBAPI features. Consider that the DBAPI is designed around cursors, but MySQL doesn't actually have cursors! The DBAPI adapters for MySQL have to fake them, instead.

So while cursor.fetchmany(100) will work, you can see from the MySQLdb source code that it doesn't fetch lazily from the server; it fetches everything into one big list, then returns a slice when you call fetchmany.
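
For what it's worth, MySQLdb does also ship an unbuffered cursor class, SSCursor, which leaves the result set on the server and reads rows as you fetch them. A sketch, with placeholder connection parameters and table name:

    import MySQLdb
    import MySQLdb.cursors

    # Placeholders for real connection parameters.
    conn = MySQLdb.connect(
        host="localhost", user="user", passwd="secret", db="mydb",
        cursorclass=MySQLdb.cursors.SSCursor,  # unbuffered cursor class
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM big_table")
    while True:
        batch = cur.fetchmany(100)  # now genuinely incremental
        if not batch:
            break
        for row in batch:
            pass  # process each raw tuple
    cur.close()
    conn.close()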

What psycopg2 supports is true streaming, where the results are remembered persistently on the server, and your Python process only sees a few of them at a time.
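
At the DBAPI level, psycopg2 exposes that streaming through named (server-side) cursors. A sketch, with placeholder DSN and table names:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=user password=secret")
    cur = conn.cursor(name="big_scan")  # a name makes it a server-side cursor
    cur.itersize = 100                  # rows pulled per network round trip
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        pass  # only ~itersize raw rows are buffered client-side at a time
    cur.close()
    conn.close()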

You can still use yield_per with MySQLdb, or any other DBAPI; that's the whole point of the DBAPI's design. You'll have to pay the memory cost for all the raw rows hidden away in the DBAPI (which are tuples, fairly cheap), but you won't also have to pay for all of the ORM objects at the same time.
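
So a plain, read-only yield_per() iteration like the following is reasonable on any DBAPI; only the raw-row buffering behind it differs. A hedged sketch in which User, process(), and the URL are all placeholders:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    engine = create_engine("postgresql+psycopg2://user:secret@localhost/mydb")
    session = Session(engine)

    # On psycopg2 the raw rows stream from the server; on other DBAPIs they
    # are pre-buffered as cheap tuples, but ORM objects are still only
    # materialized 1000 at a time.
    for user in session.query(User).yield_per(1000):  # User: placeholder model
        process(user)  # process(): placeholder for read-only per-row work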
