Why does the Contains() operator degrade Entity Framework's performance so dramatically?

Question

UPDATE 3: According to this announcement, the EF team addressed this in EF6 alpha 2.

UPDATE 2: I've created a suggestion to fix this problem. To vote for it, go here.

Consider a SQL database with one very simple table.

CREATE TABLE Main (Id INT PRIMARY KEY)

I populate the table with 10,000 records.

WITH Numbers AS
(
  SELECT 1 AS Id
  UNION ALL
  SELECT Id + 1 AS Id FROM Numbers WHERE Id < 10000
)
INSERT Main (Id)
SELECT Id FROM Numbers
OPTION (MAXRECURSION 0)

I build an EF model for the table and run the following query in LINQPad (I am using "C# Statements" mode, so LINQPad doesn't create a dump automatically).

var rows = 
  Main
  .ToArray();

Execution time is ~0.07 seconds. Now I add the Contains operator and re-run the query.

var ids = Main.Select(a => a.Id).ToArray();
var rows = 
  Main
  .Where (a => ids.Contains(a.Id))
  .ToArray();

Execution time for this case is 20.14 seconds (288 times slower)!

At first I suspected that the T-SQL emitted for the query was taking longer to execute, so I copied it from LINQPad's SQL pane and pasted it into SQL Server Management Studio.

SET NOCOUNT ON
SET STATISTICS TIME ON
SELECT 
[Extent1].[Id] AS [Id]
FROM [dbo].[Main] AS [Extent1]
WHERE [Extent1].[Id] IN (1,2,3,4,5,6,7,8,...

The result was:

SQL Server Execution Times:
  CPU time = 0 ms,  elapsed time = 88 ms.

Next I suspected LINQPad was causing the problem, but performance is the same whether I run it in LINQPad or in a console application.

So it appears that the problem is somewhere within Entity Framework.

Am I doing something wrong here? This is a time-critical part of my code, so is there something I can do to speed it up?

I am using Entity Framework 4.1 and SQL Server 2008 R2.

UPDATE 1:

In the discussion below there were some questions about whether the delay occurred while EF was building the initial query or while it was parsing the data it received back. To test this, I ran the following code,

var ids = Main.Select(a => a.Id).ToArray();
var rows = 
  (ObjectQuery<MainRow>)
  Main
  .Where (a => ids.Contains(a.Id));
var sql = rows.ToTraceString();

which forces EF to generate the query without executing it against the database. This code took ~20 seconds to run, so it appears that almost all of the time is spent building the initial query.

CompiledQuery to the rescue, then? Not so fast... CompiledQuery requires the parameters passed into the query to be fundamental types (int, string, float, and so on). It won't accept arrays or IEnumerable, so I can't use it for a list of Ids.

Recommended answer

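UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The approach described in this answer is no longer necessary.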

You are right that most of the time is spent translating the query. EF's provider model doesn't currently include an expression that represents an IN clause, so ADO.NET providers can't support IN natively. Instead, the implementation of Enumerable.Contains translates it to a tree of OR expressions, i.e. for something that in C# looks like this:

new []{1, 2, 3, 4}.Contains(i)

... we will generate a DbExpression tree that could be represented like this:

((1 = @i) OR (2 = @i)) OR ((3 = @i) OR (4 = @i))
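As a rough illustration of that balanced shape, here is a minimal sketch that builds the same kind of tree with ordinary LINQ expression trees (System.Linq.Expressions). It is not EF's internal code and does not use DbExpression; the names BalancedOrSketch and ContainsAsOrTree are hypothetical. Splitting the values in half at each step keeps the tree depth at roughly log2(n) instead of n.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Hypothetical helper, for illustration only: this is NOT how EF is implemented.
public static class BalancedOrSketch
{
    // Builds (value == i) for one element, or an OR of two halves for a range.
    private static Expression Build(IList<int> values, int start, int count,
        ParameterExpression i)
    {
        if (count == 1)
        {
            return Expression.Equal(Expression.Constant(values[start]), i);
        }
        int half = count / 2;
        // Recursing on halves keeps the tree depth logarithmic in the element count.
        return Expression.OrElse(
            Build(values, start, half, i),
            Build(values, start + half, count - half, i));
    }

    // Returns a predicate equivalent to values.Contains(i), expressed as balanced ORs.
    public static Expression<Func<int, bool>> ContainsAsOrTree(IList<int> values)
    {
        if (values == null || values.Count == 0)
        {
            throw new ArgumentException("At least one value is required", "values");
        }
        var i = Expression.Parameter(typeof(int), "i");
        return Expression.Lambda<Func<int, bool>>(Build(values, 0, values.Count, i), i);
    }
}
```

Calling BalancedOrSketch.ContainsAsOrTree(new[] { 1, 2, 3, 4 }) yields a lambda whose body groups the comparisons pairwise, as in the representation above. Even in this simplified form, the number of nodes grows with the number of values.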

(The expression trees have to be balanced because, if all the ORs hung off a single long spine, the expression visitor would be much more likely to hit a stack overflow. Yes, we actually did hit that in our testing.)

We later send a tree like this to the ADO.NET provider, which may be able to recognize this pattern and reduce it to an IN clause during SQL generation.

When we added support for Enumerable.Contains in EF4, we thought it was desirable to do it without having to introduce support for IN expressions in the provider model, and honestly, 10,000 is many more elements than we anticipated customers would pass to Enumerable.Contains. That said, I understand that this is an annoyance and that the manipulation of expression trees makes things too expensive in your particular scenario.

I discussed this with one of our developers, and we believe that in the future we could change the implementation by adding first-class support for IN. I will make sure this is added to our backlog, but I cannot promise when it will be done, given the many other improvements we would like to make.

To the workarounds already suggested in the thread I would add the following:

Consider creating a method that balances the number of database roundtrips against the number of elements you pass to Contains. For instance, in my own testing I observed that computing and executing a query with 100 elements against a local instance of SQL Server takes 1/60 of a second. If you can write your query in such a way that executing 100 queries with 100 different sets of ids gives you a result equivalent to the query with 10,000 elements, then you can get the results in approximately 1.67 seconds (100 × 1/60 s) instead of 18 seconds.

Different chunk sizes may work better depending on the query and the latency of the database connection. For certain queries, e.g. if the sequence passed has duplicates or if Enumerable.Contains is used in a nested condition, you may get duplicate elements in the results.

Here is a code snippet. (Sorry if the code used to slice the input into chunks looks a little too complex. There are simpler ways to achieve the same thing, but I was trying to come up with a pattern that preserves streaming for the sequence, and I couldn't find anything like it in LINQ, so I probably overdid that part :) )

Usage:

var list = context.GetMainItems(ids).ToList();

Method for context or repository:

public partial class ContainsTestEntities
{
    public IEnumerable<Main> GetMainItems(IEnumerable<int> ids, int chunkSize = 100)
    {
        foreach (var chunk in ids.Chunk(chunkSize))
        {
            var q = this.MainItems.Where(a => chunk.Contains(a.Id));
            foreach (var item in q)
            {
                yield return item;
            }
        }
    }
}
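Since the best chunkSize depends on the query and on connection latency, one option is to measure a few candidates. The following is a minimal sketch of such a measurement, assuming a hypothetical helper named ChunkSizeTiming that is not part of the original answer; it only times whatever action it is given, and single runs like this can be noisy.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class ChunkSizeTiming
{
    // Runs the supplied action once per candidate chunk size and records
    // the elapsed wall-clock time for each run.
    public static Dictionary<int, TimeSpan> Measure(
        IEnumerable<int> candidateSizes, Action<int> runWithChunkSize)
    {
        var timings = new Dictionary<int, TimeSpan>();
        foreach (var size in candidateSizes)
        {
            var watch = Stopwatch.StartNew();
            runWithChunkSize(size);
            watch.Stop();
            timings[size] = watch.Elapsed;
        }
        return timings;
    }
}
```

With the method above, a call could look like ChunkSizeTiming.Measure(new[] { 50, 100, 500 }, size => context.GetMainItems(ids, size).ToList()), where ToList() forces every chunked query to execute inside the timed region.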

Extension methods for slicing enumerable sequences:

public static class EnumerableSlicing
{

    private class Status
    {
        public bool EndOfSequence;
    }

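    // Yields up to 'count' items from the shared enumerator and sets
    // status.EndOfSequence once the underlying sequence is exhausted.
    private static IEnumerable<T> TakeOnEnumerator<T>(IEnumerator<T> enumerator, int count, 
        Status status)
    {
        // Post-decrement: the loop body runs at most 'count' times, including when count == 1.
        while (count-- > 0 && (enumerator.MoveNext() || !(status.EndOfSequence = true)))
        {
            yield return enumerator.Current;
        }
    }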

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> items, int chunkSize)
    {
        if (chunkSize < 1)
        {
            throw new ArgumentException("Chunks should not be smaller than 1 element");
        }
        var status = new Status { EndOfSequence = false };
        using (var enumerator = items.GetEnumerator())
        {
            while (!status.EndOfSequence)
            {
                yield return TakeOnEnumerator(enumerator, chunkSize, status);
            }
        }
    }
}
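For comparison, one of the simpler approaches alluded to above is to buffer each chunk into an array. This is a minimal sketch, not the answer's method: the name ChunkBuffered is hypothetical, and it trades the streaming behavior of the code above for simplicity by holding up to chunkSize elements in memory at a time. Each chunk is a materialized array, so it can be enumerated more than once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical alternative, for illustration only.
public static class BufferedSlicing
{
    // Splits the input into arrays of at most chunkSize elements.
    // Because this is an iterator, the argument check runs on first enumeration.
    public static IEnumerable<T[]> ChunkBuffered<T>(this IEnumerable<T> items, int chunkSize)
    {
        if (chunkSize < 1)
        {
            throw new ArgumentOutOfRangeException("chunkSize",
                "Chunks should not be smaller than 1 element");
        }
        var buffer = new List<T>(chunkSize);
        foreach (var item in items)
        {
            buffer.Add(item);
            if (buffer.Count == chunkSize)
            {
                yield return buffer.ToArray();
                buffer.Clear();
            }
        }
        // Emit the final, possibly shorter, chunk; never emit an empty one.
        if (buffer.Count > 0)
        {
            yield return buffer.ToArray();
        }
    }
}
```

If the id list may contain duplicates, removing them first (for example with Distinct()) before chunking avoids sending the same id in more than one query.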

Hope this helps!
