需要帮助来了解使用带有HashSet< T>的LINQ Join的意外行为. [英] Need help understanding unexpected behavior using LINQ Join with HashSet<T>

查看:104
本文介绍了需要帮助来了解使用带有HashSet< T>的LINQ Join的意外行为.的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用C#HastSet和我不理解的LINQ的Join方法遇到一些奇怪的行为.我简化了我的工作,以帮助专注于自己所看到的行为.

I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.

我有以下内容:

 private HashSet<MyClass> _mySet; // module level

 IEnumerable<ISearchKey> searchKeys; // parameter.
 // Partial key searches are allowed.

 private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
 // Compares instances of MyClass and ISearchKey to determine 
 // if they match.

为此

  1. searchKeys和_mySet之间存在一对多的关系.
  2. MyClass实现接口IPartialKey和ICoreKey.
  3. ISearchKey继承自IPartialKey和ICoreKey.
  4. MyClass和ISearchKey实例均覆盖GetHashCode方法.
  5. MyClass的哈希码值基于其完整键值,即 包括其ICoreKey和IPartialKey值以及其他字段.
  6. MyClass使用的完整键不是唯一的.两个不同的MyClass实例可以具有相同的哈希码.
  7. ISearchKey的哈希码值仅基于其ICoreKey和 IPartialKey值.即ISearchKey哈希码可能与 匹配的MyClass实例的哈希码. (旁注:如果我先 遇到问题,则ISearchKey的IPartialKey值匹配 MyClass全键,因此GetHashCode方法将返回 ISearchKey和MyClass的值相同.我包括了 额外的复杂性以更好地说明基础逻辑 我在做什么.)
  8. _coreKeyComparer.GetHashCode方法返回 使用以下方法匹配ISearchKey和MyClass实例的值相同 仅其ICoreKey值.
  9. _coreKeyComparer.Equals方法强制转换 将参数分别传递给MyClass和ISearchKey并返回 如果它们的IPartialKey值匹配,则为true. (边注: _coreKeyComparer已经过严格测试,可以正常工作.)
  1. There is a 1-to-many relationship between searchKeys and _mySet.
  2. MyClass implements interface IPartialKey and ICoreKey.
  3. ISearchKey inherits from IPartialKey and ICoreKey.
  4. MyClass and ISearchKey instance both override the GetHashCode method.
  5. MyClass's hash code value is based on its full key value, which includes its ICoreKey and IPartialKey values plus other fields.
  6. The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
  7. ISearchKey's hash code value is based only on its ICoreKey and IPartialKey values. i.e. The ISearchKey hash code might differ from the hash code for a matching MyClass instance. (Side note: in the case where I first encountered the problem, the ISearchKey's IPartialKey values match the MyClass full key, so the GetHashCode methods would return the same value for both ISearchKey and MyClass. I included the additional complexity to better illustrate the underlying logic on what I am doing.)
  8. The _coreKeyComparer.GetHashCode method returns the same value for matching instances of ISearchKey and MyClass using only their ICoreKey values.
  9. The _coreKeyComparer.Equals method cast the parameters to MyClass and ISearchKey respectively and returns true if their IPartialKey values match. (Side note: _coreKeyComparer has been HEAVILY tested and works correctly.)

我希望两个集合之间的连接会产生类似以下的结果:

{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....

即同一个ISearchKey实例将发生多次,对于它所连接的每个匹配的MyClass实例,都将发生一次.

i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.

        var matchedPairs = searchKeys
          .Join(
            _mySet,
            searchKey => searchKey,
            myClass => myClass,
            (searchKey, myClass) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

每个searchKeyClass实例只有一个MyClass实例.也就是说,matchedPairs集合如下所示:

I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:

    {searchKey_a, myClass_a1},
    {searchKey_b, myClass_b1},
    {searchKey_c, myClass_c1},
etc....

但是,如果我取消连接,请从_mySet转到searchKeys:

   var matchedPairs = _mySet
          .Join(
            searchKeys,
            myClass => myClass,
            searchKey => searchKey,
            (myClass, searchKey) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

我得到了正确的matchPairs集合. _mySet中的所有匹配记录都将与它们匹配的searchKey一起返回.

I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.

我检查了文档并检查了多个示例,但看不到searchKeys-to__mySet连接给出错误答案的任何原因,而_mySet-to-searchKeys提供正确/不同的答案.

I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.

(附带说明:我还尝试了从searchKeys到_myset的GroupJoin并获得相似的结果.即,每个searchKeyClass实例最多从_mySet找到一个结果.)

(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)

或者我不明白Join方法应该如何工作,或者Join在HashSet中的工作方式与List或其他类型的集合不同.

Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.

如果是前者,我需要澄清一下,这样以后就不会在使用Join时出错了.

If the former, I need clarification so I don't make mistakes using Join in the future.

如果是后者,那么这种不同的行为是.NET错误,还是HashSet的正确行为?

If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?

假设行为正确,我将不胜感激有人解释了此(意外的)Join/HashSet行为背后的基本逻辑.

Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

请注意,我已经修复了我的代码,可以返回正确的结果,我只是想了解为什么最初得到的结果不正确.

Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.

推荐答案

您的错误几乎可以肯定是问题中未显示的大量代码中的某个地方.我的建议是将您的程序简化为可能会产生错误的最简单程序.这样一来,您要么找到您的错误,要么生成一个非常简单的程序,可以将所有问题发布到问题中,然后我们可以对其进行分析.

Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.

假设行为正确,那么不胜感激的人会解释这种(意外的)Join/HashSet行为背后的基本逻辑.

Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

由于我不知道意外行为是什么,所以无法说出发生原因.但是,我可以准确地说出Join的作用,也许会有所帮助.

Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join does, and perhaps that will help.

Join需要以下内容:

  • 外部"集合-Join的接收者.
  • 内部"集合-扩展方法的第一个参数
  • 两个密钥提取器,可从外部和内部集合中提取密钥
  • 一个投影,它采用键匹配的内部和外部集合的成员,并为该匹配产生结果
  • 比较两个键是否相等的比较操作.
  • An "outer" collection -- the receiver of the Join.
  • An "inner" collection -- the first argument of the extension method
  • Two key extractors, that extract a key from the outer and inner collections
  • A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
  • A comparison operation that compares two keys for equality.

这是Join的工作方式. (这在逻辑上是 ;实际的实现细节有所优化.)

Here's how Join works. (This is logically what happens; the actual implementation details are somewhat optimized.)

首先,我们仅对一次内部"集合进行迭代.

First, we iterate over the "inner" collection, exactly once.

对于内部集合的每个元素,我们提取其键,然后形成一个多字典,将字典中的键映射到内部集合中由键选择器生成该键的所有元素的集合.使用提供的比较对键进行相等性比较.

For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.

因此,我们现在可以从TKeyIEnumerable<TInner>进行查找.

Thus, we now have a lookup from TKey to IEnumerable<TInner>.

第二,我们仅对一次外部"集合进行迭代.

Second, we iterate over the "outer" collection, exactly once.

对于外部集合的每个元素,我们提取其键,然后再次使用提供的键比较在多字典中对该键进行查找.

For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.

然后,我们对内部集合的每个匹配元素进行嵌套循环,在外部/内部对上调用投影,然后得出结果.

We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.

也就是说,Join的行为类似于以下伪代码实现:

That is, Join behaves like this pseudocode implementation:

static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
  (IEnumerable<TOuter> outer, 
  IEnumerable<TInner> inner, 
  Func<TOuter, TKey> outerKeySelector, 
  Func<TInner, TKey> innerKeySelector, 
  Func<TOuter, TInner, TResult> resultSelector, 
  IEqualityComparer<TKey> comparer) 
{
  var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
  foreach(TInner innerItem in inner)
  {
    TKey innerKey = innerKeySelector(innerItem);
    lookup.Add(innerItem, innerKey);
  }
  foreach (TOuter outerItem in outer) 
  {
    TKey outerKey = outerKeySelector(outerItem);
    foreach(TInner innerItem in lookup[outerKey])
    {
      TResult result = resultSelector(outerItem, innerItem);
      yield return result;
    }
  }
}


一些建议:


Some suggestions:

  • 替换所有的GetHashCode实现,以便它们返回0,并运行所有测试.他们应该通过! GetHashCode返回零始终是合法的.这样做几乎肯定会破坏您的性能,但一定不能破坏您的正确性.如果您处于要求特定的非零值GetHashCode的情况下,那么您将遇到一个错误.
  • 测试您的密钥比较,以确保它是有效的比较.它必须遵守三个相等规则:(1)自反性:事物始终等于自身;(2)对称性:AB的相等性必须与BA相同,(3 )可传递性:如果A等于B并且B等于C,则A必须等于C.如果不满足这些规则,则Join可能会表现异常.
  • SelectManyWhere替换Join.那就是:

  • Replace all your GetHashCode implementations so that they return 0, and run all your tests. They should pass! It is always legal to return zero from GetHashCode. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value of GetHashCode, then you have a bug.
  • Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of A and B must be the same as B and A, (3) transitivity: if A equals B and B equals C then A must equal C. If these rules are not met then Join can behave weirdly.
  • Replace your Join with a SelectMany and a Where. That is:

from o in outer join i in inner on getOuterKey(o) equals getInnerKey(i) select getResult(o, i)

from o in outer join i in inner on getOuterKey(o) equals getInnerKey(i) select getResult(o, i)

可以改写为

from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)

该查询比连接版本慢 ,但在逻辑上 完全相同.再次,运行测试.你得到相同的结果吗?如果不是,您的逻辑中存在一个错误.

That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.

同样,我不能十分强调您的态度,即给定哈希表时联接可能被破坏"的态度使您无法找到错误.加入没有中断.该代码十年来没有发生变化,它非常简单,并且在我们第一次编写它时是正确的.更有可能的是,您复杂而神秘的键比较逻辑在某处被破坏了.

Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.

这篇关于需要帮助来了解使用带有HashSet&lt; T&gt;的LINQ Join的意外行为.的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆