How to best compare two sets of multidimensional data?


Problem description


Sorry, this is not a direct coding question. I'm happy to take this offline with someone, or to be pointed to the "best technique". This is not a class assignment.
I'm writing a C# console application.
I have two sets of data that I need to compare. The sets hold the same type of data (string) but have different lengths, as the two Input Sets below show.

   Input Set 1                                Input Set 2
1  Apples       Small  Medium  Large       1  Apricot     Small
2  Bananas      Small  Medium  Large       2  Blackberry         Medium  Large
3  Blueberries  Small                      3  Cherries    Small  Medium  Large
4  Cherries     Small  Medium  Large       4  Grapes      Small          Large
5  Grapes       Small          Large       5  Oranges     Small  Medium
6  Pears               Medium              6  Pears       Small  Medium
7  Strawberry   Small  Medium  Large       7  Strawberry  Small  Medium  Large
8  Watermelon                  Large






I can't figure out the most efficient way to compare these two sets. Equal data may exist but in different rows, a data row may exist in just one Input Set, or the actual values of the rows could be different. Is the best technique a looping List Array compare, a 2D Array compare, or something else? I need to compare in a way that produces two Output Sets with the same number of rows that show the differences as follows. I am displaying the data this way because it will be presented in an Excel sheet:

Diff?      Output Set 1                             Output Set 2
yes    1                                        1   Apricot     Small
yes    2   Apples       Small  Medium  Large    2
yes    3   Bananas      Small  Medium  Large    3
yes    4                                        4   Blackberry         Medium  Large
yes    5   Blueberries  Small                   5
no     6   Cherries     Small  Medium  Large    6   Cherries    Small  Medium  Large
no     7   Grapes       Small          Large    7   Grapes      Small          Large
yes    8                                        8   Oranges     Small  Medium
yes    9   Pears               Medium           9   Pears       Small  Medium
no     10  Strawberry   Small  Medium  Large    10  Strawberry  Small  Medium  Large
yes    11  Watermelon                  Large    11

Recommended answer

My interpretation of your question is that you don't really want to see a complete code solution written for you. However, I'll post some code fragments here that I think will help you along the way.

My suggestion for a strategy here is:

1. As a first step, write code to analyze the two datasets to determine duplicate categories. By definition, non-duplicate categories are going to be part of your output of "differences."

Then focus on analyzing the duplicates to determine if their lists of sizes are the same. In this case the duplicates are:

Cherries Grapes Pears Strawberry

To tease out the non-duplicates:

1. Process the strings in the datasets to eliminate unnecessary white-space:
// requires: using System.Text.RegularExpressions;

// code by Jon Skeet to remove multiple spaces from string
// from:  http://stackoverflow.com/a/1280227/133321
private static readonly Regex MultipleSpaces = new Regex(@" {2,}", RegexOptions.Compiled);

private static string NormalizeWithRegex(string input)
{
    // Skeet's code modified here by BW:
    // second parameter changed to empty string.
    // Note: this deletes each run of two or more spaces outright; pass " "
    // instead if a run of spaces is what separates two values.
    return MultipleSpaces.Replace(input, "");
}
// end code by Jon Skeet

After you clean up the datasets, create the Lists that will be used to look for non-duplicates. Here we'll assume your cleaned-up datasets are in the strings ds1 and ds2:

// requires: using System; using System.Collections.Generic; using System.Linq;

// the two datasets, assumed to be loaded elsewhere
private string ds1, ds2;

private List<string> ds1List, ds2List, duplicateCategoryList, nonDuplicateCategoryList, l1SubStrings, l2SubStrings;

// row separator, and value separator within a row
private string[] splitCh1 = new string[] { "\r\n" };
private char[] splitCh2 = new char[] { ' ' };

private void MassageData()
{
    ds1 = NormalizeWithRegex(ds1);
    ds2 = NormalizeWithRegex(ds2);

    // one list entry per non-empty row
    ds1List = ds1.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList();
    ds2List = ds2.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList();

    // sorting may or may not pay-off here ?
    ds1List.Sort();
    ds2List.Sort();
}
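
As a possible follow-up to the clean-up step, each row could be parsed into a category name plus its list of sizes. This is only a sketch and not part of the original answer: `RowParser` and `ParseRows` are hypothetical names, and it assumes that cells are separated by tabs or spaces, that the first token on each row is the category, and that any leading row number has already been removed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowParser
{
    // Hypothetical helper: maps each category name to the sizes listed on its row.
    public static Dictionary<string, List<string>> ParseRows(string dataset)
    {
        var rows = new Dictionary<string, List<string>>();
        string[] lines = dataset.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            // A null separator array makes Split break on any white-space (tabs included).
            string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length == 0) continue; // the line held only white-space

            // Assumption: first token is the category, the rest are sizes.
            rows[tokens[0]] = tokens.Skip(1).ToList();
        }
        return rows;
    }
}
```

Because the sizes are named (Small, Medium, Large), losing the empty cells during the split does not lose information about which sizes a row has.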

To get to the gist of how to tease out the duplicate categories:

1. Loop (for-loop) over the longer of the two Lists, ds1List and ds2List.

2. Use the second split (splitCh2) on the string at the for-loop index in the longer of the two lists.

3. Pull out the first element [0] of the split result (the category name), and see if that string appears in the dataset (string) that is the source of your shorter list:

a. If it appears, you have a candidate for a duplicate match.

4. You will then have to consider that there may be a category in the shorter list that is not in the longer list. You'll need to write code to check for this, and the check is only needed while the for-loop index of the longer list is less than the Count of the shorter list.

5. Once you have a list of candidate duplicates, you can compare their associated category values (sizes) and remove from the duplicates list those whose lists of sizes do not match exactly.

Hope this gets you started; the key idea is to minimize the work you do in list processing, splitting strings, etc.
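
One way the steps above might be sketched, swapping the index-based loop for dictionary lookups, is shown below. It is an illustration under assumptions, not the answer's own code: `DiffRow`, `SetDiff` and `Align` are hypothetical names, and each input set is assumed to be already parsed into a dictionary from category name to its list of sizes. Rows come out in ordinal order of category name, which differs slightly from the sample output in the question (there Apricot is listed before Apples).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type: one line of the side-by-side output.
public class DiffRow
{
    public string Category;
    public bool Differs;
    public List<string> Left;   // sizes from set 1, or null if the category is absent
    public List<string> Right;  // sizes from set 2, or null if the category is absent
}

public static class SetDiff
{
    public static List<DiffRow> Align(
        Dictionary<string, List<string>> set1,
        Dictionary<string, List<string>> set2)
    {
        var rows = new List<DiffRow>();

        // Every category that occurs in either set, in ordinal sort order.
        foreach (string category in set1.Keys.Union(set2.Keys)
                                             .OrderBy(c => c, StringComparer.Ordinal))
        {
            List<string> left, right;
            set1.TryGetValue(category, out left);   // left stays null when absent
            set2.TryGetValue(category, out right);  // right stays null when absent

            // "Same" only when both sets have the category with the same sizes.
            // SetEquals ignores order (and repeated values).
            bool same = left != null && right != null
                        && new HashSet<string>(left).SetEquals(right);

            rows.Add(new DiffRow { Category = category, Differs = !same, Left = left, Right = right });
        }
        return rows;
    }
}
```

Each `DiffRow` then corresponds to one spreadsheet row: a null `Left` or `Right` becomes a blank half-row, and `Differs` fills the "Diff?" column.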

... edit ... in response to the OP's query about what happens if two identical categories have different numbers of size parameters:

And, in the category 'Pears' in your data, you have a different number of size parameters in the two datasets.

That's an interesting "challenge" because List<T> itself does not define an equality comparison for the contents of two List<T> objects. Using == or .Equals will compare references to the objects.

.NET 3.5 offers you the LINQ SequenceEqual extension method:

http://msdn.microsoft.com/en-us/library/bb348567(v=vs.100).aspx

It will compare two Lists for content equality, but it is order-dependent: that means you'd have to sort the two Lists before comparing them.

Try this:

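// requires: using System.Collections.Generic; using System.Linq;
List<string> l1 = new List<string> { "Grape", "Small", "Large" };
List<string> l2 = new List<string> { "Grape", "Large", "Small" };
List<string> l3 = new List<string> { "Grape", "Small" };

l1.Sort();
l2.Sort();
l3.Sort();

bool case1 = l1.SequenceEqual(l2); // true: same items once sorted
bool case2 = l1.SequenceEqual(l3); // false: l3 has fewer items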

Set a break-point and examine the boolean results.

I'd write a function that took two List<string> and first tested their lengths (the .Count property) for equality; only if the lengths were equal would I sort them and use SequenceEqual.
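
A minimal sketch of such a function might look like the following. The names `ListCompare` and `SameItems` are hypothetical, and sorting copies rather than the lists themselves is a design choice made here (it leaves the caller's lists in their original order), not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListCompare
{
    // Hypothetical helper: true when both lists hold the same items, in any order.
    public static bool SameItems(List<string> a, List<string> b)
    {
        if (a == null || b == null) return a == b;   // equal only if both are null
        if (a.Count != b.Count) return false;        // cheap check before sorting

        // OrderBy returns sorted copies, so the input lists are not modified.
        var sortedA = a.OrderBy(s => s, StringComparer.Ordinal);
        var sortedB = b.OrderBy(s => s, StringComparer.Ordinal);
        return sortedA.SequenceEqual(sortedB);
    }
}
```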

... end edit ...


This depends on what you use to store them. If you're using arrays, you can use their indices to check them, for example:

// the rows differ as soon as any one pair of cells differs
if (arr[1] != arr2[1] || arr[2] != arr2[2]) {
  // not equal
  return false;
} else {
  // equal
  return true;
}





.. but remember that arrays throw an IndexOutOfRangeException if you try to read an index that is not present in the array. For example, in many rows of your data there is an index 3 in the first array but no index 3 in the second.

The same applies to List<T> in the .NET Framework (it throws an ArgumentOutOfRangeException), so I guess there is no efficient way of doing this unless you know all the dimensions of the arrays. The two arrays differ in size even in your own question.
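
If the row lengths are not known in advance, one option is to guard every index against both lengths before reading it. The sketch below is an assumption-laden illustration rather than the answerer's method: `ArrayCompare` and `RowsEqual` are hypothetical names, and it treats a missing cell as equal to an empty string, which may or may not suit your data.

```csharp
using System;

public static class ArrayCompare
{
    // Hypothetical helper: compares two rows cell by cell without reading past either end.
    public static bool RowsEqual(string[] row1, string[] row2)
    {
        int longest = Math.Max(row1.Length, row2.Length);
        for (int i = 0; i < longest; i++)
        {
            // Assumption: a missing cell counts as an empty string.
            string cell1 = i < row1.Length ? row1[i] : "";
            string cell2 = i < row2.Length ? row2[i] : "";
            if (!string.Equals(cell1, cell2, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```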

You can work out the logic on a piece of paper first. :-)

Good luck.

