Merging overlapping datasets


Problem description




Given multiple datasets that may or may not overlap on one or more columns, I am looking to dynamically merge the datasets together.

Is there a library or code snippet that will merge datasets in this manner? How about just doing it using a single column as a key?

EXAMPLE: Merging two datasets, using multiple columns as keys (BookTitle, Author)

Input, Dataset 1

BookTitle, Author, Publisher
title1, author1, publisher1
title2, author2, publisher2
title3, author3, publisher3

Input, Dataset 2

BookTitle, Author, NumPages
title4, author4, numPages4
title7, author7, numPages7
title5, author5, numPages5
title3, author33, numPages3
title2, author2, numPages2

Output, Munged Dataset

BookTitle, Author, Publisher, NumPages
title1, author1, publisher1, _null_
title2, author2, publisher2, numPages2
title3, author3, publisher3, _null_
title4, author4, _null_, numPages4
title5, author5, _null_, numPages5
title7, author7, _null_, numPages7
title3, author33, _null_, numPages3

I have done some research and nothing immediately useful came up. Most of what I found is about one-time merges of JSON objects that share the same structure (i.e., appending data, as opposed to merging distinct datasets).

I am looking for a Java/JavaScript solution, using JSON/XML/CSV data (in order of preference), but will accept other languages assuming that those algorithms can be ported over.

I will also consider accepting examples where this is being done on a single column only.

Solution

Well, I wouldn't really look for a library for something so simple. Instead try building a solution yourself.

You could first JSON.parse() any strings to convert them into objects. Then, you could pass both these objects into a function that looks something like this.

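function mergeSets(first, second) {
    // Note: result is the same array as first, so first is modified in place.
    var result = first;
    second.forEach(function (item, index, array) {
        // Index of the matching row in result, or -1 if there is none.
        var resultIndex = contains(result, item);
        if (resultIndex === -1) {
            result.push(item);
        } else {
            result[resultIndex].numPages = item.numPages;
        }
    });
    return result;
}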

Notice that mergeSets() calls contains() which is essentially as follows.

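// Returns the index of the last row in set whose bookTitle and author
// both match object, or -1 if no row matches.
function contains(set, object) {
    var solution = -1;
    set.forEach(function (item, index, array) {
        if (item.bookTitle === object.bookTitle && item.author === object.author) {
            solution = index;
        }
    });
    return solution;
}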

It really isn't too hard, as you can see. Sorry for some of the variable names; this was written in a hurry. Also, in your example of the resulting set you show the fields that aren't available as null. I don't think that is appropriate, since null usually indicates an empty reference, so I have left those fields out instead. Accessing them on objects in the array that don't have them results in undefined, which makes perfect sense.
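
To illustrate the point about missing fields, here is a small sketch. The `withNulls` helper and its `columns` parameter are hypothetical names, not part of the code above. It shows that an absent field reads as `undefined`, and how one might still fill the gaps with `null` if a fixed set of columns is wanted (for CSV output, say).

```javascript
// Hypothetical merged rows: the first has no numPages, the second has no publisher.
var merged = [
    { bookTitle: "title1", author: "author1", publisher: "publisher1" },
    { bookTitle: "title4", author: "author4", numPages: "numPages4" }
];

// A field that was never set simply reads as undefined.
console.log(merged[0].numPages); // undefined

// Hypothetical helper: return copies of the rows in which every listed column
// is present, using null where a row has no value for that column.
function withNulls(rows, columns) {
    return rows.map(function (row) {
        var filled = {};
        columns.forEach(function (column) {
            filled[column] = row.hasOwnProperty(column) ? row[column] : null;
        });
        return filled;
    });
}

var table = withNulls(merged, ["bookTitle", "author", "publisher", "numPages"]);
console.log(table[0].numPages);  // null
console.log(table[1].publisher); // null
```

Keeping the merge itself free of placeholder values and filling them in only at output time leaves the choice of placeholder to whoever writes the data out.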

Also, the following are the limitations of the code in the fiddle. You may edit it to ease out these limitations and make it more robust.

  1. It is tied to the data format you have mentioned in your question. To make it work for arbitrary sets, you can loop over the properties with for-in, check for the existence of each one using hasOwnProperty() (that is, Object.prototype.hasOwnProperty()), and add the missing ones, resulting in a merge.

  2. It doesn't handle duplicates within sets in any way.

http://jsfiddle.net/x5Q5g/
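
Both limitations could be eased along the following lines. This is only a sketch under stated assumptions: `keyOf` and `mergeByKeys` are hypothetical names, the key columns are passed as an array, and a `Map` (ES2015) is used as a lookup table. Rows that share the same key values, whether they come from different sets or from the same set, are folded into one row, and the inputs are not modified.

```javascript
// Hypothetical helper: build one lookup string from the key fields of a row.
// JSON.stringify on an array keeps the values unambiguous
// (["a,b", "c"] and ["a", "b,c"] produce different strings).
function keyOf(row, keys) {
    return JSON.stringify(keys.map(function (key) { return row[key]; }));
}

// Hypothetical variant of mergeSets(): takes an array of datasets and an
// array of key names, and returns a new array without modifying the inputs.
function mergeByKeys(datasets, keys) {
    var byKey = new Map(); // a Map iterates in insertion order
    datasets.forEach(function (rows) {
        rows.forEach(function (row) {
            var k = keyOf(row, keys);
            if (!byKey.has(k)) {
                byKey.set(k, {});
            }
            var target = byKey.get(k);
            Object.keys(row).forEach(function (property) {
                // The first value seen for a property wins.
                if (!target.hasOwnProperty(property)) {
                    target[property] = row[property];
                }
            });
        });
    });
    return Array.from(byKey.values());
}

var first = [
    { bookTitle: "title2", author: "author2", publisher: "publisher2" },
    { bookTitle: "title2", author: "author2", publisher: "publisher2" } // duplicate row
];
var second = [
    { bookTitle: "title2", author: "author2", numPages: "numPages2" },
    { bookTitle: "title3", author: "author33", numPages: "numPages3" }
];

var result = mergeByKeys([first, second], ["bookTitle", "author"]);
console.log(result.length); // 2
console.log(result[0]);
```

Because each row is looked up by key instead of by scanning the result array, this also avoids the nested loop, which may matter for large datasets.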

Edit: Oh! And by the way, the code is JavaScript and the data format could be JSON provided you use JSON.parse() and JSON.stringify().
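
As a rough illustration of that round trip, the sketch below parses two JSON strings, merges them, and serializes the result again. The sample strings are made up, and `mergeParsed` is a hypothetical, condensed stand-in for mergeSets() so that the snippet runs on its own.

```javascript
// JSON text as it might arrive from a file or an HTTP response (sample data).
var firstJson = '[{"bookTitle":"title2","author":"author2","publisher":"publisher2"}]';
var secondJson = '[{"bookTitle":"title2","author":"author2","numPages":"numPages2"},' +
                 '{"bookTitle":"title4","author":"author4","numPages":"numPages4"}]';

// Condensed stand-in for mergeSets(), keyed on bookTitle and author.
function mergeParsed(first, second) {
    var result = first.slice();
    second.forEach(function (item) {
        // First row with the same bookTitle and author, or undefined.
        var match = result.filter(function (row) {
            return row.bookTitle === item.bookTitle && row.author === item.author;
        })[0];
        if (match === undefined) {
            result.push(item);
        } else {
            match.numPages = item.numPages;
        }
    });
    return result;
}

// Parse both inputs, merge the resulting arrays, and turn them back into JSON.
var mergedJson = JSON.stringify(
    mergeParsed(JSON.parse(firstJson), JSON.parse(secondJson))
);
console.log(mergedJson);
```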

Edit: The following updates remove the first limitation mentioned above. Notice that you now need to pass in the key to compare on explicitly.

function contains(set, object, key) {
    var solution = -1;
    set.forEach(function (item, index, array) {
        if (item[key] === object[key]) {
            solution = index;
        }
    });
    return solution;
}

function mergeSets(first, second, key) {
    var result = first;
    second.forEach(function (item, index, array) {
        var resultIndex = contains(result, item, key);
        if (resultIndex === -1) {
            result.push(item);
        } else {
            // Copy over every property the matching row does not have yet.
            for (var property in item) {
                if (item.hasOwnProperty(property)) {
                    if (!result[resultIndex].hasOwnProperty(property)) {
                        // Bracket notation, so that the value of property
                        // names the field being copied.
                        result[resultIndex][property] = item[property];
                    }
                }
            }
        }
    });
    return result;
}

var solution = mergeSets(firstSet, secondSet, "bookTitle");
console.log(solution);

http://jsfiddle.net/s6HqL/

One final update: Here is how you can make it accept any number of keys. I had forgotten that you require multiple key support. Sorry!

You need to change the following.

function contains(set, object, keys) {
    var solution = -1;
    set.forEach(function (item, index, array) {
        // A row matches only if it agrees with object on every key.
        var allKeys = keys.every(function (key) {
            return item[key] === object[key];
        });
        if (allKeys) {
            solution = index;
        }
    });
    return solution;
}

function mergeSets(first, second) {
    var result = first;
    var keys = Array.prototype.slice.call(arguments, 2);
    second.forEach(function (item, index, array) {
        var resultIndex = contains(result, item, keys);
        if (resultIndex === -1) {
            result.push(item);
        } else {
            for (var property in item) {
                if (item.hasOwnProperty(property)) {
                    if (!result[resultIndex].hasOwnProperty(property)) {
                        var target = result[resultIndex];
                        target[property] = item[property];
                    }
                }
            }
        }
    });
    return result;
}

var solution = mergeSets(firstSet, secondSet, "bookTitle", "author");
console.log(solution);

http://jsfiddle.net/s6HqL/3/

This last fiddle and the code above it are complete, with no hard-coded references to particular fields. The code is generic and works with any number of keys passed as arguments.
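
For reference, here is a runnable sketch that applies the same idea to the sample data from the question. It is a compact restatement, not the fiddle's code: unlike the original it copies the rows instead of modifying `first`, and it takes the first matching row where contains() keeps the last one. The camelCase field names (`bookTitle`, `numPages`, and so on) are assumed to be how the CSV columns were mapped to object properties.

```javascript
// Compact restatement of the final contains()/mergeSets() pair.
function mergeSets(first, second) {
    var keys = Array.prototype.slice.call(arguments, 2);
    // Shallow-copy every row so the caller's objects stay untouched.
    var result = first.map(function (row) { return Object.assign({}, row); });
    second.forEach(function (item) {
        // First row that agrees with item on every key, or undefined.
        var match = result.filter(function (row) {
            return keys.every(function (key) { return row[key] === item[key]; });
        })[0];
        if (match === undefined) {
            result.push(Object.assign({}, item));
        } else {
            Object.keys(item).forEach(function (property) {
                if (!match.hasOwnProperty(property)) {
                    match[property] = item[property];
                }
            });
        }
    });
    return result;
}

var firstSet = [
    { bookTitle: "title1", author: "author1", publisher: "publisher1" },
    { bookTitle: "title2", author: "author2", publisher: "publisher2" },
    { bookTitle: "title3", author: "author3", publisher: "publisher3" }
];
var secondSet = [
    { bookTitle: "title4", author: "author4", numPages: "numPages4" },
    { bookTitle: "title7", author: "author7", numPages: "numPages7" },
    { bookTitle: "title5", author: "author5", numPages: "numPages5" },
    { bookTitle: "title3", author: "author33", numPages: "numPages3" },
    { bookTitle: "title2", author: "author2", numPages: "numPages2" }
];

// Two keys: title3/author3 and title3/author33 stay separate rows.
var byBoth = mergeSets(firstSet, secondSet, "bookTitle", "author");
console.log(byBoth.length); // 7

// One key: the title3/author33 row is folded into title3/author3.
var byTitle = mergeSets(firstSet, secondSet, "bookTitle");
console.log(byTitle.length); // 6
console.log(byTitle[2]);
```

The single-key call shows why the question asks for multiple keys: merging on bookTitle alone attaches numPages3 to the title3/author3 row and drops author33.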
