Identifying Duplicates in CouchDB

Question

I'm new to CouchDB and document-oriented databases in general.

I've been playing around with CouchDB, and was able to get familiar with creating documents (with Perl) and using the Map/Reduce functions in Futon to query the data and create views.

One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.

For example, if I have the following documents:

{
  "_id": "123",
  "name": "carl",
  "timestamp": "2012-01-27T17:06:03Z"
}

{
  "_id": "124",
  "name": "carl",
  "timestamp": "2012-01-27T17:07:03Z"
}

If I wanted to get a list of document IDs that have duplicate "name" values, is this something I could do with Futon's Map/Reduce?

The result I was hoping to achieve is as follows:

{
  "name": "carl",
  "dupes": [ "123", "124" ]
}

..or..

{
  "carl": [ "123", "124" ]
}

.. which would be the duplicated value and the IDs of the documents that contain it.

I've tried a few different things with Map/Reduce, but so far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.

I know I could just pull the data I need with Perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand its benefits and limitations.

Another way I'm thinking about doing this is to use a single document like an RDBMS table:

{
  "_id": "names",
  "rec1": {
    "_id": "123",
    "name": "carl",
    "timestamp": "2012-01-27T17:06:03Z"
  },
  "rec2": {
    "_id": "124",
    "name": "carl",
    "timestamp": "2012-01-27T17:07:03Z"
  }
}

.. which should allow me to use the Map/Reduce functions in the way I originally thought. However, I'm not sure if this is ideal.

I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.

Thanks!

Edit: Fixed JSON syntax in some of the examples.

Solution

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.

In both cases, a map function like this should suffice:

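function (doc) {
    // Index each document by its name. _count ignores the value, so emit null.
    if (doc.name) {
        emit(doc.name, null);
    }
}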

For your reduce function, just enter _count.

When queried with group=true, your view output will look like this (based on your 2 documents):

{
    "rows": [
        { "key": "carl", "value": 2 }
    ]
}
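
To see why the grouped output has that shape, here is a minimal sketch in plain JavaScript that imitates what the view does with the two sample documents. It is not CouchDB code: runView is a hypothetical helper that only mimics a map function combined with the built-in _count reduce and group=true, and emit is passed in as an argument instead of being a global.

```javascript
// Hypothetical stand-in for a CouchDB view with a _count reduce and group=true.
// It only imitates the behaviour for illustration; it is not CouchDB's API.
function runView(docs, mapFn) {
    var emitted = [];
    docs.forEach(function (doc) {
        // Give the map function an emit() that records key/value pairs.
        mapFn(doc, function emit(key, value) {
            emitted.push({ id: doc._id, key: key, value: value === undefined ? null : value });
        });
    });

    // Count emitted rows per key, as _count does when rows are grouped by key.
    var counts = {};
    emitted.forEach(function (row) {
        counts[row.key] = (counts[row.key] || 0) + 1;
    });

    // Simple string sort here; CouchDB applies its own collation rules.
    return {
        rows: Object.keys(counts).sort().map(function (key) {
            return { key: key, value: counts[key] };
        })
    };
}

var docs = [
    { _id: "123", name: "carl", timestamp: "2012-01-27T17:06:03Z" },
    { _id: "124", name: "carl", timestamp: "2012-01-27T17:07:03Z" }
];

// In CouchDB emit is a global; in this sketch it is the second argument.
var result = runView(docs, function (doc, emit) {
    emit(doc.name);
});

console.log(JSON.stringify(result));
```

Without grouping, a reduce view collapses everything into a single row, which is why the group=true query parameter matters for this approach.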

From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.

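function (head, req) {
    // Run this against the view with group=true so each row is { key: name, value: count }.
    var row, duplicates = [];
    while ((row = getRow())) {
        // Keep only the names emitted by more than one document.
        if (row.value > 1) {
            duplicates.push(row);
        }
    }
    send(JSON.stringify(duplicates));
}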
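
If you would rather filter the list yourself, the client-side version is a one-liner over the rows of the grouped view response. This is only a sketch: the response object is hard-coded in place of an actual HTTP request, and the "dave" row is made up so there is something to filter out.

```javascript
// Sample grouped view response; the "dave" row is invented for illustration.
var response = {
    rows: [
        { key: "carl", value: 2 },
        { key: "dave", value: 1 }
    ]
};

// Keep only the names that occur in more than one document.
var duplicateNames = response.rows
    .filter(function (row) { return row.value > 1; })
    .map(function (row) { return row.key; });

console.log(duplicateNames);
```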

Read up on _list functions; they're pretty handy and versatile.
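
The counts tell you which names are duplicated, but not which documents carry them. One possible way to get the output shape from the question is a second _list function that runs over the same view with reduce=false, so that each row includes the ID of the document that emitted it. The sketch below makes several assumptions: the name listDuplicateIds exists only so the example can be called (in a design document the function would be anonymous), keys are plain strings, and the getRow and send definitions at the bottom are stand-ins so the code runs under Node.js outside CouchDB.

```javascript
// Sketch of a _list function that groups document ids by name.
// Assumes the view is queried with reduce=false, so every row carries
// the emitting document's id, and that rows arrive sorted by key.
function listDuplicateIds(head, req) {
    var row, dupes = {}, currentKey = null, ids = [];

    function flush() {
        // Only keep names that were emitted by more than one document.
        if (ids.length > 1) {
            dupes[currentKey] = ids;
        }
    }

    while ((row = getRow())) {
        if (row.key !== currentKey) {
            // The key changed, so the previous group is complete.
            flush();
            currentKey = row.key;
            ids = [];
        }
        ids.push(row.id);
    }
    flush();
    send(JSON.stringify(dupes));
}

// Stand-ins for CouchDB's getRow() and send(), for running under Node.js only.
// The "dave" row is invented to show a non-duplicate being dropped.
var sampleRows = [
    { id: "123", key: "carl", value: null },
    { id: "124", key: "carl", value: null },
    { id: "125", key: "dave", value: null }
];
var output = "";
function getRow() { return sampleRows.shift(); }
function send(chunk) { output += chunk; }

listDuplicateIds({}, {});
console.log(output);
```

Because view rows come back sorted by key, the function only has to hold one group of IDs at a time while it scans, plus the duplicates it has already found.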
