MongoDB - Use aggregation framework or mapreduce for matching array of strings within documents (profile matching)

Question

I'm building an application that could be likened to a dating application.

I've got some documents with a structure like this:

$ db.profiles.find().pretty()

[
  {
    "_id": 1,
    "firstName": "John",
    "lastName": "Smith",
    "fieldValues": [
      "favouriteColour|red",
      "food|pizza",
      "food|chinese"
    ]
  },
  {
    "_id": 2,
    "firstName": "Sarah",
    "lastName": "Jane",
    "fieldValues": [
      "favouriteColour|blue",
      "food|pizza",
      "food|mexican",
      "pets|yes"
    ]
  },
  {
    "_id": 3,
    "firstName": "Rachel",
    "lastName": "Jones",
    "fieldValues": [
      "food|pizza"
    ]
  }
]

What I'm trying to do is identify profiles that match each other on one or more fieldValues.

So, in the example above, my ideal result would look something like:

<some query>

result:
[
  {
    "_id": "507f1f77bcf86cd799439011",
    "dateCreated": "2013-12-01",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 2,
        "firstName": "Sarah",
        "lastName": "Jane",
        "fieldValues": [
          "favouriteColour|blue",
          "food|pizza",
          "food|mexican",
          "pets|yes"
        ]
      }
    ]
  },
  {
    "_id": "356g1dgk5cf86cd737858595",
    "dateCreated": "2013-12-02",
    "profiles": [
      {
        "_id": 1,
        "firstName": "John",
        "lastName": "Smith",
        "fieldValues": [
          "favouriteColour|red",
          "food|pizza",
          "food|chinese"
        ]
      },
      {
        "_id": 3,
        "firstName": "Rachel",
        "lastName": "Jones",
        "fieldValues": [
          "food|pizza"
        ]
      }
    ]
  }
]

I've thought about doing this either as a map reduce, or with the aggregation framework.

Either way, the 'result' would be persisted to a collection (as per the 'results' above).

My question is: which of the two would be more suitable, and where would I start implementing this?

Edit

In a nutshell, the model can't easily be changed.
This isn't like a 'profile' in the traditional sense.

What I'm basically looking to do (in pseudocode) is along the lines of:

foreach profile in db.profiles.find()
  foreach otherProfile in db.profiles.find({"_id": {$ne: profile._id}})
    if profile.fieldValues matches any otherProfile.fieldValues
      //it's a match!

Obviously that kind of operation is very very slow!

It may also be worth mentioning that this data is never displayed; it's literally just a string value that's used for 'matching'.

Recommended answer

MapReduce runs JavaScript in a separate thread, using the code you provide to emit and reduce parts of your documents in order to aggregate on certain fields. You can certainly treat this exercise as aggregating over each "fieldValue". The aggregation framework can do this as well, and would be much faster because it runs on the server in C++ rather than in a separate JavaScript thread. However, the aggregation framework may return more than 16MB of data, in which case you would need to do more complex partitioning of the data set.
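
To make the MapReduce option more concrete, here is a minimal sketch of map and reduce functions keyed on each fieldValue. It is an illustration under assumptions, not the answerer's code: `mapFn`, `reduceFn` and `runMapReduce` are hypothetical names, and `runMapReduce` is an in-memory stand-in for the server (not a MongoDB API) so that the snippet runs in plain JavaScript. In the shell, the same two functions would instead be passed to `db.profiles.mapReduce`, where `emit` is supplied by the server.

```javascript
// Sample documents copied from the question (names omitted for brevity).
const profiles = [
  { _id: 1, fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, fieldValues: ["food|pizza"] },
];

// On the server, emit is built in; here the stand-in driver below assigns it.
let emit;

// map: called once per document (bound to `this`), emits one pair per fieldValue.
function mapFn() {
  const id = this._id;
  this.fieldValues.forEach(function (value) {
    emit(value, { ids: [id] });
  });
}

// reduce: merges the partial id lists emitted for one key.
// The return value has the same shape as the emitted values.
function reduceFn(key, values) {
  const ids = [];
  values.forEach(function (v) { ids.push(...v.ids); });
  return { ids: ids };
}

// Hypothetical in-memory driver: collects emitted values per key, then reduces.
function runMapReduce(docs, map, reduce) {
  const buckets = new Map();
  emit = function (key, value) {
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  const out = [];
  buckets.forEach(function (values, key) {
    // Keys with a single emitted value are passed through without reducing.
    out.push({ _id: key, value: values.length > 1 ? reduce(key, values) : values[0] });
  });
  return out;
}

const grouped = runMapReduce(profiles, mapFn, reduceFn);
// Keep only the fieldValues shared by more than one profile.
const shared = grouped.filter(function (g) { return g.value.ids.length > 1; });
console.log(JSON.stringify(shared));
```

For the three sample profiles, only "food|pizza" is shared by more than one profile, so it is the only key that survives the final filter.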

But the problem seems a lot simpler than that. You just want to find, for each profile, which other profiles share particular attributes with it. Without knowing the size of your dataset or your performance requirements, I'm going to assume that you have an index on fieldValues so that querying on it is efficient. You can then get the results you want with this simple loop:

> db.profiles.find().forEach(function (p) {
      print("Matching profiles for " + tojson(p));
      printjson(
          db.profiles.find({
              "fieldValues": {"$in": p.fieldValues},  // shares at least one value
              "_id": {"$gt": p._id}                   // report each pair only once
          }).toArray()
      );
  });

Output:

Matching profiles for {
    "_id" : 1,
    "firstName" : "John",
    "lastName" : "Smith",
    "fieldValues" : [
        "favouriteColour|red",
        "food|pizza",
        "food|chinese"
    ]
}
[
    {
        "_id" : 2,
        "firstName" : "Sarah",
        "lastName" : "Jane",
        "fieldValues" : [
            "favouriteColour|blue",
            "food|pizza",
            "food|mexican",
            "pets|yes"
        ]
    },
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 2,
    "firstName" : "Sarah",
    "lastName" : "Jane",
    "fieldValues" : [
        "favouriteColour|blue",
        "food|pizza",
        "food|mexican",
        "pets|yes"
    ]
}
[
    {
        "_id" : 3,
        "firstName" : "Rachel",
        "lastName" : "Jones",
        "fieldValues" : [
            "food|pizza"
        ]
    }
]
Matching profiles for {
    "_id" : 3,
    "firstName" : "Rachel",
    "lastName" : "Jones",
    "fieldValues" : [
        "food|pizza"
    ]
}
[ ]
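
The loop prints its matches rather than storing them. As a rough sketch of how the pair documents from the question could be built, the following plain JavaScript mirrors the `$in` and `$gt` conditions in memory. `sharesAny` and `buildPairs` are hypothetical helper names, and the data is the sample from the question; in the shell, each resulting document could then be inserted into a results collection of your choosing.

```javascript
// Sample documents copied from the question.
const profiles = [
  { _id: 1, firstName: "John", lastName: "Smith",
    fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, firstName: "Sarah", lastName: "Jane",
    fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, firstName: "Rachel", lastName: "Jones",
    fieldValues: ["food|pizza"] },
];

// True when the two arrays have at least one string in common
// (the in-memory counterpart of {"fieldValues": {"$in": a}}).
function sharesAny(a, b) {
  const seen = new Set(a);
  return b.some(function (v) { return seen.has(v); });
}

// Builds one document per matching pair; the q._id > p._id check plays the
// role of {"_id": {"$gt": p._id}} so each pair appears only once.
function buildPairs(docs) {
  const pairs = [];
  docs.forEach(function (p) {
    docs.forEach(function (q) {
      if (q._id > p._id && sharesAny(p.fieldValues, q.fieldValues)) {
        pairs.push({ profiles: [p, q] });
      }
    });
  });
  return pairs;
}

const pairs = buildPairs(profiles);
pairs.forEach(function (pair) {
  console.log(pair.profiles.map(function (p) { return p._id; }).join(" <-> "));
});
```

For the sample data this produces the pairs (1, 2), (1, 3) and (2, 3), the same matches the loop prints. Note that this in-memory version compares every pair of profiles, whereas the shell loop can rely on the assumed index on fieldValues.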

Obviously, you can tweak the query so that it does not exclude already matched profiles (by changing {$gt:p._id} to {$ne:p._id}, among other tweaks). But I'm not sure what additional value you would get from the aggregation framework or mapreduce, as this is not really aggregating a single collection on one of its fields (judging by the output format you show). If your output format requirements are flexible, it is certainly possible to use one of the built-in aggregation options as well.

I did check what this would look like when aggregating around individual fieldValues, and it's not bad. It might help you if your output can match this:

> db.profiles.aggregate([
      {$unwind: "$fieldValues"},
      {$group: {
          _id: "$fieldValues",
          matchedProfiles: {$push: {
              id: "$_id",
              name: {$concat: ["$firstName", " ", "$lastName"]}
          }},
          num: {$sum: 1}
      }},
      {$match: {num: {$gt: 1}}}
  ]);
{
    "result" : [
        {
            "_id" : "food|pizza",
            "matchedProfiles" : [
                {
                    "id" : 1,
                    "name" : "John Smith"
                },
                {
                    "id" : 2,
                    "name" : "Sarah Jane"
                },
                {
                    "id" : 3,
                    "name" : "Rachel Jones"
                }
            ],
            "num" : 3
        }
    ],
    "ok" : 1
}

This basically says: "For each fieldValue ($unwind), group by fieldValue into an array of matching profile _ids and names, counting how many matches each fieldValue accumulates ($group), and then exclude the ones that have only one matching profile ($match)."
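
The three stages can also be traced by hand. The following is a minimal in-memory sketch in plain JavaScript, assuming the sample documents from the question; it only imitates what `$unwind`, `$group` and `$match` do to this data and is not how the server implements them.

```javascript
// Sample documents copied from the question.
const profiles = [
  { _id: 1, firstName: "John", lastName: "Smith",
    fieldValues: ["favouriteColour|red", "food|pizza", "food|chinese"] },
  { _id: 2, firstName: "Sarah", lastName: "Jane",
    fieldValues: ["favouriteColour|blue", "food|pizza", "food|mexican", "pets|yes"] },
  { _id: 3, firstName: "Rachel", lastName: "Jones",
    fieldValues: ["food|pizza"] },
];

// $unwind: one row per (profile, fieldValue) combination.
const unwound = profiles.flatMap(function (p) {
  return p.fieldValues.map(function (value) {
    return { _id: p._id, firstName: p.firstName, lastName: p.lastName, fieldValues: value };
  });
});

// $group: collect the profiles and a count for each distinct fieldValue.
const groups = new Map();
unwound.forEach(function (row) {
  if (!groups.has(row.fieldValues)) {
    groups.set(row.fieldValues, { _id: row.fieldValues, matchedProfiles: [], num: 0 });
  }
  const group = groups.get(row.fieldValues);
  group.matchedProfiles.push({ id: row._id, name: row.firstName + " " + row.lastName });
  group.num += 1;
});

// $match: keep only the fieldValues shared by more than one profile.
const result = Array.from(groups.values()).filter(function (g) { return g.num > 1; });
console.log(JSON.stringify(result, null, 2));
```

After the unwind step there are eight rows (3 + 4 + 1 fieldValues), which collapse into six groups, of which only "food|pizza" has more than one profile.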
