Diff() between two collections in MongoDB


Question

I have done some research. I apologize if this is a duplicate, but the solutions in other questions did not really fit my case, so I am asking a new question.

What is the best way to compare two collections using JavaScript?

I have thousands of these header records, stored as Mongo documents in this format:

{
    "url": "google.com",
    "headers": {
        "location": "http://www.google.com/",
        "content-type": "text/html; charset=UTF-8",
        "date": "Mon, 25 Mar 2013 18:12:08 GMT",
        "expires": "Wed, 24 Apr 2013 18:12:08 GMT",
        "cache-control": "public, max-age=2592000",
        "server": "gws",
        "content-length": "219",
        "x-xss-protection": "1; mode=block",
        "x-frame-options": "SAMEORIGIN"
    }
}

I ran my scraper today. In the future I will run it again and store the results in a second collection. I would also like to compare three specific header fields (server, x-aspnet-version, and x-powered-by) and detect whether any of the integers in them have incremented.

What is the best way to iterate through two collections and do a diff()?

Am I doing it right? Any suggestions would be much appreciated.

Recommended answer

Some suggestions:

You could use a combination of the url and the date accessed (or at least part of the datetime object) as the _id for these documents, since from what I can tell you plan to scrape each url once a month.

Example:

{
    "_id": {
        "url": "www.google.com",
        "date": ISODate("2013-03-01"),
    },
    // Other attributes
}

This yields performance, uniqueness, and query dividends (see this 4sq blog post). You could then query with something like:

db.collection.find({
    "_id": {
        "$gte": {
            "url": yourUrl,
            "date": rangeStart
        },
        "$lt": {
            "url": yourUrl,
            "date": rangeEnd
        }
    }
})

This returns nicely sorted results (by url, THEN by date, which seems to be just what you want). You could also use the _id index to perform covered queries (over the _id field only) if you just want the set of all the urls and months you have scraped; that would set you up nicely to go through each url one at a time.
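As a rough sketch of how the range query above might be assembled, the filter can be built by a small helper. The name buildIdRangeFilter and the choice of calendar-month boundaries in UTC are assumptions made for illustration, not part of the original answer. The one detail worth keeping is that the subdocument always lists url before date, since MongoDB compares embedded documents field by field in the order the fields appear.

```javascript
// Hypothetical helper: build a filter matching every snapshot of one URL
// scraped within a given calendar month (UTC).
function buildIdRangeFilter(url, year, month) {
  // month is 1-based (1 = January); Date.UTC expects a 0-based month.
  const rangeStart = new Date(Date.UTC(year, month - 1, 1));
  // First day of the following month; Date.UTC rolls over into the next year.
  const rangeEnd = new Date(Date.UTC(year, month, 1));
  return {
    _id: {
      // Field order matters: url first, then date, matching the stored _id.
      $gte: { url: url, date: rangeStart },
      $lt: { url: url, date: rangeEnd },
    },
  };
}

const filter = buildIdRangeFilter("www.google.com", 2013, 3);
// In the mongo shell this could then be passed as db.collection.find(filter).
```

Because both bounds carry the same url, the range only spans documents for that url whose date falls in [rangeStart, rangeEnd).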

If there are specific attributes of the document that you want to compare (headers.server, for example) and a specific comparison you want to make (looking for any increment in version numbers, for example), I would use some kind of regex to grab the elements relevant to the version number (a quick and dirty one might simply retrieve all numeric elements) and graph them for each url; I assume this would let you visualize changes to server software over time. You could just as easily report whenever any of these attributes changes by scanning the snapshots in order and firing an event when the strings are not identical, perhaps reporting the whole change or just its numerical part.
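The comparison described above could look roughly like the following. This is only a sketch under stated assumptions: the function names (extractNumbers, compareVersions, diffSnapshots) are hypothetical, the two snapshots are assumed to be already loaded as plain arrays of documents shaped like the one in the question, and the regex is the "quick and dirty" kind that treats every run of digits as one integer.

```javascript
// The three headers named in the question.
const TRACKED_HEADERS = ["server", "x-aspnet-version", "x-powered-by"];

// Quick-and-dirty extraction: every run of digits becomes one integer,
// e.g. "Apache/2.2.22 (Ubuntu)" yields [2, 2, 22].
function extractNumbers(value) {
  const matches = String(value || "").match(/\d+/g);
  return matches ? matches.map(Number) : [];
}

// Compare two number lists left to right, padding the shorter with zeros.
// Returns 1 if b is higher than a, -1 if lower, 0 if equal.
function compareVersions(a, b) {
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const left = a[i] || 0;
    const right = b[i] || 0;
    if (right > left) return 1;
    if (right < left) return -1;
  }
  return 0;
}

// oldDocs and newDocs: arrays of { url, headers } documents from two runs.
function diffSnapshots(oldDocs, newDocs) {
  const oldByUrl = new Map(oldDocs.map((doc) => [doc.url, doc]));
  const changes = [];
  for (const newDoc of newDocs) {
    const oldDoc = oldByUrl.get(newDoc.url);
    if (!oldDoc) continue; // URL was not present in the earlier run
    for (const name of TRACKED_HEADERS) {
      const before = (oldDoc.headers || {})[name];
      const after = (newDoc.headers || {})[name];
      if (before === after) continue; // identical strings: nothing to report
      const direction = compareVersions(extractNumbers(before), extractNumbers(after));
      changes.push({
        url: newDoc.url,
        header: name,
        before: before,
        after: after,
        // Only call it an increment when the header existed in both runs.
        incremented: before !== undefined && after !== undefined && direction === 1,
      });
    }
  }
  return changes;
}

const changes = diffSnapshots(
  [{ url: "example.com", headers: { server: "Apache/2.2.22" } }],
  [{ url: "example.com", headers: { server: "Apache/2.4.1" } }]
);
console.log(changes);
```

Indexing the earlier run by url keeps the comparison to a single pass over each snapshot. For collections too large to hold in memory, the same per-url comparison could instead be driven by a cursor sorted by url.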
