Map Reduce kind of query with need to correlate with previous row

Problem description

I have the following schema:

  • Client ID
  • Location name
  • Time of visit
  • Items purchased // this is a list

Since this is unstructured data, a flat DB like MongoDB is a natural fit. We are using MongoDB.

This data stores client visit information at various locations. Suppose I want to find the number of repeat visits on a particular day. The logic for a repeat visit is simple: if a person who visited a store today had also visited the same store on an earlier day, then they are a repeat visitor at that store.

I have some logic with which I could find the number of repeat visitors:

Query: Select * From Schema order by Location ID ASC, Client ID ASC, 'Time of Visit' ASC

Once the data from the above query is sorted, we can compare 'Time of Visit' between the previous and next rows whenever Location ID and Client ID match. If the difference is more than 1 day, it's a repeat visit.
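The sort-then-compare idea can be sketched in plain JavaScript. This is only an illustration of the logic, not something to run against a large collection. It assumes records shaped like the sample data in the answer (`clientId` as a number, `location` as a string, `tov` as a `Date`), and it follows the example's rule: a repeat is a visit on a later calendar day (UTC) than the client's first visit to that location, counted once per client per day. The names `dayKey` and `countRepeatVisits` are hypothetical.

```javascript
// Sketch only: field names (clientId, location, tov) are assumptions
// borrowed from the sample data in the answer.
function dayKey(date) {
  // UTC calendar day, e.g. "2013-01-03"
  return date.toISOString().slice(0, 10);
}

function countRepeatVisits(visits) {
  // Sort by location, then client, then time of visit.
  const sorted = [...visits].sort((a, b) =>
    a.location.localeCompare(b.location) ||
    a.clientId - b.clientId ||
    a.tov - b.tov);

  const counts = {};      // "location|day" -> number of repeat visitors
  let prev = null;        // previous row in sort order
  let firstDay = null;    // first day this client visited this location
  let lastCounted = null; // last day already counted for this client/location

  for (const v of sorted) {
    const day = dayKey(v.tov);
    const sameGroup = prev !== null &&
      prev.location === v.location && prev.clientId === v.clientId;
    if (!sameGroup) {
      // New (location, client) run: this row is the first visit.
      firstDay = day;
      lastCounted = null;
    } else if (day > firstDay && day !== lastCounted) {
      // Later calendar day, not yet counted for this client.
      const key = v.location + "|" + day;
      counts[key] = (counts[key] || 0) + 1;
      lastCounted = day;
    }
    prev = v;
  }
  return counts;
}
```

Because this walks every row on the client side, it mainly serves to pin down what "repeat visit" means before expressing it as a server-side query.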

Since this data set is very large, a join type of query would be highly inefficient (even if it were possible in MongoDB).

Now I understand that there is a map reduce framework in MongoDB. However, is it possible to compare the previous and current records, do some computation based on that, and then trigger map/reduce?

Example:

  • Customer A visits store B on day 1 // No repeat visit
  • Customer A visits store B again on day 1 // Still no repeat visit
  • Customer A visits store B on day 2 // Repeat visit for customer A on day 2
  • Customer A visits store B on day 2 // Already counted as a repeat visit on day 2
  • Customer A visits store B on day 3 // Repeat visit for customer A on day 3
  • Customer C visits store B on day 2 // First visit for customer C, not a repeat visit
  • Customer C visits store B on day 4 // Repeat visit on day 4
  • Customer D visits store B on day 5 // First visit, not a repeat visit

Final output for repeat visits:

  • Store B, day 1: 0 repeat visits
  • Store B, day 2: 2 repeat visits
  • Store B, day 3: 2 repeat visits
  • Store B, day 4: 1 repeat visit
  • Store B, day 5: 0 repeat visits

Recommended answer

If you were doing this in a relational database, you would not compare visits row by row; instead you would use an aggregation query to find repeat visits (using SELECT ... GROUP BY). You should do it the same way in MongoDB.

First you need to aggregate visits per customer per store per day:

group1 = { "$group" : {
        "_id" : {
            "c" : "$clientId",
            "l" : "$location",
            "day" : {
                "y" : {
                    "$year" : "$tov"
                },
                "m" : {
                    "$month" : "$tov"
                },
                "d" : {
                    "$dayOfMonth" : "$tov"
                }
            }
        },
        "visits" : {
            "$sum" : 1
        }
    }
};
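As a possible variation, the three-part day key can be collapsed into a single string with `$dateToString`. This is a sketch under the assumption that the server is MongoDB 3.0 or later (where that operator is available) and that `tov` is stored as a date; the name `group1Compact` is hypothetical and not part of the original answer.

```javascript
// Alternative to group1 (hypothetical name: group1Compact).
// Assumes MongoDB 3.0+ and that "tov" is stored as a BSON date.
const group1Compact = { "$group" : {
        "_id" : {
            "c" : "$clientId",
            "l" : "$location",
            // One string per UTC day, e.g. "2013-01-03",
            // replacing the separate y/m/d fields.
            "day" : { "$dateToString" : { "format" : "%Y-%m-%d", "date" : "$tov" } }
        },
        "visits" : { "$sum" : 1 }
    }
};
```

Later stages that only read `$_id.c` and `$_id.l` are unaffected by this change, and a string day key sorts chronologically, which can make it easier to compare days.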

EDIT: Since you want only repeat DAYS, next you would group by customer and by store, and count how many different DAYS that customer visited that store:

group2 = {"$group" : 
    {"_id" : {
        "c" : "$_id.c",
        "s" : "$_id.l"
    },
    "totalDays" : {
        "$sum" : 1
    }
} };

Then you want to include only the records from above where the same customer visited the same store on more than one day:

match = { "$match" : { "totalDays" : { "$gt" : 1 } } };

Here's a sample data set and the result of the aggregation using the above pipeline operations:

> db.visits.find({},{_id:0,purchases:0}).sort({location:1, clientId:1, tov:1})
{ "clientId" : 1, "location" : "l1", "tov" : ISODate("2013-01-01T20:00:00Z") }
{ "clientId" : 1, "location" : "l1", "tov" : ISODate("2013-01-01T21:00:00Z") }
{ "clientId" : 1, "location" : "l1", "tov" : ISODate("2013-01-03T20:00:00Z") }
{ "clientId" : 2, "location" : "l1", "tov" : ISODate("2013-01-01T21:00:00Z") }
{ "clientId" : 3, "location" : "l1", "tov" : ISODate("2013-01-01T21:00:00Z") }
{ "clientId" : 3, "location" : "l1", "tov" : ISODate("2013-01-02T21:00:00Z") }
{ "clientId" : 1, "location" : "l2", "tov" : ISODate("2013-01-01T23:00:00Z") }
{ "clientId" : 3, "location" : "l2", "tov" : ISODate("2013-01-02T21:00:00Z") }
{ "clientId" : 3, "location" : "l2", "tov" : ISODate("2013-01-02T21:00:00Z") }
{ "clientId" : 1, "location" : "l3", "tov" : ISODate("2013-01-03T20:00:00Z") }
{ "clientId" : 2, "location" : "l3", "tov" : ISODate("2013-01-04T20:00:00Z") }
{ "clientId" : 4, "location" : "l3", "tov" : ISODate("2013-01-04T20:00:00Z") }
{ "clientId" : 4, "location" : "l3", "tov" : ISODate("2013-01-04T21:00:00Z") }
{ "clientId" : 4, "location" : "l3", "tov" : ISODate("2013-01-04T22:00:00Z") }

> db.visits.aggregate(group1, group2, match)
{
    "result" : [
    {
        "_id" : {
            "c" : 3,
            "s" : "l1"
        },
        "totalDays" : 2
    },
    {
        "_id" : {
            "c" : 1,
            "s" : "l1"
        },
        "totalDays" : 2
    }
    ],
    "ok" : 1
}
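To see why only two client/store pairs survive, the three stages can be re-traced in plain JavaScript over the same sample rows. This is an illustrative stand-in that runs without a database, not the aggregation framework itself; the variable names are hypothetical, and days are taken in UTC, which is also how `$year`, `$month` and `$dayOfMonth` read the stored dates.

```javascript
// Plain-JavaScript re-trace of group1 -> group2 -> match (no MongoDB needed).
// Rows are [clientId, location, tov], copied from the sample data.
const rows = [
  [1, "l1", "2013-01-01T20:00:00Z"],
  [1, "l1", "2013-01-01T21:00:00Z"],
  [1, "l1", "2013-01-03T20:00:00Z"],
  [2, "l1", "2013-01-01T21:00:00Z"],
  [3, "l1", "2013-01-01T21:00:00Z"],
  [3, "l1", "2013-01-02T21:00:00Z"],
  [1, "l2", "2013-01-01T23:00:00Z"],
  [3, "l2", "2013-01-02T21:00:00Z"],
  [3, "l2", "2013-01-02T21:00:00Z"],
  [1, "l3", "2013-01-03T20:00:00Z"],
  [2, "l3", "2013-01-04T20:00:00Z"],
  [4, "l3", "2013-01-04T20:00:00Z"],
  [4, "l3", "2013-01-04T21:00:00Z"],
  [4, "l3", "2013-01-04T22:00:00Z"]
];

// group1: one entry per (client, location, UTC day).
const clientStoreDays = new Set(
  rows.map(([c, l, tov]) => `${c}|${l}|${tov.slice(0, 10)}`));

// group2: count distinct days per (client, location).
const totalDays = new Map();
for (const key of clientStoreDays) {
  const [c, l] = key.split("|");
  const pair = `${c}|${l}`;
  totalDays.set(pair, (totalDays.get(pair) || 0) + 1);
}

// match: keep only pairs seen on more than one day.
const repeaters = [...totalDays]
  .filter(([, days]) => days > 1)
  .map(([pair]) => pair);

console.log(repeaters); // the pairs "1|l1" and "3|l1"
```

Client 4 visits l3 three times and client 3 visits l2 twice, but each on a single day, so group1 collapses those visits into one entry and the final filter drops them.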
