Apache Beam per-user session windows are unmerged

Problem description

We have an app that has users; each user uses our app for something like 10-40 minutes per go, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").

(After this I'd like to count these higher-level events per day, but that's a separate question)

For this I've been looking into session windows, but all docs seem geared towards global session windows; I'd like to create them per user (which is also a natural partitioning).

I'm having trouble finding docs (python preferred) on how to do this. Could you point me in the right direction?

In other words: how do I create per-user, per-session windows, in order to output more structured (enriched) events?

import logging
from itertools import groupby

import apache_beam as beam


class DebugPrinter(beam.DoFn):
  """Just prints the element with logging"""
  def process(self, element, window=beam.DoFn.WindowParam):
    _, x = element
    logging.info(">>> Received %s %s with window=%s", x['jsonPayload']['value'], x['timestamp'], window)
    yield element

def sum_by_event_type(user_session_events):
  logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
  d = {}
  # NB: itertools.groupby only groups *consecutive* elements, so the events
  # need to be sorted by event type first for these counts to be correct.
  for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
    d[key] = len(list(group))
  logging.info("After counting: %s", d)
  return d

# ...

by_user = valid \
  | 'keyed_on_user_id'      >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))

session_gap = 5 * 60 # [s]; 5 minutes

user_sessions = by_user \
  | 'user_session_window'   >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
  | 'debug_printer'         >> beam.ParDo(DebugPrinter()) \
  | beam.CombinePerKey(sum_by_event_type)

Output

INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)

So as you can see, the Sessions() window doesn't expand the window; it only groups events that are very close together... What's being done wrong?

Answer

You can get it to work by adding a Group By Key transform after the windowing. You have assigned keys to the records but haven't actually grouped them together by key, and session windowing (which works per key) does not know that these events need to be merged together.
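Applied to the pipeline from the question, the fix is one extra step between the windowing and the rest of the processing. A minimal sketch (the 'group_by_user' label is illustrative; note that after the GroupByKey each element becomes a (user_id, iterable_of_events) pair, so downstream DoFns such as DebugPrinter have to be adjusted accordingly, as shown further below):

user_sessions = by_user \
  | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                             timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
  | 'group_by_user'       >> beam.GroupByKey()  # merges the per-key session windows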

To confirm this, I did a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and to be able to test it more quickly). All five events have the same key or user_id, but they "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. Since I use a session_gap of 5 seconds, I expect the first 4 elements to be merged into the same session. The 5th event comes 8 seconds after the 4th one, so it has to be relegated to the next session (gap over 5s). The data is created like this:

# Five events for the same user; timestamps are time.time() + 1, 2, 4, 8, 16 s,
# so consecutive events are 1, 2, 4 and 8 seconds apart.
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]

We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs thanks to the user_id field, we window into window.Sessions, and we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:

events = (p
  | 'Create Events'       >> beam.Create(data)
  | 'Add Timestamps'      >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
  | 'keyed_on_user_id'    >> beam.Map(lambda x: (x['user_id'], x))
  | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                             timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
  | 'Group'               >> beam.GroupByKey()
  | 'debug_printer'       >> beam.ParDo(DebugPrinter()))

where DebugPrinter is:

class DebugPrinter(beam.DoFn):
  """Just prints the element with logging"""
  def process(self, element, window=beam.DoFn.WindowParam):
    for x in element[1]:
      logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)

    yield element

If we test this without grouping by key, we get the same behavior as before:

INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)

But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.

INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)

Full code here.
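In case the link above is unavailable, here is a self-contained sketch of the full example (the run() wrapper and the logging setup are conveniences added here, not part of the original answer; assumes the Apache Beam Python SDK and the default DirectRunner):

import logging
import time

import apache_beam as beam
from apache_beam import window


class DebugPrinter(beam.DoFn):
  """Just prints the element with logging"""
  def process(self, element, window=beam.DoFn.WindowParam):
    for x in element[1]:
      logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
    yield element


def run():
  session_gap = 5  # [s]; events further apart than this start a new session
  now = time.time()
  data = [{'user_id': 'Thanos',
           'value': 'event_{}'.format(event),
           'timestamp': now + 2**event} for event in range(5)]

  with beam.Pipeline() as p:
    (p
     | 'Create Events'       >> beam.Create(data)
     | 'Add Timestamps'      >> beam.Map(lambda x: window.TimestampedValue(x, x['timestamp']))
     | 'keyed_on_user_id'    >> beam.Map(lambda x: (x['user_id'], x))
     | 'user_session_window' >> beam.WindowInto(
           window.Sessions(session_gap),
           timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
     | 'Group'               >> beam.GroupByKey()
     | 'debug_printer'       >> beam.ParDo(DebugPrinter()))


if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()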

Two additional things are worth mentioning. The first one is that, even when running this locally on a single machine with the DirectRunner, records can come unordered (event_3 is processed before event_2 in my case). This is done on purpose to simulate distributed processing, as documented here.
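Related to this first point: if downstream logic depends on the order of events within a session (for instance the groupby-based counting in the question), sort the grouped iterable by timestamp first. A minimal sketch, with SortedDebugPrinter being a hypothetical variant that is not part of the original answer:

class SortedDebugPrinter(beam.DoFn):
  """Like DebugPrinter, but iterates the session's events in timestamp order."""
  def process(self, element, window=beam.DoFn.WindowParam):
    user_id, events = element
    # GroupByKey gives no ordering guarantee, so sort explicitly.
    for x in sorted(events, key=lambda e: e['timestamp']):
      logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
    yield element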

The last one is that if you get a stack trace like this:

TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']

downgrade from the 2.10.0/2.11.0 SDK to 2.9.0 (for example, pip install apache-beam==2.9.0). See this answer for an example.
