算法上的问题:确定“用户会话". [英] Algorithmic issue: determining "user sessions"

查看:90
本文介绍了算法上的问题:确定“用户会话".的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个真正有趣的问题(至少对我来说)要解决(不是,这不是家庭作业).等效于此:您需要确定用户在计算机前所处的会话"以及会话开始和结束时间".

您将获得进行任何用户交互的时间以及最长的不活动时间.如果两个用户输入之间的闲置时间大于或等于非活动时间,则它们是不同会话的一部分.

基本上,我得到的输入是这样的(输入不进行排序,我宁愿在确定会话之前也不对它们进行排序):

06:38
07:12
06:17
09:00
06:49
07:37
08:45
09:51
08:29

例如,闲置30分钟.

然后我需要找到三个会话:

[06:17...07:12]
[07:37...09:00]
[09:51...09:51]

如果将闲置时间设置为12小时,那么我将发现一个大型会话:

[06:17...09:51]

我该如何简单地解决这个问题?

有一个最小的闲置有效时间,大约为15分钟.

我不想事先排序的原因是,我会得到很多数据,并且仅将它们存储在内存中是有问题的.但是,这些数据中的大多数应属于同一会话(与数据量相比,会话数应相对较少,可能是数千到1 [每会话数千个用户输入]).

到目前为止,我正在考虑读取输入(例如06:38)并定义一个间隔[data-max_inactivity ... data + max_inactivity],对于每个新输入,请使用二项式( log n )进行搜索,以了解它是否属于已知间隔或创建新间隔.

我将对每个输入重复此操作,使解决方案 n log n AFAICT.同样,好事是它不会使用太多内存,因为它只会创建间隔(大多数输入将落入一个已知间隔).

此外,每次在一个已知间隔内时,我都必须更改间隔的下限或上限,然后查看是否需要与下一个间隔合并".例如(最长不活动时间为30分钟):

[06:00...07:00]  (because I got 06:30)
[06:00...07:00][07:45...08:45]   (because I later got 08:15)
[06:00...08:45] (because I just received 07:20)

我不知道描述是否很清楚,但这就是我需要做的.

这样的问题有名字吗?您将如何解决它?

编辑

如果我打算按照计划的方式解决数据结构,我对知道应该使用哪种数据结构非常感兴趣.我同时需要 log n 搜索和插入/合并功能.

解决方案

您要的是在线算法,即可以在每个新的输入时间内递增地计算一组新会话的算法. /p>

关于当前会话集的数据结构的选择,您可以使用平衡的二进制搜索树.每个会话由开始时间和结束时间的一对(start,end)表示.搜索树的节点按其start时间排序.由于您的会话至少由max_inactivity隔开,即没有两个会话重叠,因此这将确保end时间也按顺序排列.换句话说,按开始时间排序将已经连续对会话进行排序.

这里有一些用于插入的伪代码.为了符号上的方便,我们假设sessions是一个数组,尽管它实际上是一个二进制搜索树.

insert(time,sessions) = do
    i <- find index such that
         sessions[i].start <= time && time < session[i+1].start

    if (sessions[i].start + max_inactivity >= time)
        merge  time  into  session[i]
    else if (time >= sessions[i+1].start - max_inactivity)
        merge  time  into  sessions[i+1]
    else
        insert  (time,time)  into  sessions

    if (session[i] and session[i+1] overlap)
        merge  session[i] and session[i+1]

merge操作可以通过在二进制搜索树中删除元素并将其插入来实现.

此算法将花费时间O(n log m),其中m是最大会话数,您说的很小.

根据编程语言的不同,实现平衡的二进制搜索树是不容易的.这里的关键是您必须根据一个密钥拆分树,并且并非每个现成的库都支持该操作.对于Java,我将使用TreeSet<E>类.如前所述,元素类型E是由开始时间和结束时间给定的单个会话.它的floor()ceiling()方法将检索我在伪代码中用sessions[i]sessions[i+1]表示的会话.

I've got a real little interesting (at least to me) problem to solve (and, no, it is not homework). It is equivalent to this: you need to determine "sessions" and "sessions start and end time" a user has been on in front of his computer.

You get the time at which any user interaction was made and a maximum period of inactivity. If a time greater or equal than the period of inactivity elapsed between two user inputs, then they are part of different sessions.

Basically the input I get are this (the inputs aren't sorted and I'd rather not sort them before determining the sessions):

06:38
07:12
06:17
09:00
06:49
07:37
08:45
09:51
08:29

And, say, a period of inactivity of 30 minutes.

Then I need to find three sessions:

[06:17...07:12]
[07:37...09:00]
[09:51...09:51]

If the period of inactivity is set to 12 hours, then I'd just find one big session:

[06:17...09:51]

How can I solve this simply?

There's a minimum valid period of inactivity, which shall be about 15 minutes.

The reason I'd rather not sort beforehand is that I'll get a lot of data and only storing them in memory be problematic. However, most of these data shall be part of the same sessions (there shall be relatively few sessions compared to the amount of data, maybe something like thousands to 1 [thousands of user inputs per session]).

So far I am thinking about reading an input (say 06:38) and defining an interval [data-max_inactivity...data+max_inactivity] and, for each new input, use a dichotomic (log n) search to see if it falls in a known interval or create a new interval.

I'd repeat this for every input, making the solution n log n AFAICT. Also, the good thing is that it wouldn't use too much memory for it would only create intervals (and most inputs will fall in a known interval).

Also, every time if falls in a known interval, I'd have to change the interval's lower or upper bound and then see if I need to "merge" with the next interval. For example (for a max period of inactivity of 30 minutes):

[06:00...07:00]  (because I got 06:30)
[06:00...07:00][07:45...08:45]   (because I later got 08:15)
[06:00...08:45] (because I just received 07:20)

I don't know if the description is very clear, but that is what I need to do.

Does such a problem have a name? How would you go about solving it?

EDIT

I'm very interested in knowing which kind of data structure I should use if I plan to solve it the way I plan to. I need both log n search and insertion/merging ability.

解决方案

You are asking for an online algorithm, i.e. one that can calculate a new set of sessions incrementally for each new input time.

Concerning the choice of data structure for the current set of sessions, you can use a balanced binary search tree. Each sessions is represented by a pair (start,end) of start time and end time. The nodes of the search tree are ordered by their start time. Since your sessions are separated by at least max_inactivity, i.e. no two sessions overlap, this will ensure that the end times are ordered as well. In other words, ordering by start times will already order the sessions consecutively.

Here some pseudo-code for insertion. For notational convenience, we pretend that sessions is an array, though it's actually a binary search tree.

insert(time,sessions) = do
    i <- find index such that
         sessions[i].start <= time && time < session[i+1].start

    if (sessions[i].start + max_inactivity >= time)
        merge  time  into  session[i]
    else if (time >= sessions[i+1].start - max_inactivity)
        merge  time  into  sessions[i+1]
    else
        insert  (time,time)  into  sessions

    if (session[i] and session[i+1] overlap)
        merge  session[i] and session[i+1]

The merge operation can be implemented by deleting and inserting elements into the binary search tree.

This algorithm will take time O(n log m) where m is the maximum number of sessions, which you said is rather small.

Granted, implementing a balanced binary search tree is no easy task, depending on the programming language. The key here is that you have to split the tree according to a key and not every ready-made library supports that operation. For Java, I would use the TreeSet<E> class; as said, the element type E is a single session given by start and end time. Its floor() and ceiling() methods will retrieve the sessions I've denoted with sessions[i] and sessions[i+1] in my pseudo-code.

这篇关于算法上的问题:确定“用户会话".的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆