Session generation from log file analysis with pandas

Views: 161 Published: 2017/3/25 23:29:16 Tags: python pandas timedelta dataframe


Problem description


I'm analysing an Apache log file and I have imported it into a pandas DataFrame.

'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
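For readers who want to reproduce the setup, here is a minimal sketch of how a line like the one above might be parsed into DataFrame columns. The question does not show how the import was done, so the regular expression, the helper `parse_line`, and every column name other than `IP` and `Agent` are assumptions made for illustration; it also assumes the log uses the Apache "combined" format.

```python
import re
import pandas as pd

# Assumed layout: Apache "combined" log format.
# Group names other than IP and Agent are hypothetical.
LOG_PATTERN = re.compile(
    r'(?P<IP>\S+) \S+ \S+ \[(?P<Time>[^\]]+)\] '
    r'"(?P<Request>[^"]*)" (?P<Status>\d{3}) (?P<Size>\S+) '
    r'"(?P<Referer>[^"]*)" "(?P<Agent>[^"]*)"'
)

line = ('65.55.52.118 - - [30/May/2013:06:58:52 -0600] '
        '"GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 '
        '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"')


def parse_line(text):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(text)
    return match.groupdict() if match else None


# In practice the list would hold every line read from the log file.
records = [rec for rec in (parse_line(l) for l in [line]) if rec is not None]
df = pd.DataFrame(records)

# Convert the bracketed timestamp (with its UTC offset) to a datetime column.
df['Time'] = pd.to_datetime(df['Time'], format='%d/%b/%Y:%H:%M:%S %z')
```

Lines that do not match the pattern are skipped rather than raising, which is a design choice for this sketch and not something the question specifies.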

My dataframe:

I want to group this into sessions based on IP, Agent and time difference (if the gap between requests is greater than 30 minutes, it should be a new session).

It is easy to group the dataframe by IP and Agent, but how can I check this time difference? I hope the problem is clear.
sessions = df.groupby(['IP', 'Agent']).size()
UPDATE: df.index looks like this:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-30 06:00:41, ..., 2013-05-30 22:29:14]
Length: 31975, Freq: None, Timezone: None
Solution
I would do this using a shift and a cumsum (here's a simple example, with numbers instead of times - but they would work exactly the same):
In [11]: s = pd.Series([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])

In [12]: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False)  # *
Out[12]:
0    0
1    0
2    0
3    1
4    1
5    2
6    2
dtype: int64
* the need for skipna=False appears to be a bug.
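The same shift-and-cumsum idea can be sketched with real timestamps and the 30-minute threshold from the question. The sample times below are made up for illustration, and `diff()` is used as shorthand for `s - s.shift(1)`.

```python
import pandas as pd

# Made-up request times for a single visitor.
times = pd.Series(pd.to_datetime([
    '2013-05-30 06:00:41',
    '2013-05-30 06:10:00',
    '2013-05-30 06:50:00',  # 40 minutes after the previous request
    '2013-05-30 07:05:00',
    '2013-05-30 08:00:00',  # 55 minutes after the previous request
]))

gap = pd.Timedelta(minutes=30)

# diff() gives the time since the previous row; the first row is NaT,
# and NaT > gap evaluates to False, so the first row starts session 0.
new_session = times.diff() > gap

# Each True marks the start of a new session; the running total is the session number.
session_number = new_session.cumsum()
print(session_number.tolist())  # [0, 0, 1, 1, 2]
```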

Then you can use this in a groupby apply:
In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T

In [22]: df.columns = ['time', 'ip']

In [23]: df
Out[23]:
  time ip
0  1.1  A
1  1.7  A
2  2.5  A
3  2.6  B
4  2.7  B
5  3.4  B

In [24]: g = df.groupby('ip')

In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))

In [26]: df
Out[26]:
  time ip  session_number
0  1.1  A               0
1  1.7  A               1
2  2.5  A               2
3  2.6  B               0
4  2.7  B               0
5  3.4  B               1
Now you can group by 'ip' and 'session_number' (and analyse each session).
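Applied to the question's layout (a DatetimeIndex plus `IP` and `Agent` columns), the approach might look like the sketch below. The sample rows, the `Time` column created from the index, and the `summary` table are assumptions for illustration. It computes the per-group shift and cumulative sum directly instead of going through `apply`, which sidesteps any differences in how `apply` indexes its result.

```python
import pandas as pd

# Made-up sample shaped like the question's data: DatetimeIndex, IP, Agent.
df = pd.DataFrame(
    {
        'IP': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '1.1.1.1', '2.2.2.2'],
        'Agent': ['bot', 'bot', 'browser', 'bot', 'browser'],
    },
    index=pd.to_datetime([
        '2013-05-30 06:00:00',
        '2013-05-30 06:10:00',
        '2013-05-30 06:15:00',
        '2013-05-30 07:00:00',
        '2013-05-30 06:20:00',
    ]),
)

# Sort chronologically and move the index into a (hypothetical) 'Time' column.
df = df.sort_index().rename_axis('Time').reset_index()

keys = ['IP', 'Agent']
gap = pd.Timedelta(minutes=30)

# Time of the previous request from the same IP and Agent (NaT for the first one).
previous = df.groupby(keys)['Time'].shift(1)

# A request starts a new session when it comes more than 30 minutes after the previous one.
new_session = ((df['Time'] - previous) > gap).astype(int)

# Running count of session starts within each (IP, Agent) group.
df['session_number'] = new_session.groupby([df['IP'], df['Agent']]).cumsum()

# One row per session: first request, last request, and number of requests.
summary = df.groupby(keys + ['session_number'])['Time'].agg(['min', 'max', 'count'])
print(summary)
```

Sorting by time first matters here, since the shift compares each request with the one immediately before it in the same group.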
