Session generation from log file analysis with pandas
Problem description
I'm analysing an Apache log file and I have imported it into a pandas DataFrame.
'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detailedAddVen.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
My dataframe:
I want to group this into sessions based on IP, Agent and time difference (if the time difference between requests is greater than 30 minutes, it should be a new session). It is easy to group the DataFrame by IP and Agent, but how do I check this time difference? Hope the problem is clear.
sessions = df.groupby(['IP', 'Agent']).size()
UPDATE: df.index is as follows:
<class 'pandas.tseries.index.DatetimeIndex'> [2013-05-30 06:00:41, ..., 2013-05-30 22:29:14] Length: 31975, Freq: None, Timezone: None
Solution

I would do this using a shift and a cumsum (here's a simple example, with numbers instead of times, but they would work exactly the same):

    In [11]: s = pd.Series([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])

    In [12]: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False)  # *
    Out[12]:
    0    0
    1    0
    2    0
    3    1
    4    1
    5    2
    6    2
    dtype: int64

* the need for skipna=False appears to be a bug.
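
With real timestamps the same shift-and-cumsum idea might look like the sketch below. The timestamps are made up for illustration, and `timeout` is a hypothetical name for the 30-minute threshold from the question; `diff()` is used as shorthand for `s - s.shift(1)`.

```python
import pandas as pd

# Hypothetical timestamps standing in for the values of the log's DatetimeIndex.
times = pd.Series(pd.to_datetime([
    "2013-05-30 06:00:41",
    "2013-05-30 06:10:00",
    "2013-05-30 06:45:00",  # 35 minutes after the previous hit
    "2013-05-30 06:50:00",
    "2013-05-30 08:00:00",  # 70 minutes after the previous hit
]))

# Assumed session timeout, taken from the question's 30-minute rule.
timeout = pd.Timedelta(minutes=30)

# diff() computes s - s.shift(1); the first gap is NaT, and comparing NaT
# with the timeout gives False, so the first row starts session 0.
new_session = times.diff() > timeout

# Each True marks the start of a new session; the running sum numbers them.
session_number = new_session.cumsum()
print(session_number.tolist())
```

Each gap longer than the timeout bumps the counter by one, so rows that share a number belong to the same session.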
Then you can use this in a groupby apply:

    In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T

    In [22]: df.columns = ['time', 'ip']

    In [23]: df
    Out[23]:
      time ip
    0  1.1  A
    1  1.7  A
    2  2.5  A
    3  2.6  B
    4  2.7  B
    5  3.4  B

    In [24]: g = df.groupby('ip')

    In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))

    In [26]: df
    Out[26]:
      time ip  session_number
    0  1.1  A               0
    1  1.7  A               1
    2  2.5  A               2
    3  2.6  B               0
    4  2.7  B               0
    5  3.4  B               1

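
Applied to a frame shaped like the one in the question, this could be sketched as follows. The sample rows, the shortened Agent strings and the names `hits`, `keys` and `timeout` are assumptions for illustration only; the column names `IP` and `Agent` and the DatetimeIndex come from the question. A grouped `shift` is used here instead of `apply`, which is one possible way to get the same per-group difference.

```python
import pandas as pd

# Hypothetical miniature of the log DataFrame: DatetimeIndex plus IP/Agent columns.
df = pd.DataFrame(
    {
        "IP": ["65.55.52.118"] * 3 + ["10.0.0.1"] * 2,
        "Agent": ["bingbot"] * 3 + ["Firefox"] * 2,
    },
    index=pd.to_datetime([
        "2013-05-30 06:00:41",
        "2013-05-30 06:10:00",
        "2013-05-30 07:00:00",
        "2013-05-30 06:05:00",
        "2013-05-30 06:20:00",
    ]),
)

timeout = pd.Timedelta(minutes=30)  # assumed session timeout
keys = ["IP", "Agent"]

# Sort by time and turn the index into an ordinary column named "time",
# so consecutive rows within a group are consecutive hits of one visitor.
hits = df.sort_index().rename_axis("time").reset_index()

# Per-group equivalent of s - s.shift(1): time since the same visitor's previous hit.
gaps = hits["time"] - hits.groupby(keys)["time"].shift(1)

# 1 where a new session starts, 0 otherwise (the first hit of a group has a NaT gap).
hits["new_session"] = (gaps > timeout).astype(int)

# Running count of session starts within each (IP, Agent) group.
hits["session_number"] = hits.groupby(keys)["new_session"].cumsum()
print(hits)
```

Session numbers restart at 0 for every (IP, Agent) pair, so a session is identified by the combination of IP, Agent and session_number.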
Now you can groupby 'ip' and 'session_number' (and analyse each session).
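
As one possible way to analyse each session, the sketch below summarises the labelled example frame per session. The summary columns `start`, `end`, `hits` and `duration` are hypothetical names chosen for illustration, and the frame is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# The example frame from above, rebuilt with its session labels already attached.
df = pd.DataFrame({
    "time": [1.1, 1.7, 2.5, 2.6, 2.7, 3.4],
    "ip": list("AAABBB"),
    "session_number": [0, 1, 2, 0, 0, 1],
})

# One row per (ip, session_number): first hit, last hit and number of hits.
sessions = (
    df.groupby(["ip", "session_number"])["time"]
      .agg(["min", "max", "count"])
      .rename(columns={"min": "start", "max": "end", "count": "hits"})
)

# Length of each session; single-hit sessions have a duration of 0.
sessions["duration"] = sessions["end"] - sessions["start"]
print(sessions)
```

With real timestamps in the time column, the same aggregation would give start and end times and a Timedelta duration per session.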