Parsing a WhatsApp conversation log

Question

I am trying to write a parser for WhatsApp conversation logs. A minimal log file is given at the end of the question.

The log contains two kinds of messages. Normal messages have the syntax:

date time: Name: Message

As you can see, the Message can span multiple lines, and the Name can contain a colon (:).

The second kind are "event" messages, which can be of the following types:

date time: Name joined
date time: Name left
date time: Name was removed
date time: Name changed the subject to "GroupName"
date time: Name changed the group icon

I tried writing some regexes, but ran into several difficulties: how to handle multiline messages; how to parse the Name field (splitting on : does not work); how to build a regex that recognizes messages only from senders who are currently in the group; and finally how to parse the special messages (for example, checking whether the last word is joined is not a good idea, since a normal message can end with that word).

How can I parse such a log file and move everything into a dictionary?

More precisely, to answer the question in the comments, the output I have in mind is something like a nested dict: the first-level keys are the senders, the second level distinguishes 'Events' (such as joined, left, etc.) from 'Messages', and each of those holds a list of tuples.

>>> datab[Sender1]['Events']
[('Joined', data1, time1), ('Left', data2, time2)]

>>> datab[Sender2]['Messages']
[(data1, time1, Message1), (data2, time2, Message2)]

But if you can think of a smarter format, go for it!

29/03/14 15:48:05: John Smith changed the subject to "Test"

29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:10:40: John Smith: Hello!

29/03/14 16:11:40: Person:2: some random words,

29/03/14 16:12:40: Person3 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined

29/03/14 16:15:57: Person3 left

29/03/14 16:17:16: Person3 joined

29/03/14 16:18:21: Person:2 changed the group icon

29/03/14 16:19:16: Person3 was removed 

29/03/14 16:20:43: Person:2: Test message using as last word left

Recommended answer

You can use this pattern:

(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): (?P<name>\w+(?::\s*\w+)*|[\w\s]+?)(?:\s+(?P<action>joined|left|was removed|changed the (?:subject to "\w+"|group icon))|:\s(?P<message>(?:.+|\n(?!\n))+))
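As a rough illustration, the pattern could be plugged into Python's re module to build the nested dictionary described in the question. This is only a sketch: the names PATTERN, SAMPLE and parse_log are made up for the example, the sample is a shortened copy of the log from the question, and the trailing-newline cleanup is an assumption about how the file ends.

```python
import re
from collections import defaultdict

# The pattern from the answer, split across lines for readability only.
PATTERN = re.compile(
    r'(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): '
    r'(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)'
    r'(?:\s+(?P<action>joined|left|was removed|'
    r'changed the (?:subject to "\w+"|group icon))'
    r'|:\s(?P<message>(?:.+|\n(?!\n))+))'
)

# Shortened copy of the log from the question (entries separated by blank lines).
SAMPLE = '''29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined
'''


def parse_log(text):
    """Hypothetical helper: return {sender: {'Events': [...], 'Messages': [...]}}."""
    datab = defaultdict(lambda: {'Events': [], 'Messages': []})
    for m in PATTERN.finditer(text):
        date, time = m.group('datetime').split(' ')
        sender = m.group('name')
        if m.group('action') is not None:
            # Event line, e.g. "Person3 joined"
            datab[sender]['Events'].append((m.group('action'), date, time))
        else:
            # Normal message; a single newline at the very end of the file is
            # not followed by another newline, so the pattern captures it.
            message = m.group('message').rstrip('\n')
            datab[sender]['Messages'].append((date, time, message))
    return dict(datab)


datab = parse_log(SAMPLE)
print(datab['John Smith']['Events'])
print(datab['Person:2']['Messages'])
```

Because the action and message groups are alternatives, exactly one of them is set for each match, which is what separates events from normal messages even when a message ends with a word such as joined.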

To deal with multiline messages, I use a negative lookahead to forbid consecutive newline characters. However, you can make the pattern more tolerant by putting the start of the next block or the end of the string in the lookahead after the \n.
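One possible reading of that suggestion is sketched below in Python. The lookahead after \n is changed so that a newline is accepted unless the next thing after it (skipping any further newlines) is a new date-time prefix or the end of the string. The names DATETIME, TOLERANT and SAMPLE are invented for the example, and the exact lookahead is an assumption rather than the answer's own pattern.

```python
import re

# Date-time prefix that starts every log entry.
DATETIME = r'\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}'

# Same structure as the answer's pattern; only the lookahead after \n differs.
TOLERANT = re.compile(
    r'(?P<datetime>' + DATETIME + r'): '
    + r'(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)'
    + r'(?:\s+(?P<action>joined|left|was removed|'
    + r'changed the (?:subject to "\w+"|group icon))'
    + r'|:\s(?P<message>(?:.+|\n(?!\n*(?:' + DATETIME + r': |\Z)))+))'
)

# A message containing a blank line, followed directly by the next entry.
SAMPLE = (
    '29/03/14 16:13:40: John Smith: first paragraph\n'
    '\n'
    'second paragraph after a blank line\n'
    '29/03/14 16:15:57: Person3 left\n'
)

matches = list(TOLERANT.finditer(SAMPLE))
for m in matches:
    print(m.group('name'), repr(m.group('action')), repr(m.group('message')))
```

With this variant a message may contain blank lines, and entries no longer need to be separated by one. The trade-off is that a continuation line which itself begins with something shaped like a date-time prefix would be treated as the start of a new entry.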
