Parsing WhatsApp messages: how to parse multiline texts
Question
I have a file of WhatsApp messages that I want to save in CSV format. The file looks like this:
[04/02/2018, 20:56:55] Name1: Messages to this chat and calls are now secured with end-to-end encryption.
[04/02/2018, 20:56:55] Name1: Content1.
More content.
[04/02/2018, 23:24:44] Name2: Content2.
I want to parse the messages into date, sender, and text columns. My code:
with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        date = datetime.strptime(
            re.search('(?<=\[)[^]]+(?=\])', line).group(),
            '%d/%m/%Y, %H:%M:%S')
        sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()
        text = line.rsplit(']', 1)[-1].rsplit(': ', 1)[-1]
        new_line = str(date) + ',' + sender + ',' + text
        outfile.write(new_line)
I have problems handling multiline texts. (I sometimes started a new line within a message; in that case the line contains only text that is supposed to be part of the previous line.) I'm also open to a more Pythonic way of parsing the datetime, sender, and text. My code raises an error because not every line matches all the patterns (lines that do match are parsed correctly into date, sender, and text):
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-efbcb430243d> in <module>()
      3     for line in infile:
      4         date = datetime.strptime(
----> 5             re.search('(?<=\[)[^]]+(?=\])', line).group(),
      6             '%d/%m/%Y, %H:%M:%S')
      7         sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()

AttributeError: 'NoneType' object has no attribute 'group'
Idea: maybe use try/except and then somehow append the text-only line to the previous one? (Doesn't sound Pythonic.)
Recommended answer
Here is something that should work to append the extra text to the previous line.
It checks whether either regex fails to match. If so, the line is a continuation, and it is written to the file without a preceding newline (\n) so that it is appended to the previous line in the file.
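import re
from datetime import datetime

start = True
with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        line = line.rstrip('\n')  # drop the trailing newline; newlines are added explicitly below
        time = re.search(r'(?<=\[)[^]]+(?=\])', line)
        # [^:]+ stops at the first colon, so colons inside the message text are ignored
        sender = re.search(r'(?<=\] )[^:]+(?=:)', line)
        if sender and time:
            date = datetime.strptime(
                time.group(),
                '%d/%m/%Y, %H:%M:%S')
            sender = sender.group()
            # str.rsplit takes a literal separator, not a regex, so use re.split
            text = re.split(r'\].+?: ?', line, maxsplit=1)[-1]
            new_line = str(date) + ',' + sender + ',' + text
            if not start: new_line = '\n' + new_line
            outfile.write(new_line)
        else:
            # continuation line: append it to the previous record
            outfile.write(' ' + line)
        start = False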
It also looks like you weren't writing newlines to the file even when the regex matched, so I added that in too.
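Since the question also asks for a more Pythonic way to parse the date, sender, and text, here is a minimal sketch of one alternative: a single compiled pattern for the message header, with continuation lines merged into the previous row, and the csv module handling the output. The names HEADER and parse_chat are made up for this illustration, and the sketch assumes every message header has exactly the "[dd/mm/yyyy, hh:mm:ss] sender: text" shape shown in the question. It writes to an in-memory buffer to stay self-contained.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical pattern: assumes headers look like "[dd/mm/yyyy, hh:mm:ss] sender: text".
HEADER = re.compile(r'^\[(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] ([^:]+): ?(.*)$')


def parse_chat(lines):
    """Return a list of [date, sender, text] rows, merging continuation lines."""
    rows = []
    for raw in lines:
        line = raw.rstrip('\n')
        match = HEADER.match(line)
        if match:
            stamp, sender, text = match.groups()
            date = datetime.strptime(stamp, '%d/%m/%Y, %H:%M:%S')
            rows.append([date, sender, text])
        elif rows:
            # No header: treat the line as a continuation of the previous message.
            rows[-1][2] += ' ' + line
    return rows


sample = [
    "[04/02/2018, 20:56:55] Name1: Content1.\n",
    "More content.\n",
    "[04/02/2018, 23:24:44] Name2: Content2.\n",
]
rows = parse_chat(sample)

# csv.writer quotes any field that contains a comma or a quote character.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write a real file instead, pass an open file object to parse_chat (iterating over it yields lines) and give csv.writer a file opened with newline=''. Letting csv.writer build each row also means that commas inside a message do not break the columns, which plain string concatenation would not guarantee.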