Parsing WhatsApp messages: how to parse multiline texts

Question

I have a file of WhatsApp messages that I want to save in CSV format. The file looks like this:

[04/02/2018, 20:56:55] Name1: ‎Messages to this chat and calls are now secured with end-to-end encryption.
[04/02/2018, 20:56:55] Name1: Content1.
More content.
[04/02/2018, 23:24:44] Name2: Content2.

I want to parse the messages into date, sender, and text columns. My code:

with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        date = datetime.strptime(
            re.search('(?<=\[)[^]]+(?=\])', line).group(), 
            '%d/%m/%Y, %H:%M:%S')
        sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()
        text = line.rsplit(']', 1)[-1].rsplit(': ', 1)[-1]

        new_line = str(date) + ',' + sender + ',' + text
        outfile.write(new_line)

I have trouble handling multi-line texts. (I sometimes started a new line inside a message; in that case the line contains only text that belongs to the previous message.) I'm also open to a more Pythonic way of parsing the datetime, sender, and text. My code raises an error because not every line matches all the patterns (lines that do match are parsed correctly into date, sender, and text):

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-efbcb430243d> in <module>()
      3     for line in infile:
      4         date = datetime.strptime(
----> 5             re.search('(?<=\[)[^]]+(?=\])', line).group(),
      6             '%d/%m/%Y, %H:%M:%S')
      7         sender = re.search('(?<=\] )[^]]+(?=\:)', line).group()

AttributeError: 'NoneType' object has no attribute 'group'

Idea: maybe use try/except and then somehow append the text-only line to the previous one? (That doesn't sound Pythonic.)

Recommended answer

Here is something that should work to append the extra text to the previous line.

This checks whether the regex fails to match, in which case the line is written to the file without a leading newline \n, so it is simply appended to the previous line in the file.

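import re
from datetime import datetime

start = True

with open('chat.txt', "r") as infile, open("Output.txt", "w") as outfile:
    for line in infile:
        # Drop the trailing newline so line breaks are written only where we choose.
        line = line.rstrip('\n')
        time = re.search(r'(?<=\[)[^]]+(?=\])', line)
        # Lazy match, so the sender stops at the first colon after "] ".
        sender = re.search(r'(?<=\] )[^]]+?(?=:)', line)
        if sender and time:
            date = datetime.strptime(
                time.group(),
                '%d/%m/%Y, %H:%M:%S')
            sender = sender.group()
            # str.rsplit takes a literal separator, not a regex, so use re.split.
            text = re.split(r'\].+?: ', line, maxsplit=1)[-1]
            new_line = str(date) + ',' + sender + ',' + text
            # Start every message except the first on a new output line.
            if not start: new_line = '\n' + new_line
            outfile.write(new_line)
        else:
            # No timestamp/sender: append to the previous message's line.
            outfile.write(' ' + line)
        start = False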

It also looks like you weren't writing newlines to the file even when the regex worked, so I added that in too.
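
Since the question also asks for a more Pythonic way to parse the date, sender, and text, here is one possible sketch rather than a definitive implementation. It uses a single compiled regex with named groups and the standard csv module, so message text containing commas or line breaks is quoted correctly. The names MESSAGE_START, parse_chat, and write_csv are made up for this illustration. It assumes every message starts at the beginning of a line in the "[dd/mm/yyyy, HH:MM:SS] Sender: text" format shown in the question and that sender names contain no colon; exports that put an invisible mark before the opening bracket would need the pattern adjusted.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical pattern for the export format shown in the question:
# "[dd/mm/yyyy, HH:MM:SS] Sender: text"
MESSAGE_START = re.compile(
    r'^\[(?P<date>\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2})\] '
    r'(?P<sender>[^:]+): (?P<text>.*)$'
)


def parse_chat(lines):
    """Yield (datetime, sender, text) tuples, merging continuation lines."""
    current = None
    for raw in lines:
        line = raw.rstrip('\n')
        match = MESSAGE_START.match(line)
        if match:
            # A new header closes the message collected so far.
            if current is not None:
                yield tuple(current)
            date = datetime.strptime(match.group('date'), '%d/%m/%Y, %H:%M:%S')
            current = [date, match.group('sender'), match.group('text')]
        elif current is not None:
            # No header: this line continues the previous message.
            current[2] += '\n' + line
        # Lines that appear before the first header are ignored.
    if current is not None:
        yield tuple(current)


def write_csv(messages, outfile):
    """Write the parsed messages as CSV rows with a header line."""
    writer = csv.writer(outfile)
    writer.writerow(['date', 'sender', 'text'])
    for date, sender, text in messages:
        writer.writerow([date.isoformat(sep=' '), sender, text])


# Sample lines taken from the question's example file.
sample = [
    '[04/02/2018, 20:56:55] Name1: Content1.\n',
    'More content.\n',
    '[04/02/2018, 23:24:44] Name2: Content2.\n',
]

buffer = io.StringIO()
write_csv(parse_chat(sample), buffer)
print(buffer.getvalue())
```

To run this on a real export, open the input file for reading and the output file with newline='' (as the csv module documentation recommends), then call write_csv(parse_chat(infile), outfile). Keeping the continuation text inside the same CSV field, instead of writing it as a separate line, means each message stays one record even when it spans several lines.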
