Stripping out unwanted characters that are breaking readline()


Question

I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).

I ran into one weird problem: a few of the files I'm parsing have stray characters in the middle of a line, which ruins my parsing of what readline() returns. Viewed in a text editor, the line in question looks normal, but readline() reads an '=' and two '\n' characters right in the middle of an IP.

For example:

Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"

Broken readline() return:
"IP Address: xxx.xxx.xxx="

The next two lines after that being:
""
".xxx"

Any idea how I could get around this? I have no control over whatever is causing it; I just need to deal with it without getting too crazy.

Relevant function, for reference (I know it's a mess):

import codecs
import re

def getIP(em):
    ip_pattern = r'[0-9]+(?:\.[0-9]+){3}'
    # The with block closes the file on every return path.
    with codecs.open(em, encoding='latin1') as ce:
        # Skip ahead to the line that precedes the IP.
        iplabel = ""
        while not ("Torrent Hash Value: " in iplabel):
            iplabel = ce.readline()

        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = re.findall(ip_pattern, ipraw)
        if ip:
            return ip[0]

        # The IP is sometimes one line further down.
        ipraw = ce.readline()
        ip = re.findall(ip_pattern, ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw

Recommended answer

It seems likely that at least some of the emails that you are processing have been encoded as quoted-printable.

This encoding is used to make 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also limits encoded lines to at most 76 characters. Longer lines are wrapped by inserting a soft line break, which consists of "=" followed by the end-of-line marker. That is the "=" and the line break you are seeing in the middle of the IP.
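
To make the effect concrete, here is a minimal sketch of a soft line break splitting an IP address, and of decoding putting it back together. The sample bytes are made up for illustration (192.0.2.17 is a documentation address), not taken from your files.

```python
import quopri

# Hypothetical excerpt: a quoted-printable soft line break ("=" + newline)
# falls in the middle of an IP address.
raw = b"IP Address: 192.0.2=\n.17\n"

# Reading line by line treats the soft break as a real line ending.
broken_lines = raw.decode('latin-1').splitlines()
print(broken_lines)

# Decoding from quoted-printable removes "=" + newline and rejoins the line.
fixed = quopri.decodestring(raw).decode('latin-1')
print(fixed)
```

The first print shows the address cut in two, much like your broken readline() output; the second shows it on one line again.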

Python provides the quopri module to handle encoding and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.

As an example, let's use the first paragraph of your question.

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).

To decode correctly, the entire message body needs to be processed, which conflicts with your approach using readline. One way around this is to load the decoded string into a buffer:

import io
import quopri

def getIP(em):
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    ce = io.StringIO(decoded)
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
        ...
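
For what it's worth, a fuller version of that idea might look something like the sketch below. It is only a sketch under assumptions: the helper name find_ip_after_label, the max_lines limit and the sample bytes are all made up for illustration, and it takes the raw bytes rather than a file name so that it is easy to try out. It also assumes the whole input is quoted-printable; running quopri over headers that contain "=" followed by hex digits could corrupt them.

```python
import io
import quopri
import re

IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def find_ip_after_label(raw, label="Torrent Hash Value: ", max_lines=3):
    """Return the first IPv4-looking string shortly after `label`, or None.

    Hypothetical helper: `raw` is the quoted-printable content as bytes.
    """
    # Decode first so that soft line breaks are gone before splitting lines.
    decoded = quopri.decodestring(raw).decode('latin-1')
    lines = io.StringIO(decoded)

    # Skip ahead to the label line; give up if the input ends first.
    for line in lines:
        if label in line:
            break
    else:
        return None

    # The IP may be on the next line or a line or two further down.
    for _ in range(max_lines):
        match = IP_PATTERN.search(lines.readline())
        if match:
            return match.group(0)
    return None

# Made-up sample with a soft line break inside the IP.
sample = (b"Torrent Hash Value: abc123\n"
          b"File Size: 42\n"
          b"IP Address: 192.0.2=\n"
          b".17\n")
print(find_ip_after_label(sample))
```

Returning None when the label is missing also avoids looping forever on a file that never contains it, since readline() keeps returning an empty string at end of file.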

If your files contain complete emails, including headers, then using the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
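
If some of the notices are multipart messages, get_content() on the top-level message will not work directly, so a slightly more defensive variant may help. This is a sketch under assumptions: the helper name plain_text_body and the sample message are invented for illustration, and it parses bytes so that the file's own charset declaration is used.

```python
from email import message_from_bytes, policy

def plain_text_body(raw):
    """Return the decoded text/plain body of a raw email, or None.

    Hypothetical helper: `raw` is the complete message, headers included.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() picks the best matching part, also inside multipart mail.
    part = msg.get_body(preferencelist=('plain',))
    if part is None:
        return None
    # get_content() undoes the transfer encoding and the charset.
    return part.get_content()

# Made-up message with a soft line break inside the IP.
raw = (b"From: sender@example.com\n"
       b"Subject: Notice\n"
       b"MIME-Version: 1.0\n"
       b'Content-Type: text/plain; charset="iso-8859-1"\n'
       b"Content-Transfer-Encoding: quoted-printable\n"
       b"\n"
       b"IP Address: 192.0.2=\n"
       b".17\n")
body = plain_text_body(raw)
print(body)
```

With a real file you would read it with open(path, 'rb') and pass the bytes in; the returned string can then be wrapped in io.StringIO if you want to keep your readline-based parsing.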
