How to recognize a special EOL character when I see it, using Python?
Question
I'm scraping a set of files that were originally PDFs, using Python. Having converted them to text, I had a lot of trouble getting the line endings out. I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not a '\n', nor, I think, a '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by calling my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see when I print it, or call unicode(special_eol), is the character in its functional usage as a newline.
Please help! Thanks, and sorry if I'm missing something obvious.
Recommended answer
To determine what specific character that is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:
>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
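On Python 3 the same idea might be sketched as follows. This is only an illustration under stated assumptions: the helper name describe_chars and the sample string are made up, and the two separators in the sample (U+2028 and form feed) are guesses at what PDF-to-text output could contain, not the actual character from the question. The sketch lists every distinct character that str.splitlines() treats as a line boundary, together with its escaped form, its code point, and its Unicode name.

```python
import unicodedata


def describe_chars(text):
    """Return (escaped form, code point, Unicode name) for each distinct
    character in text that str.splitlines() treats as a line boundary."""
    seen = []
    for ch in text:
        # A character is a line boundary if it splits "a<ch>b" into two lines.
        if ch not in seen and len(("a" + ch + "b").splitlines()) > 1:
            seen.append(ch)
    return [
        (
            ch.encode("unicode_escape").decode("ascii"),  # e.g. \u2028
            "U+%04X" % ord(ch),                           # e.g. U+2028
            # Control characters have no Unicode name, so supply a default.
            unicodedata.name(ch, "<unnamed control character>"),
        )
        for ch in seen
    ]


# Hypothetical sample: the real separator in the question is unknown.
sample = "first line\u2028second line\x0cthird line"
for escaped, code_point, name in describe_chars(sample):
    print(escaped, code_point, name)
```

Once the escaped form is known, the character can be written directly in source code, for example my_str.replace('\u2028', ''), instead of being kept around in memory.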