How to recognize a special EOL character when I see it, using Python?

Question

I'm using Python to scrape a set of files that were originally PDFs. After converting them to text, I had a lot of trouble stripping out the line endings. I couldn't figure out what the line separator was. The trouble is, I still don't know.

It's not '\n' and, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by calling my_str.replace(eol, ''), I can remove all of these characters from one of my files.

So my question is open-ended. I'm a bit lost when it comes to Unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it back in? Is there a way I can refer to it as a code, perhaps? I can't get Python to tell me what it actually IS. All I ever see when I print it, or call unicode(special_eol), is the character in its functional use as a newline.

Please help! Thanks, and sorry if I'm missing something obvious.

Recommended answer

To determine what specific character it is, you can use str.encode('unicode_escape') or repr() to get (in Python 2) an ASCII-printable representation of the character:

>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
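For Python 3, the same idea might look like the sketch below. It is only an illustration under some assumptions: the helper name describe_char is hypothetical, and '\u2028' (LINE SEPARATOR) is used purely as a stand-in, since the question never says which character was actually isolated. In Python 3, ascii() plays the role that repr() played in Python 2, and unicodedata.name() gives the official Unicode name where one exists.

```python
import unicodedata


def describe_char(ch):
    """Return (escaped form, code point, Unicode name) for one character.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # unicode_escape yields bytes in Python 3, so decode back to str.
    escaped = ch.encode('unicode_escape').decode('ascii')
    code_point = 'U+%04X' % ord(ch)
    # Control characters have no Unicode name, so supply a fallback.
    name = unicodedata.name(ch, '<unnamed>')
    return escaped, code_point, name


# Stand-in for the isolated separator; the real character may differ.
special_eol = '\u2028'
escaped, code_point, name = describe_char(special_eol)
print(escaped, code_point, name)

# Once the code is known, the character can be written as an escape
# literal instead of being carried around in memory.
sample = 'first line\u2028second line'
cleaned = sample.replace('\u2028', '')
print(ascii(cleaned))
```

Once the escaped form is known, it can be pasted straight into source code as a string literal, which answers the "refer to it as a code" part of the question without serializing anything.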
