Dealing with Windows line-endings in Python
Question
I've got a 700MB XML file coming from a Windows provider.
As one might expect, the line endings are '\r\n' (or ^M in vi). What is the most efficient way to deal with this situation, aside from getting the supplier to send over '\n'? :-)
- Use os.linesep
- Use rstrip() (requiring opening the file ... which seems crazy)
- Using Universal newline support is not standard on my Mac Snow Leopard - so isn't an option.
I'm open to anything that requires Python 2.6+ but it needs to work on Snow Leopard and Ubuntu 9.10 with minimal external requirements. I don't mind a small performance penalty but I am looking for the standard best way to deal with this.
----edit----
The line endings are in the middle of the tag descriptors, otherwise they wouldn't be such a problem. I know this is bad form and that they shouldn't be sending this to me, but this is how I have the file and the vendor is mostly incompetent.
Recommended answer
Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>
""".
I see no \r\n here. Perhaps you mean repr(xml) contains things like
"<ParentRedirec\r\ntSequenceID>"
If not, try to say precisely what you mean, with repr-fashion examples.
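One way to produce such a repr-fashion example is to read a small slice of the file in binary mode, so no newline translation can hide what is actually on disk. This is only a sketch: the sample bytes and the temporary file are hypothetical stand-ins for the vendor's XML.

```python
import os
import tempfile

# Hypothetical sample standing in for the vendor's file (an assumption).
sample = b'<ParentRedirec\r\ntSequenceID>42</ParentRedirec\r\ntSequenceID>\r\n'

fd, path = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(sample)
    # Binary mode performs no newline translation, so repr() shows
    # the exact bytes, including any \r\n inside a tag.
    with open(path, "rb") as f:
        head = f.read(32)
    print(repr(head))
finally:
    os.remove(path)
```

Pasting the printed repr into the question removes any doubt about where the \r\n really sits.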
The following should work:
>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>
If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, this will fix only the first. Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
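Both alternatives might look roughly like the sketch below. The function names are made up for illustration, and the code assumes that no '>' appears inside attribute values, comments or CDATA sections, and that the break should simply be deleted (as in the regexp above) rather than replaced with a space. It also works on a string already in memory, so a 700MB file may need to be processed in pieces, which this sketch does not attempt.

```python
import re

# Matches one whole tag: "<", then anything except ">", then ">".
TAG = re.compile(r"<[^>]*>")

def strip_breaks_in_tags(text):
    """Alternative (2): remove every CRLF that falls inside a tag."""
    # The callback receives each matched tag and strips all CRLFs from it;
    # CRLFs between tags are never matched, so they are left alone.
    return TAG.sub(lambda m: m.group(0).replace("\r\n", ""), text)

def strip_breaks_by_looping(text):
    """Alternative (1): reapply the original regexp until nothing changes."""
    pattern = re.compile(r"(<[^>]*)\r\n([^>]*>)")
    while True:
        shorter = pattern.sub(r"\1\2", text)
        if shorter == text:
            return text
        text = shorter

guff = '<atag>\r\n<foo\r\nbar\r\nzot c="2">'
print(repr(strip_breaks_in_tags(guff)))
print(repr(strip_breaks_by_looping(guff)))
```

The callback version makes a single pass over the text, while the loop needs one pass per line break in the worst tag plus a final pass to detect that nothing changed.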