Dealing with Windows line-endings in Python

Question

I've got a 700MB XML file coming from a Windows provider.

As one might expect, the line endings are '\r\n' (or ^M in vi). What is the most efficient way to deal with this situation aside from getting the supplier to send over '\n' :-)

  1. Use os.linesep
  2. Use rstrip() (requiring opening the file ... which seems crazy)
  3. Use universal newline support (not standard on my Mac Snow Leopard, so it isn't an option).

I'm open to anything that requires Python 2.6+ but it needs to work on Snow Leopard and Ubuntu 9.10 with minimal external requirements. I don't mind a small performance penalty but I am looking for the standard best way to deal with this.

----edit----

The line endings are in the middle of the tag descriptors, otherwise they wouldn't be such a problem. I know this is bad form and that they shouldn't be sending this to me, but this is how I have the file and the vendor is mostly incompetent.

Recommended answer

Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>""".

I see no \r\n here. Perhaps you mean repr(xml) contains things like

"<ParentRedirec\r\ntSequenceID>"

If not, try to say precisely what you mean, with repr-fashion examples.
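
One way to produce such a repr-fashion example is to read a small chunk of the file in binary mode, so that no newline translation happens, and look at its repr. This is only a minimal sketch: the helper name `peek`, the chunk size, and the throwaway demo file are illustrative assumptions, not part of the original question.

```python
import os
import tempfile


def peek(path, size=200):
    """Return the repr of the first `size` bytes of the file, read untranslated."""
    # Binary mode ("rb") disables newline translation, so CR LF pairs stay visible.
    with open(path, "rb") as f:
        return repr(f.read(size))


# Demo on a throwaway file containing a tag split by a Windows line ending.
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<ParentRedirec\r\ntSequenceID>")
sample = peek(demo_path)
os.remove(demo_path)
print(sample)
```

If the printed text shows `\r\n` between `ParentRedirec` and `tSequenceID`, the line break really is inside the tag name.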

The following should work:

>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>

If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, this will fix only one of them per pass (the last one, because the first group is greedy). Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
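
Both alternatives can be sketched briefly. The function names `strip_by_looping` and `strip_in_one_pass` are made up for illustration, and both work on a string that is already in memory; a 700MB file would probably need to be processed in pieces, which this sketch does not attempt.

```python
import re

# Pattern from the answer: removes one line break per tag per pass.
ONE_BREAK = re.compile(r"(<[^>]*)\r\n([^>]*>)")
# Matches a whole tag, from "<" up to the next ">".
WHOLE_TAG = re.compile(r"<[^>]*>")


def strip_by_looping(text):
    """Alternative (1): reapply the substitution until nothing changes."""
    while True:
        fixed = ONE_BREAK.sub(r"\1\2", text)
        if fixed == text:
            return text
        text = fixed


def strip_in_one_pass(text):
    """Alternative (2): match each whole tag and delete every CR LF inside it."""
    return WHOLE_TAG.sub(lambda m: m.group(0).replace("\r\n", ""), text)


guff = "<atag>\r\n<foo\r\nbar\r\nzot>"
print(repr(strip_by_looping(guff)))
print(repr(strip_in_one_pass(guff)))
```

In both versions the line break between the two tags is left alone; only breaks between a `<` and its closing `>` are removed.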
