Use lxml to parse text file with bad header in Python

Question

I would like to parse text files (stored locally) with lxml's etree. But all of my files (thousands of them) have headers, such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
<ACCEPTANCE-DATETIME>20070913115715
ACCESSION NUMBER:       0001193125-07-200376
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      7
CONFORMED PERIOD OF REPORT: 20070630
FILED AS OF DATE:       20070913
DATE AS OF CHANGE:      20070913

and the first < isn't until line 51 in this case (and it isn't line 51 in every file). The XML portion starts as follows:

</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

Can I handle this on the fly with lxml? Or should I use a stream editor to strip each file's header? Thanks!

Here is my current code and the error it produces.

from lxml import etree
f = etree.parse('temp.txt')

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

FWIW, here is a link to the file.

Recommended answer

Given that there's a standard for these files, it's possible to write a proper parser rather than guessing at things, or hoping beautifulsoup gets things right. That doesn't mean it's the best answer for you, but it's certainly worth looking at.

According to the standard at http://www.sec.gov/info/edgar/pdsdissemspec910.pdf, what you've got (inside the PEM enclosure) is an SGML document defined by the provided DTD. So, first go to pages 48-55, extract the text there, and save it as, say, "edgar.dtd".
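To make the "inside the PEM enclosure" part concrete, here is a minimal sketch of peeling off the wrapper before handing the rest to an SGML tool. It assumes, based on the sample in the question, that the PEM header fields end at the first blank line, and that the file may end with a matching END marker. The name strip_pem_enclosure is hypothetical and not part of lxml or any other library.

```python
PEM_BEGIN = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
PEM_END = "-----END PRIVACY-ENHANCED MESSAGE-----"


def strip_pem_enclosure(text):
    """Return the SGML payload of a filing without its PEM wrapper.

    Assumption: the PEM header fields end at the first blank line, as in
    the sample from the question. Text that does not start with the PEM
    BEGIN marker is returned unchanged.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != PEM_BEGIN:
        return text
    # The payload starts right after the first blank line.
    for i, line in enumerate(lines):
        if not line.strip():
            body = lines[i + 1:]
            break
    else:
        return text  # no blank line found: leave the input untouched
    # Drop trailing blank lines, then the END marker if there is one.
    while body and not body[-1].strip():
        body.pop()
    if body and body[-1].strip() == PEM_END:
        body.pop()
    return "\n".join(body)
```

Note that the result is still SGML with unclosed tags such as <TYPE>, so a strict XML parser may still reject it. Removing the wrapper only gets rid of the "Start tag expected" failure on line 1.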

The first thing I'd do is install SP and use its tools to check that the documents really are valid and parseable with that DTD, so you don't waste a bunch of time on something that isn't going to pan out.

Python 2 comes with an SGML parser, sgmllib. Unfortunately, it was never a complete or validating SGML parser (it handles only as much SGML as HTML uses), and it's deprecated as of 2.6 (and removed in 3.x). But that doesn't mean it won't work. So, try it and see whether it copes with your files.

If not, I don't know of any good alternatives in Python; most of the SGML code out there is in C, C++, or Perl. But you can wrap any C or C++ library (I'd start with SP) pretty easily, as long as you're comfortable writing your own wrapper in C/Cython/boost-python/whatever or using ctypes. You only need to wrap the top-level functions, not build a complete set of bindings. But if you've never done anything like this before, it's probably not the best time to learn.

Alternatively, you can wrap a command-line tool. SP comes with nsgmls. There's another good tool written in Perl with the same name (I think it's part of http://savannah.nongnu.org/projects/perlsgml/ but I'm not positive). And there are dozens of other tools.
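As a rough sketch of what wrapping nsgmls could look like: the code below runs the tool with subprocess and reads its line-oriented ESIS output, where each line starts with a one-character command. It assumes an nsgmls binary is on PATH and that the input already carries whatever DOCTYPE declaration the DTD needs. Only a handful of ESIS commands are handled, and run_nsgmls and parse_esis are hypothetical helper names, not part of SP.

```python
import subprocess


def run_nsgmls(sgml_path):
    """Run nsgmls on one file; return (ESIS output, error messages).

    Assumes an `nsgmls` executable is installed and on PATH.
    """
    proc = subprocess.run(
        ["nsgmls", sgml_path], capture_output=True, text=True
    )
    return proc.stdout, proc.stderr


def parse_esis(esis_text):
    """Turn ESIS lines into (event, value) tuples.

    Handles only a small subset of commands:
      "(" start of element, ")" end of element,
      "-" character data,   "A" attribute of the NEXT start element.
    Everything else (for example "C", emitted for a conforming
    document) is ignored. Escape sequences in data are left as-is.
    """
    events = []
    for line in esis_text.splitlines():
        if not line:
            continue
        cmd, rest = line[0], line[1:]
        if cmd == "(":
            events.append(("start", rest))
        elif cmd == ")":
            events.append(("end", rest))
        elif cmd == "-":
            events.append(("data", rest))
        elif cmd == "A":
            # rest is "NAME TYPE VALUE"; keep type and value together.
            name, _, value = rest.partition(" ")
            events.append(("attr", (name, value)))
    return events
```

A caller would feed the first element of run_nsgmls's result into parse_esis and build whatever tree or records it needs from the event list. Anything nsgmls prints to stderr is worth logging, since that is where validation errors against the DTD show up.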

Or, of course, you could write the whole thing, or just the parsing layer, in Perl (or C++) instead of Python.
