Parser error using Perl XML::DOM module, "reference to invalid character number"
Question
I am a complete Perl newb, but I am certain that learning Perl will be easier than figuring out how to parse XML in awk. I would like to parse the .sgm files from this dataset:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
This is a collection of about 20,000 Reuters newswire articles from a decade ago, and it is a standard test set for certain types of text processing. To simplify my Perl testing, I grabbed the first few hundred lines from the first file and saved them as test.sgm, to use until my script works correctly on it. It starts out like this:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,...
I used a Perl script from http://www.xml.com/pub/a/2001/05/16/perlxml.html as an example, and ended up with this, extract.pl:
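use strict;
use warnings;
use XML::DOM;

# Path to the .sgm file, taken from the command line
my $file = $ARGV[0];
my $parser = XML::DOM::Parser->new();
# parsefile() dies with a parser error if the input is not well-formed XML
my $doc = $parser->parsefile($file);
#print $doc->getElementsByTagName('DATE');
print "\n";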
I get this output:
> perl extract.pl test.sgm
reference to invalid character number at line 11, column 0, byte 343 at /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi/XML/Parser.pm line 187
>
Google doesn't help (the top hit appears to be a page that is experiencing the same error I am) and my Perl hacker friend is still hung over from Blackhat in Vegas. Any ideas what I'm doing wrong, or how I can clean the file? I assume the badness is happening inside that "Unknown" tag, which I don't even need. I really just want to extract the text from every article. If you need more info please let me know.
Recommended answer
The numeric character reference the parser is complaining about is not legal in XML documents, because it refers to a character that XML does not allow. I refer you to section 4.1, Character and Entity References, in the XML recommendation:
Characters referred to using character references MUST match the production for Char.
Now if we follow the link and look at the production for Char:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
we see that there are some characters that can appear neither literally, nor as a numeric character reference in a valid XML Document.
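To make the Char production concrete, here is a minimal sketch in Perl that checks a code point against the ranges quoted above. The function name is_xml_char and the sample code points are made up for illustration; they are not part of XML::DOM or XML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point matches the Char production
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
sub is_xml_char {
    my ($cp) = @_;
    return 1 if $cp == 0x9 || $cp == 0xA || $cp == 0xD;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# Sample code points chosen for illustration only.
for my $cp (0x5, 0x9, 0x1F, 0x20, 0xFFFE) {
    printf "U+%04X %s\n", $cp, is_xml_char($cp) ? "allowed" : "not allowed";
}
```

Apart from tab, line feed and carriage return, every code point below #x20 fails this check, which is why a reference to an ASCII control character is rejected by the parser.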
An oddity that; I've learned something about XML today :).
See this conversation on ASCII control characters in XML for a possible workaround.
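One way to clean the file, sketched here under assumptions rather than taken from the linked conversation, is to drop every numeric character reference whose code point falls outside the Char production before handing the text to the parser. The helper names strip_illegal_char_refs and is_xml_char are hypothetical, and the sample string is invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point matches the XML 1.0 Char production.
sub is_xml_char {
    my ($cp) = @_;
    return 1 if $cp == 0x9 || $cp == 0xA || $cp == 0xD;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# Hypothetical helper (not part of XML::DOM): removes decimal (&#N;) and
# hexadecimal (&#xN;) character references that point at illegal characters,
# and leaves every other reference untouched.
sub strip_illegal_char_refs {
    my ($text) = @_;
    $text =~ s{(&#(?:x([0-9A-Fa-f]+)|([0-9]+));)}{
        my $cp = defined $2 ? hex($2) : $3;
        is_xml_char($cp) ? $1 : '';
    }ge;
    return $text;
}

# Invented sample input, for illustration only.
my $dirty = "<UNKNOWN>&#5;&#5;C T &#65; &#x1F;</UNKNOWN>";
print strip_illegal_char_refs($dirty), "\n";
```

The cleaned text could then be passed to the parser as a string with $parser->parse($clean) instead of parsefile, since XML::Parser's parse method accepts a string holding the whole document. This only addresses the character-reference error; whether the rest of the .sgm file is well-formed XML is a separate question.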