Malformed UTF-8 character (fatal) error while parsing XML using XML::LibXML


Problem description

I am parsing XML files using XML::LibXML. For the following XML entry I get this error:

Malformed UTF-8 character (fatal) at C:/Perl64/site/lib/XML/LibXML/Error.pm line 217

$context=~s/[^\t]/ /g;

The entry in the XML file is as follows:

<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">15177811</PMID>
<DateCreated>
<Year>2004</Year>
<Month>06</Month>
<Day>04</Day>
</DateCreated>
<DateCompleted>
<Year>2004</Year>
<Month>08</Month>
<Day>11</Day>
</DateCompleted>
<DateRevised>
<Year>2011</Year>
<Month>04</Month>
<Day>07</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0278-2626</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>55</Volume>
<Issue>2</Issue>
<PubDate>
<Year>2004</Year>
<Month>Jul</Month>
</PubDate>
</JournalIssue>
<Title>Brain and cognition</Title>
<ISOAbbreviation>Brain Cogn</ISOAbbreviation>
</Journal>
<ArticleTitle>Efficiency of orientation channels in the striate cortex for distributed categorization process.</ArticleTitle>
<Pagination>
<MedlinePgn>352-4</MedlinePgn>
</Pagination>
<Affiliation>Cognitive Science Department, Université de Liège, Belgium. mmermillod@ulg.ac.be</Affiliation>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Mermillod</LastName>
<ForeName>Martial</ForeName>
<Initials>M</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Chauvin</LastName>
<ForeName>Alan</ForeName>
<Initials>A</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Guyader</LastName>
<ForeName>Nathalie</ForeName>
<Initials>N</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType>Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>Brain Cogn</MedlineTA>
<NlmUniqueID>8218014</NlmUniqueID>
<ISSNLinking>0278-2626</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<CommentsCorrectionsList>
<CommentsCorrections RefType="ErratumIn">
<RefSource>Brain Cogn. 2005 Jul;58(2):245</RefSource>
</CommentsCorrections>
<CommentsCorrections RefType="RepublishedIn">
<RefSource>Brain Cogn. 2005 Jul;58(2):246-8</RefSource>
<PMID Version="1">16044513</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Neural Networks (Computer)</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Neurons</DescriptorName>
<QualifierName MajorTopicYN="N">physiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Orientation</DescriptorName>
<QualifierName MajorTopicYN="Y">physiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Pattern Recognition, Visual</DescriptorName>
<QualifierName MajorTopicYN="Y">physiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Visual Cortex</DescriptorName>
<QualifierName MajorTopicYN="Y">physiology</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>

What I want out of this entry is PMID, DateRevised, PubDate, ArticleTitle, CommentsCorrectionsList, and MeshHeadingList. If I remove the Affiliation element, which contains non-ASCII characters, the error no longer occurs. How should I fix this error?

Recommended answer

You can either convert the file to the specified encoding (UTF-8), or specify the encoding the file actually uses, for example with the declaration <?xml version="1.0" encoding="cp1252"?>.
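A quick way to check which case applies is to test whether the bytes decode as strict UTF-8. The sketch below is only an illustration, not part of the original answer: it uses Perl's core Encode module, `is_valid_utf8` is a hypothetical helper name, and the sample strings assume that the accented letters in the Affiliation element are stored as cp1252 bytes (é as 0xE9, è as 0xE8).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK flag may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Assumed cp1252 bytes: each accented letter is a single byte above 0x7F.
my $cp1252_bytes = "Universit\xE9 de Li\xE8ge";
# The same text as UTF-8 bytes: each accented letter takes two bytes.
my $utf8_bytes   = "Universit\xC3\xA9 de Li\xC3\xA8ge";

for my $sample ($cp1252_bytes, $utf8_bytes) {
    print is_valid_utf8($sample) ? "valid UTF-8\n" : "not valid UTF-8\n";
}
```

If the check fails for the real file, the file either needs converting or needs an encoding declaration that matches its actual bytes.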

Notepad can be used to convert the file to UTF-8, and so can Perl:

perl -pe"
   BEGIN {
      binmode STDIN,  ':encoding(cp1252)';
      binmode STDOUT, ':encoding(UTF-8)';
   }
" < file.cp1252 > file.UTF-8

(The line breaks were added only for readability; remove them before running the command.)
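The same conversion can also be written as a small script, which avoids shell quoting issues. This is a minimal sketch, assuming the input really is cp1252; `convert_cp1252_to_utf8` is a hypothetical helper name, and the file names are whatever is passed on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src to $dst, decoding cp1252 on input
# and encoding UTF-8 on output.
sub convert_cp1252_to_utf8 {
    my ($src, $dst) = @_;
    open my $in,  '<:encoding(cp1252)', $src or die "Cannot read $src: $!";
    open my $out, '>:encoding(UTF-8)',  $dst or die "Cannot write $dst: $!";
    print {$out} $_ while <$in>;
    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# Run the conversion only when both file names are supplied.
convert_cp1252_to_utf8(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice, it would be run as `perl script.pl file.cp1252 file.UTF-8`. The I/O layers do the same job as the two binmode calls in the one-liner.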
