How to handle special UTF-8 characters inside an XML file in MATLAB

Published: 2017/8/17 1:50:25  Tags: xml matlab parsing encoding utf-8

Problem description

I have several XML files to process. A sample file is given below:
    <DOC>
    <DOCNO>2431.eng</DOCNO>
    <TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
    <DESCRIPTION>view of several pools with steaming water; people, houses and
    trees behind it, and a mountain range in the distant background;</DESCRIPTION>
    <NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to
    "Baños  del Inca" (baths of the Inka) with the arrival of the Spaniards .
    Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
    <LOCATION>Cajamarca, Peru</LOCATION>
    </DOC>
While using the MATLAB function xmlread(), I get the following error:
    [Fatal Error] 2431.eng:3:29: Invalid byte 2 of 4-byte UTF-8 sequence.
    ??? Java exception occurred:
    org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

    Error in ==> xmlread at 98
    parseResult = p.parse(fileName);
Any suggestions on how to get around this problem?
Solution
The sample you posted works just fine.

As the error message says, I think your actual files are incorrectly encoded. Remember that not all possible byte sequences are valid UTF-8 sequences: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

A quick way to check is to open the file in Firefox. If the XML file has encoding problems, you'll see an error message like:

  XML Parsing Error: not well-formed
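
A similar check can be sketched inside MATLAB itself. The idea is a round trip: decode the raw bytes as UTF-8, encode the result again, and compare with the original bytes. This is only a sketch. It assumes that native2unicode does not preserve undecodable bytes (so an invalid sequence cannot survive the round trip), and the names roundTrips and toCol are hypothetical helpers, not part of MATLAB. The two byte vectors stand in for file contents.

```matlab
% Bytes of "Baños" as an ISO-8859-1 file would store them (0xF1 is ñ).
latin1Bytes = uint8([66 97 241 111 115]);
% The same word correctly encoded as UTF-8 (0xC3 0xB1 is ñ).
utf8Bytes = uint8([66 97 195 177 111 115]);

% Hypothetical helpers: compare bytes before and after decode -> encode.
toCol = @(v) v(:);
roundTrips = @(b) isequal(toCol(b), ...
    toCol(unicode2native(native2unicode(b, 'UTF-8'), 'UTF-8')));

latin1IsValidUtf8 = roundTrips(latin1Bytes);   % expected to fail the round trip
utf8IsValidUtf8   = roundTrips(utf8Bytes);     % expected to pass

% For a real file, the bytes could be read like this instead:
%   fid = fopen('2431.eng', 'r');
%   b = fread(fid, Inf, 'uint8=>uint8')';
%   fclose(fid);
```

If the round trip fails for a file, its bytes are probably not UTF-8, which matches what the parser error suggests.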




EDIT:

So I took a look at the file: Your problem is that XML parsers treat files without the <?xml ... ?> declaration line as UTF-8, but your file looks to be encoded as ISO-8859-1 (Latin 1) or Windows-1252 (CP-1252) instead.

For instance, the SAX parser choked on the following token: Baños. The character ñ ("n with tilde", U+00F1) has a different representation in the two encodings:


in ISO-8859-1, it is represented as one byte:  0xF1
in UTF-8, it is represented as two bytes: 0xC3 0xB1
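
The two representations can be reproduced with unicode2native. This is a small illustrative sketch; the variable names are arbitrary, and the character is written as char(241) so the example does not depend on how the script file itself is encoded.

```matlab
enye = char(241);                              % U+00F1, the letter ñ

latin1 = unicode2native(enye, 'ISO-8859-1');   % bytes in ISO-8859-1
utf8   = unicode2native(enye, 'UTF-8');        % bytes in UTF-8

% Print each byte in hexadecimal.
fprintf('ISO-8859-1: %s\n', sprintf('0x%02X ', latin1));
fprintf('UTF-8:      %s\n', sprintf('0x%02X ', utf8));
```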


While UTF-8 was designed to be backward compatible with ASCII, that compatibility covers only the 7-bit range 0x00-0x7F. The character ñ falls into the "extended ASCII" range (0x80-0xFF), and every non-ASCII character is represented as two or more bytes in UTF-8.

So when the substring ño, stored in Latin-1 as 11110001 01101111, is interpreted as UTF-8, the parser sees the first byte and recognizes it as the beginning of a 4-byte UTF-8 sequence of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The second byte, 01101111, does not start with 10, so it cannot be a continuation byte, and an error is thrown:

  org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
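
The bit patterns behind that message can be checked by hand. The sketch below only mimics the two tests a UTF-8 decoder applies to these bytes; it is not how Xerces is implemented, and the variable names are illustrative.

```matlab
bytes = uint8([241 111]);                 % "ño" in ISO-8859-1: 0xF1 0x6F
bits  = dec2bin(double(bytes), 8);        % one row of 8 bits per byte

% A lead byte of the form 11110xxx announces a 4-byte sequence.
% Mask 11111000 (248) must leave 11110000 (240).
isFourByteLead = bitand(bytes(1), uint8(248)) == 240;

% Every following byte must have the form 10xxxxxx.
% Mask 11000000 (192) must leave 10000000 (128).
isContinuation = bitand(bytes(2), uint8(192)) == 128;

disp(bits);
fprintf('lead byte of 4-byte sequence: %d\n', isFourByteLead);
fprintf('second byte is continuation:  %d\n', isContinuation);
```

The first test succeeds and the second fails, which is the "invalid byte 2 of 4-byte sequence" situation.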
The bottom line: always use an XML declaration! In your case, add the following line at the beginning of all your files:
<?xml version="1.0" encoding="ISO-8859-1"?>
or better yet, modify the program that generates these files to write the said line.
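
If the generating program cannot be changed, the declaration could be added to existing files with a small script. The function below is a minimal sketch, not a tested tool: addXmlDeclaration is a hypothetical name, it assumes the files really are ISO-8859-1, and it overwrites each file in place, so it is safer to run it on copies first. It copies the original bytes untouched and only prepends the declaration.

```matlab
function addXmlDeclaration(filename)
% ADDXMLDECLARATION  Hypothetical helper: prepend an ISO-8859-1 XML
% declaration to a file unless it already starts with one.
    decl = '<?xml version="1.0" encoding="ISO-8859-1"?>';

    fid = fopen(filename, 'r');
    if fid == -1
        error('Cannot open %s for reading.', filename);
    end
    bytes = fread(fid, Inf, 'uint8=>uint8')';   % raw bytes, no decoding
    fclose(fid);

    % Leave files alone that already begin with an XML declaration.
    if numel(bytes) >= 5 && strcmp(char(bytes(1:5)), '<?xml')
        return
    end

    fid = fopen(filename, 'w');
    if fid == -1
        error('Cannot open %s for writing.', filename);
    end
    % Declaration, a newline (byte 10), then the original bytes.
    fwrite(fid, [uint8(decl) uint8(10) bytes], 'uint8');
    fclose(fid);
end
```

To process a whole folder, one could loop over dir('*.eng') and call addXmlDeclaration(files(k).name) for each entry; the *.eng pattern is an assumption based on the file name in the question.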

After this change, MATLAB (or really Java) should be able to read the XML file correctly:
>> doc = xmlread('2431.eng');
>> doc.saveXML([])
ans =
<?xml version="1.0" encoding="UTF-16"?>
<DOC>
<DOCNO>annotations/02/2431.eng</DOCNO>
<TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
<DESCRIPTION>view of several pools with steaming water; people, houses and trees behind it, and a mountain range in the distant background;</DESCRIPTION>
<NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to "Baños del Inca" (baths of the Inka) with the arrival of the Spaniards . Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
<LOCATION>Cajamarca, Peru</LOCATION>
<DATE>October 2002</DATE>
<IMAGE>images/02/2431.jpg</IMAGE>
<THUMBNAIL>thumbnails/02/2431.jpg</THUMBNAIL>
</DOC>
(Note: Apparently once MATLAB reads the file, it internally re-encodes it as UTF-16)
