Debugging encoding problems (R XML)

Problem description

Is there a way to locate an encoding problem within an XML file? I'm trying to parse such a file (let's call it doc) with the XML package in R, but there seems to be a problem with the encoding.

xmlInternalTreeParse(doc, asText=TRUE)
Error: Document labelled UTF-16 but has UTF-8 content.
Error: Input is not proper UTF-8, indicate encoding!
Error: Premature end of data in tag ...

followed by a list of tags in which the data presumably ends prematurely. However, I'm pretty sure that no tag in this document ends prematurely.

Ok, so next try:

doc <- iconv(doc, to="UTF-8")
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...

and again a list of tags follows, along with line numbers. I've checked those lines and can't find any errors.

Another suspicion: the "µ" character that occurs in the document might cause the error. So, next try:

doc <- iconv(doc, to="UTF-8")
doc <- gsub("µ", "micro", doc)
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...

Any other suggestions for debugging?

EDIT: After spending two days trying to fix the error, I still haven't found a solution. However, I think I have narrowed down the possible causes. Here is what I've found:

  • copying the XML string from the source database into a file and saving it as a separate xml file in Notepad++ --> Document labelled UTF-16 but has UTF-8 content.

  • changing <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1") within this file --> no error

  • reading XML string from database via doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], manipulating it with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1" and then trying to parse it with xmlInternalTreeParse(doc) --> Premature end of data in tag...

  • reading the XML string from database as above and then trying to parse it with xmlInternalTreeParse(doc) --> Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x64 0x2E 0x20 Premature end of data in tag... (list of tags follows).

  • reading the XML string from database as above and parsing with xmlInternalTreeParse(doc, encoding="latin1") --> Premature end of data in tag...

  • using doc <- iconv(doc[1,1], to="UTF-8") or to="latin1" before parsing doesn't change anything
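One way to read the byte sequence quoted in the parser error above is to decode it by hand. The sketch below uses only base R; the choice of "latin1" as the source encoding is an assumption suggested by the 0xE4 byte, not something the error message states.

```r
# Bytes quoted in the parser error: 0xE4 0x64 0x2E 0x20
bad_bytes <- as.raw(c(0xE4, 0x64, 0x2E, 0x20))
fragment  <- rawToChar(bad_bytes)

# 0xE4 would open a three-byte UTF-8 sequence, but 0x64 is not a
# continuation byte, so the fragment is not valid UTF-8.
is_utf8 <- validUTF8(fragment)

# Assumption: the bytes are Latin-1. Under that reading, 0xE4 is "ä"
# and the rest is plain ASCII ("d", ".", space).
decoded <- iconv(fragment, from = "latin1", to = "UTF-8")

print(is_utf8)
print(charToRaw(decoded))
```

If the decoded fragment matches text that really occurs in the document, that supports the idea that the content is single-byte encoded while the declaration claims UTF-16.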

I would very much appreciate any suggestions.

Solution

The encoding problem occurred because the encoding of the original XML file did not match the encoding used within the SQL database, where the XML content was stored as longtext. Replacing the encoding specification within the XML string and converting the string solved the problem:

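library(RODBC)  # assumed source of sqlQuery(); myconn is an open connection
library(XML)    # provides xmlInternalTreeParse()
doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE)
doc <- iconv(doc[1,1], to="UTF-8")   # convert from the native encoding to UTF-8
doc <- sub("utf-16", "utf-8", doc)   # make the declared encoding match the content
doc <- xmlInternalTreeParse(doc, asText = TRUE)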
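The two conversion steps above can be wrapped into a somewhat more defensive form. This is only a sketch: the helper name `normalize_xml_encoding` is hypothetical, and the default `from = "latin1"` is an assumption that would need to be checked against the actual database encoding. It differs from the lines above in that it names the source encoding explicitly and matches the declaration regardless of case or quote style.

```r
# Hypothetical helper: re-encode an XML string as UTF-8 and make the
# XML declaration agree with the new encoding.
# Assumption: the input bytes are in `from` (default "latin1").
normalize_xml_encoding <- function(xml_string, from = "latin1") {
  converted <- iconv(xml_string, from = from, to = "UTF-8")
  if (is.na(converted)) {
    stop("iconv() could not convert the input from ", from)
  }
  # Rewrite only the encoding attribute, ignoring case and quote style
  sub("encoding\\s*=\\s*([\"'])utf-16\\1", "encoding=\"utf-8\"",
      converted, ignore.case = TRUE, perl = TRUE)
}

# Small self-made sample: declaration says UTF-16, content holds a Latin-1 byte
sample_xml <- paste0("<?xml version=\"1.0\" encoding=\"UTF-16\"?>",
                     "<note>", rawToChar(as.raw(0xE4)), "</note>")
fixed_xml <- normalize_xml_encoding(sample_xml)
print(validUTF8(fixed_xml))
```

Note that converting a string that is already UTF-8 with `from = "latin1"` would double-encode its non-ASCII characters, so it is worth checking `validUTF8()` on the raw input first.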

Truncation of the XML string during retrieval from the database turned out to be a separate problem. The solution is provided here: How to retrieve a very long XML-string from an SQL database with R?
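A quick way to tell truncation apart from an encoding problem is to check whether the retrieved string still ends with the closing tag of its root element. The function below is a rough sketch with a hypothetical name, `looks_truncated`; it assumes the root element is the first thing after an optional XML declaration (no leading comments or DOCTYPE) and that the root is not self-closing.

```r
# Hypothetical check: TRUE if the string does not end with the closing
# tag of its first element, NA if no opening element is found.
looks_truncated <- function(xml_string) {
  text <- trimws(xml_string)
  # Drop the XML declaration, if present
  body <- sub("^<\\?xml[^>]*\\?>\\s*", "", text, perl = TRUE)
  # Read the name of the first element
  m <- regmatches(body, regexpr("^<[A-Za-z_][A-Za-z0-9_.:-]*", body, perl = TRUE))
  if (length(m) == 0) return(NA)
  closing <- paste0("</", substring(m, 2), ">")
  !endsWith(text, closing)
}

complete_xml  <- "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<root><item>1</item></root>\n"
truncated_xml <- substr(complete_xml, 1, 55)  # cut off inside </item>

print(looks_truncated(complete_xml))
print(looks_truncated(truncated_xml))
```

Comparing `nchar(doc)` with the length of the value stored in the database is another simple check; a string that stops at a suspiciously round length is a hint that the driver cut it off.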
