Why is XmlParser converting my character hex code string to unicode?


Problem description

    In my Grails application I use Groovy's XmlParser to parse an XML file. The value of one of the attributes in my XML file is a string that equals a character hex code. I want to save that string in my database:

    &#xD1;

    Unfortunately the attribute method returns the Ñ character and what actually gets stored in the database is c391. When the field is read back out I also get the Ñ character which is undesired.

    How can I store the hex code as a string in my database and make sure it gets read back out as a hex code as well?

    Update #1:

    The reason this is a problem for me is that once I read the XML file into my database I must be able to reconstruct it exactly as it was. An additional problem is that the field in question isn't always a character hex code. It could just be some arbitrary string.

    Update #2:

    I guess it doesn't matter how the character is stored in the database, so long as I can write it back out in its expanded hex code format. I am using Groovy MarkupBuilder to reconstruct my XML file from the database and I am unclear why this isn't happening by default.

    Update #3:

    I overrode getTableTypeString in my custom MySQL dialect and that seems to have helped somewhat. At least now the value I pass to MySQL is the value that gets stored in the database.

    class CustomMySQL5InnoDBDialect extends MySQL5InnoDBDialect {   
        @Override
        public String getTableTypeString() {
            return " ENGINE=InnoDB DEFAULT CHARSET=utf8"
        }
    }
    

    I also created my own version of groovy.util.XmlParser. My version is pretty much an exact duplicate of groovy.util.XmlParser except that in the startElement method I changed:

    String value = list.getValue(i)
    

    to this:

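    // Start from the normalized (expanded) value the parser reports
    def value = list.getValue(i)
    // Raw attribute text as written in the source file (Xerces internal field)
    def raw = list.fAttributes.fAttributes[i].nonNormalizedValue
    // Keep the raw text only when it is exactly one hex character reference
    if(raw ==~ /&#x([0-9A-F]+?);/) {
        value = raw
    }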
    

    This allows the exact text of hex code elements to be stored in the database.

    Now there are two new problems, possibly three.

    1. Recreating a file with the exact values stored in the database. Up till now I had been using MarkupBuilder, but that does extra encoding on ampersands, so the value "&#xD1;" is written out as "&amp;#xD1;". I can probably get around this by abandoning MarkupBuilder and building my XML strings manually, but I would rather not.

    2. Running an XSLT transform on an XML file using the Saxon-HE 9.4 processor causes some hex code values such as &#xFF; to be changed to something like ÿ, yet others like &#x99; are left unchanged.

    3. I'm not sure if this is going to be a problem yet or not, but when I recreate the file I would like it to be in ANSI encoding since that is the encoding used for the original file.

    Solution

    The value of one of the attributes in my XML file is a string that equals a character hex code

    No, it isn't. The representation of the attribute value in the original XML is a hexadecimal character reference, but the value of the attribute is the character Ñ. There are ways to configure some XML parsers to avoid expanding named entity references during parsing, but they must expand numeric character references as per the XML spec.
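To make the distinction concrete, here is a minimal sketch of what the parser hands back. The element and attribute names (item, name) are made up for illustration, and the snippet assumes a Groovy version where XmlParser resolves without an import (on Groovy 4 and later, add import groovy.xml.XmlParser).

```groovy
// Hypothetical one-element document; the attribute is written as a hex character reference.
def node = new XmlParser().parseText('<item name="&#xD1;"/>')
def value = node.attribute('name')

// The parser has already expanded the reference to the single character U+00D1.
assert value == '\u00D1'
assert value.length() == 1

// The six-character source text is gone, so a pattern for the reference no longer matches.
assert !(value ==~ /&#x[0-9A-F]+;/)

// Encoded as UTF-8, that one character is the two bytes C3 91,
// which is the "c391" seen in the database.
assert value.getBytes('UTF-8').encodeHex().toString() == 'c391'
```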

    You haven't said why storing the real character value is a problem. If it's to do with rendering the value to a browser then that can be handled by using .encodeAsHTML() at output time. If you need to save the value to another XML file then use an XML API to do so and it will handle the encoding issues for you, replacing characters with entities or character references where this is required to keep the result well-formed (in the case of Ñ it doesn't need to be escaped anyway unless you're writing XML in an unusual character set).
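As a rough illustration of an XML API handling this on output, the sketch below uses the JDK's built-in DOM and Transformer classes rather than anything from the question, again with made-up element and attribute names. It assumes the default JDK serializer is in use; a different TransformerFactory on the classpath (Saxon, for example) may format the output differently. With US-ASCII as the output encoding, Ñ cannot be written directly, so the serializer has to fall back to a character reference, though not necessarily in the same hex form as the original file.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

// Build a tiny DOM whose attribute holds the real character, not a reference.
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.newDocument()
def item = doc.createElement('item')          // hypothetical element name
item.setAttribute('name', '\u00D1')           // hypothetical attribute name
doc.appendChild(item)

// Serialize with an encoding that cannot represent U+00D1.
def transformer = TransformerFactory.newInstance().newTransformer()
transformer.setOutputProperty(OutputKeys.ENCODING, 'US-ASCII')
def bytes = new ByteArrayOutputStream()
transformer.transform(new DOMSource(doc), new StreamResult(bytes))

// Every byte written is plain ASCII (no byte above 0x7F) ...
assert bytes.toByteArray().every { it >= 0 }

// ... and parsing the result gives back the same character.
def reparsed = builder.parse(new ByteArrayInputStream(bytes.toByteArray()))
assert reparsed.documentElement.getAttribute('name') == '\u00D1'

println new String(bytes.toByteArray(), 'US-ASCII')
```

The round trip preserves the attribute's value, which is what the XML spec guarantees; the exact spelling of the reference in the file is a serializer choice.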

    In the specific case of Groovy's MarkupBuilder you can temporarily escape from XML mode and write hand-constructed markup directly to the output stream using mkp.yieldUnescaped, which would let you output a character reference somewhere the builder wouldn't normally bother.
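A small sketch of that escape hatch follows, with invented element names (items, escaped, raw, item). It contrasts the builder's normal escaping with mkp.yieldUnescaped, and shows a hand-written element as one possible way to get a reference into an attribute value, since yieldUnescaped writes content rather than attributes. Anything passed to yieldUnescaped is emitted as-is, so keeping it well-formed is the caller's job.

```groovy
import groovy.xml.MarkupBuilder

def writer = new StringWriter()
def xmlBuilder = new MarkupBuilder(writer)

xmlBuilder.items {
    // Normal path: the ampersand is escaped, so the reference is lost.
    escaped('&#xD1;')
    // Escape hatch: the text is written verbatim as element content.
    raw { mkp.yieldUnescaped('&#xD1;') }
    // Hand-constructed element, written verbatim, to control an attribute value.
    mkp.yieldUnescaped('<item name="&#xD1;"/>')
}

def xml = writer.toString()
assert xml.contains('<escaped>&amp;#xD1;</escaped>')
assert xml.contains('<raw>&#xD1;</raw>')
assert xml.contains('<item name="&#xD1;"/>')
```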
