DOM parser in Arabic


Problem description

I have a problem parsing Arabic letters with a DOM parser: I get weird characters. I've tried switching to different encodings, but that did not fix it.

The full code is at this link: http://test11.host56.com/parser.java

public Document getDomElement(String xml) {
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(
                xml.getBytes("UTF-8")));
        InputSource is = new InputSource(reader);

        DocumentBuilder db = dbf.newDocumentBuilder();

        //InputSource is = new InputSource();
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is);

        return doc;
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // The snippet in the question stops before any catch block;
        // this handler is an assumption added so the method compiles.
        return null;
    }
}

My XML file:

<?xml version="1.0" encoding="UTF-8"?>
<music>
<song>
    <id>1</id>    
    <title>اهلا وسهلا</title>
    <artist>بكم</artist>
    <duration>4:47</duration>
    <thumb_url>http://wtever.png</thumb_url>
</song>
</music>

Solution

You already have the XML as a String, so unless that string already contains the odd characters (that is, it was read in with the wrong encoding), you can avoid the encoding madness here by using a StringReader instead; e.g. instead of:

Reader reader = new InputStreamReader(new ByteArrayInputStream(
   xml.getBytes("UTF-8")));

use:

Reader reader = new StringReader(xml);
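
To make the suggestion concrete, here is a minimal, self-contained sketch of parsing from a StringReader. The class name StringReaderParseDemo and the helper firstTitle are made up for illustration and are not part of the asker's code; the Arabic text is written with Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the question's parser.java.
class StringReaderParseDemo {

    // Parses XML that is already held in a String. The parser reads
    // characters, not bytes, so no charset conversion happens here.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the text content of the first <title> element.
    static String firstTitle(String xml) {
        return parse(xml).getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627</title></song></music>";
        System.out.println(firstTitle(xml));
    }
}
```

If the String itself is correct, the title comes back unchanged. If it still comes back garbled, the damage happened before the parser ever saw the text.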

Edit: now that I see more of the code, it seems the encoding issue has already happened before the XML is parsed, because that part contains:

HttpResponse httpResponse = httpClient.execute(httpPost);
HttpEntity httpEntity = httpResponse.getEntity();
xml = EntityUtils.toString(httpEntity);

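The Javadoc for EntityUtils.toString (http://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/util/EntityUtils.html#toString%28org.apache.http.HttpEntity%29) says: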

The content is converted using the character set from the entity (if any), failing that, "ISO-8859-1" is used.

It seems the server does not send proper encoding information with the entity, so EntityUtils falls back to its default, ISO-8859-1, which is not UTF-8.
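
The effect of such a fallback can be reproduced with the JDK alone. The sketch below is only an illustration of the mechanism, assuming the server really sends UTF-8 bytes; the class and method names are hypothetical and no HttpClient code is involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing what a wrong default charset does to Arabic text.
class MojibakeDemo {

    // Encodes the text as UTF-8 (what the server is assumed to send) and
    // decodes the bytes as ISO-8859-1 (the fallback described in the Javadoc).
    static String decodeWithWrongCharset(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Same bytes, decoded with the charset they were actually written in.
    static String decodeWithRightCharset(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String artist = "\u0628\u0643\u0645"; // the <artist> value from the XML above
        String garbled = decodeWithWrongCharset(artist);
        // Each Arabic letter takes two bytes in UTF-8, and ISO-8859-1 turns
        // every byte into its own character, so the length doubles.
        System.out.println(artist.length() + " -> " + garbled.length());
        System.out.println(decodeWithRightCharset(artist).equals(artist));
    }
}
```

Once the String holds these garbled characters, no setting on the DOM parser can bring the original letters back, which is why the fix has to be applied where the response is decoded.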

Fix: use the variant that takes an explicit default encoding:

xml = EntityUtils.toString(httpEntity, "utf-8");

Here I assume the server sends UTF-8. If the server uses a different encoding, that one should be passed instead of UTF-8. (However, as the XML also declares encoding="UTF-8", I assume this is the case.) If the encoding the server uses is not known, then you are out of luck and can only resort to wild guessing, sorry.
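
As an alternative that the answer above does not mention, the raw bytes can be handed to the XML parser, which then picks the charset from the encoding declaration in the XML prolog. The sketch below uses only the JDK and hypothetical names (ByteStreamParseDemo, firstArtist). With Apache HttpClient the bytes would presumably come from EntityUtils.toByteArray(httpEntity); that call is an assumption about the asker's setup and is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the question's parser.java.
class ByteStreamParseDemo {

    // Parses undecoded bytes; the parser detects the charset itself,
    // using the encoding="..." declaration in the XML prolog.
    static Document parse(byte[] xmlBytes) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new ByteArrayInputStream(xmlBytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the text content of the first <artist> element.
    static String firstArtist(byte[] xmlBytes) {
        return parse(xmlBytes).getElementsByTagName("artist").item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Stand-in for the response body: the XML encoded as UTF-8 bytes.
        byte[] body = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><artist>\u0628\u0643\u0645</artist></song></music>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(firstArtist(body));
    }
}
```

This only helps when the declaration in the document matches the bytes that are actually sent; if the server declares UTF-8 but sends something else, the result is garbled either way.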
