DOM parser in Arabic
Problem description
I have a problem parsing Arabic letters with the DOM parser: I get weird characters. I've tried switching to a different encoding, but I couldn't get it to work.
The full code is at this link: http://test11.host56.com/parser.java
    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(
                    xml.getBytes("UTF-8")));
            InputSource is = new InputSource(reader);
            DocumentBuilder db = dbf.newDocumentBuilder();
            //InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
            return doc;
        }
    }
My XML file:
    <?xml version="1.0" encoding="UTF-8"?>
    <music>
        <song>
            <id>1</id>
            <title>اهلا وسهلا</title>
            <artist>بكم</artist>
            <duration>4:47</duration>
            <thumb_url>http://wtever.png</thumb_url>
        </song>
    </music>
Solution
You already have the XML as a String, so unless that string already contains the odd characters (that is, it was read in with the wrong encoding), you can avoid the encoding madness here by using a StringReader instead. For example, instead of:
    Reader reader = new InputStreamReader(new ByteArrayInputStream(
            xml.getBytes("UTF-8")));
use:
    Reader reader = new StringReader(xml);
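To make the StringReader suggestion concrete, here is a minimal, self-contained sketch. The class name StringReaderParseDemo and the helpers parse and firstTitle are made up for illustration and are not part of the original code. The Arabic text is written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class StringReaderParseDemo {

    // Parses XML that is already held in a String. The parser reads
    // characters, not bytes, so no byte encoding is involved here.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            InputSource is = new InputSource(new StringReader(xml));
            return db.parse(is);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: returns the text of the first <title> element.
    static String firstTitle(String xml) {
        Document doc = parse(xml);
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) {
        // "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627" is the title from the question.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><id>1</id>"
                + "<title>\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627</title>"
                + "</song></music>";
        System.out.println(firstTitle(xml));
    }
}
```

If the String handed to parse is already garbled, this will faithfully return the garbled text, which is a hint that the damage happened earlier.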
Edit: now that I see more of the code, it seems the encoding issue has already happened before the XML is parsed, because that part contains:
    HttpResponse httpResponse = httpClient.execute(httpPost);
    HttpEntity httpEntity = httpResponse.getEntity();
    xml = EntityUtils.toString(httpEntity);
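The Javadoc for EntityUtils.toString (http://hc.apache.org/httpcomponents-core-ga/httpcore/apidocs/org/apache/http/util/EntityUtils.html#toString%28org.apache.http.HttpEntity%29) says:
> The content is converted using the character set from the entity (if any), failing that, "ISO-8859-1" is used.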
It seems the server does not send the proper encoding information with the entity, so EntityUtils falls back to its default (ISO-8859-1), which is not UTF-8.
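The effect of such a fallback can be reproduced with the JDK alone. The sketch below does not use HttpClient. It simply takes UTF-8 bytes (standing in for the response body) and decodes them once as ISO-8859-1 and once as UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class CharsetFallbackDemo {

    // Decodes the bytes the way a client does when it falls back to ISO-8859-1.
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes with the charset they were actually written in.
    static String decodeAsUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u0628\u0643\u0645" is the artist name from the question's XML.
        String original = "\u0628\u0643\u0645";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String wrong = decodeAsLatin1(body);
        String right = decodeAsUtf8(body);

        // Each of these Arabic letters takes two bytes in UTF-8, and ISO-8859-1
        // turns every byte into one character, so the text doubles in length.
        System.out.println(wrong.length());          // 6
        System.out.println(right.equals(original));  // true
    }
}
```

The six characters in the wrongly decoded string are the "weird characters": every byte of the UTF-8 sequence shows up as its own Latin-1 character.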
Fix: use the variant that takes an explicit default encoding:
    xml = EntityUtils.toString(httpEntity, "utf-8");
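What "explicit default encoding" means in practice can be sketched with plain JDK classes. This is not the EntityUtils implementation, only an illustration of the same idea: use the charset announced in the Content-Type header if there is one, otherwise the fallback chosen by the caller. The class name and the helpers resolveCharset and decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ExplicitDefaultCharsetDemo {

    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/xml; charset=UTF-8", or the fallback if none is usable.
    static Charset resolveCharset(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback.
                        return fallback;
                    }
                }
            }
        }
        return fallback;
    }

    // Decodes a response body with the announced charset or the given fallback.
    static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, resolveCharset(contentType, fallback));
    }

    public static void main(String[] args) {
        byte[] body = "\u0628\u0643\u0645".getBytes(StandardCharsets.UTF_8);

        // Assumed scenario: the server sends "text/xml" without a charset parameter.
        String text = decode(body, "text/xml", StandardCharsets.UTF_8);
        System.out.println(text.equals("\u0628\u0643\u0645")); // true
    }
}
```

With ISO-8859-1 as the fallback, the same call would return the garbled text, which matches the behaviour described in the Javadoc quote.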
Here I assume the server sends UTF-8. If the server uses a different encoding, that one should be passed instead of UTF-8. (However, as the XML also declares encoding="UTF-8", I think this is the case.) If the encoding the server uses is not known, then you are out of luck and can only resort to wild guessing, sorry.