Characters generated by Apache Commons StringEscapeUtils.unescapeHtml cannot be parsed using StAX

Problem description

I am trying to parse the content of an HTML table and write it to CSV. I am trying the StAX parser. The HTML contains escaped characters like &nbsp; and &amp;.

I am using org.apache.commons.lang3.StringEscapeUtils to unescape the HTML line by line and write it to a new file.

StAX still fails to parse the unescaped characters.

Please help me fix or handle this exception.

I test with the XML fragment below: <root><element>A &nbsp; B &nbsp; </element></root>

I call the code below to unescape the HTML:

   StringEscapeUtils.unescapeHtml4(escapedHtml)

and write it to a file.

I then try to parse that file using the StAX parser. This is the method that writes the file:

public void unescapeHtmlFile(String filePath) throws IOException {
    BufferedReader fileReader = null;
    BufferedWriter fileWriter = null;
    try {
        fileReader = new BufferedReader(new FileReader(filePath));
        fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));

        String line = null;
        String unescapedLine = null;
        while ((line = fileReader.readLine()) != null) {
            System.out.println("Before: " + line);
            unescapedLine = StringEscapeUtils.unescapeHtml4(line);
            System.out.println("After: " + unescapedLine);
            fileWriter.newLine();
            fileWriter.write(unescapedLine);
        }
    } finally {
        fileReader.close();
        fileWriter.close();
    }
}

The output is below:

Document started 
<?xml version="null" encoding='UTF-8' standalone='no'?>
Element started
<root>
Element started
<element0>
Characters
0123456   7890   ABC   DEF
Element ended
</element0>
Element started
<element1>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
    at parser.StreamParserTest.main(StreamParserTest.java:30)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: XML document structures must start and end within the same entity.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
    at parser.StreamParserTest.main(StreamParserTest.java:30)

It fails to parse the unescaped value of &nbsp;. Please help.

Solution

The classes FileReader and FileWriter are old utility classes that unfortunately use the current platform encoding. On Windows that is almost certainly not UTF-8. XML, in general, is in UTF-8 (which can indeed represent all characters), so the parser expects UTF-8 bytes unless the document declares otherwise.
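To see why the parser reports "Invalid byte 1 of 1-byte UTF-8 sequence", it may help to look at the bytes. The sketch below assumes the platform encoding is a single-byte Latin charset; ISO-8859-1 stands in for it here because it is always available. The class name `NbspBytes` and the helper `hex` are made up for illustration. U+00A0 is the character that `unescapeHtml4` produces for `&nbsp;`.

```java
import java.nio.charset.StandardCharsets;

public class NbspBytes {
    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // no-break space, the unescaped form of &nbsp;

        // Single-byte Latin encoding: one byte, 0xA0.
        System.out.println(hex(nbsp.getBytes(StandardCharsets.ISO_8859_1))); // A0

        // UTF-8: two bytes. A lone 0xA0 is only valid as a continuation byte,
        // so a UTF-8 decoder rejects it when it appears on its own.
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8))); // C2 A0
    }
}
```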

fileReader = new BufferedReader(new FileReader(filePath));
fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));

should be

fileReader = new BufferedReader(new InputStreamReader(
        new FileInputStream(filePath), StandardCharsets.UTF_8));
fileWriter = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("./out/UnescapedHtml.html"),
        StandardCharsets.UTF_8));
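The same fix can also be sketched with `java.nio.file.Files` and try-with-resources, which closes both streams even when opening the second one fails. This is one possible shape, not the asker's code: the class `Utf8Rewrite`, the method `rewrite`, and the `transform` parameter are hypothetical, and the lambda in `main` is only a stand-in so the example runs without Commons Lang on the classpath. Unlike the original method, it writes the line separator after each line rather than before it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

public class Utf8Rewrite {
    // Reads `in` as UTF-8, applies `transform` to each line, writes `out` as UTF-8.
    static void rewrite(Path in, Path out, UnaryOperator<String> transform) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("escaped", ".html");
        Path out = Files.createTempFile("unescaped", ".html");
        Files.write(in, "<root><element>A&nbsp;B</element></root>".getBytes(StandardCharsets.UTF_8));

        // Stand-in for StringEscapeUtils::unescapeHtml4 (handles only &nbsp;).
        rewrite(in, out, s -> s.replace("&nbsp;", "\u00A0"));

        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

With Commons Lang available, the call would presumably become `rewrite(Paths.get(filePath), Paths.get("./out/UnescapedHtml.html"), StringEscapeUtils::unescapeHtml4)`.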

To be entirely honest, one should first read the <?xml ...?> declaration and check whether it has an encoding attribute naming the charset; the default is UTF-8. That first read could be done with StandardCharsets.ISO_8859_1, which accepts any byte, whereas a UTF-8 decoder stumbles over wrong multi-byte sequences.
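A rough sketch of that declaration check follows. It assumes the first bytes of the file are already in memory; the class `XmlEncodingSniffer`, the method `detect`, and the regular expression are illustrative only. It ignores byte order marks and UTF-16 input, and `Charset.forName` throws an unchecked exception for a charset name the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {
    // Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the declared charset, or UTF-8 when no encoding is declared.
    static Charset detect(byte[] head) {
        // ISO-8859-1 maps every byte to a character, so this decode cannot fail.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find()) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<root/>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(detect(declared));   // ISO-8859-1
        System.out.println(detect(undeclared)); // UTF-8
    }
}
```

In practice it may be simpler to hand the parser an InputStream instead of a Reader, since an XML parser can then apply its own encoding detection.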

Using StandardCharsets instead of the string "UTF-8" does away with

  1. an UnsupportedEncodingException to handle,
  2. a magic constant.

The StandardCharsets are guaranteed to be supported.
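The difference shows up in any API that accepts either form. A small comparison using the `String` constructor (the class and method names are invented for this example):

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetConstantDemo {
    // Charset given by name: the compiler forces handling of a checked exception.
    static String decodeByName(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for UTF-8, but the catch block is still required.
            throw new IllegalStateException(e);
        }
    }

    // Charset given as a constant: no checked exception, no string literal to mistype.
    static String decodeByConstant(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] nbspUtf8 = {(byte) 0xC2, (byte) 0xA0};
        System.out.println(decodeByName(nbspUtf8).equals(decodeByConstant(nbspUtf8))); // true
    }
}
```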
