Characters generated by Apache Commons StringEscapeUtils.unescapeHtml cannot be parsed using StAX
I am trying to parse the content of an HTML table and write it to CSV.
I am trying the StAX parser.
The HTML contains escaped characters like &nbsp; and &amp;.
I am using org.apache.commons.lang3.StringEscapeUtils to unescape the HTML line by line and write it to a new file.
StAX still fails to parse the unescaped characters.
Please help me fix or handle this exception.
I test with the XML fragment below:
<root><element>A&nbsp;B&nbsp;</element></root>
I call the code below to unescape the HTML:
StringEscapeUtils.unescapeHtml4(escapedHtml)
and write it to a file.
I then try to parse that file using the StAX parser:
public void unescapeHtmlFile(String filePath) throws IOException {
    BufferedReader fileReader = null;
    BufferedWriter fileWriter = null;
    try {
        fileReader = new BufferedReader(new FileReader(filePath));
        fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
        String line = null;
        String unescapedLine = null;
        while ((line = fileReader.readLine()) != null) {
            System.out.println("Before: " + line);
            unescapedLine = StringEscapeUtils.unescapeHtml4(line);
            System.out.println("After: " + unescapedLine);
            fileWriter.newLine();
            fileWriter.write(unescapedLine);
        }
    } finally {
        fileReader.close();
        fileWriter.close();
    }
}
And the output is below:
Document started
<?xml version="null" encoding='UTF-8' standalone='no'?>
Element started
<root>
Element started
<element0>
Characters
0123456 7890 ABC DEF
Element ended
</element0>
Element started
<element1>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: XML document structures must start and end within the same entity.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
It fails to parse the unescaped value of &nbsp;.
Please help.
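The classes FileReader and FileWriter are old utility classes that unfortunately use the current platform encoding, which on Windows (on Java versions before 18) is almost certainly not UTF-8. XML, in general, is in UTF-8 (which can indeed represent all characters). Here unescapeHtml4 turns &nbsp; into the no-break space U+00A0; a single-byte Windows encoding such as Windows-1252 writes it as the lone byte 0xA0, which is not valid UTF-8, hence "Invalid byte 1 of 1-byte UTF-8 sequence".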
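To make the encoding mismatch concrete, here is a minimal sketch of what happens to an unescaped &nbsp; on its way to disk. The class and method names are made up for illustration, and ISO-8859-1 is used only as an assumed stand-in for a single-byte platform encoding (it encodes U+00A0 as the same byte 0xA0 that Windows-1252 does).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's code.
public class NbspEncodingDemo {

    // Strictly decodes the bytes as UTF-8 and reports malformed input
    // instead of silently substituting a replacement character.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaping "A&nbsp;B" yields a no-break space, U+00A0, between A and B.
        String unescaped = "A\u00A0B";

        // Assumption: ISO-8859-1 stands in for the single-byte platform encoding.
        byte[] singleByte = unescaped.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = unescaped.getBytes(StandardCharsets.UTF_8);

        System.out.println(singleByte.length + " bytes, valid UTF-8: " + isValidUtf8(singleByte));
        System.out.println(utf8.length + " bytes, valid UTF-8: " + isValidUtf8(utf8));
    }
}
```

The single-byte form contains a bare 0xA0, which a UTF-8 decoder rejects; the UTF-8 form encodes the same character as the two bytes 0xC2 0xA0.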
    fileReader = new BufferedReader(new FileReader(filePath));
    fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
should be
    fileReader = new BufferedReader(new InputStreamReader(
            new FileInputStream(filePath), StandardCharsets.UTF_8));
    fileWriter = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream("./out/UnescapedHtml.html"),
            StandardCharsets.UTF_8));
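For reference, the same fix can be sketched with try-with-resources and the java.nio.file.Files helpers, which take the charset explicitly. This is only one possible shape: the class name, the rewrite method, and the transform parameter are hypothetical, and the lambda in main is a stand-in for StringEscapeUtils.unescapeHtml4 so that the sketch runs without Commons Lang on the classpath.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

// Hypothetical helper class, not part of the original answer.
public class Utf8LineRewriter {

    // Reads 'in' as UTF-8, applies 'transform' to each line, writes 'out' as UTF-8.
    // Both streams are closed automatically, even if an exception is thrown.
    public static void rewrite(Path in, Path out, UnaryOperator<String> transform)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("escaped", ".html");
        Path out = Files.createTempFile("unescaped", ".html");
        Files.write(in, "<root><element>A&nbsp;B</element></root>"
                .getBytes(StandardCharsets.UTF_8));

        // Stand-in for StringEscapeUtils::unescapeHtml4; handles only &nbsp;.
        rewrite(in, out, line -> line.replace("&nbsp;", "\u00A0"));

        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

Unlike the question's loop, this writes each line before its line separator, so the output does not begin with an empty line; that matters when the first line is an XML declaration, which must sit at the very start of the file.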
Strictly speaking, one should first read the <?xml ...?> declaration and check whether it has an encoding attribute giving the charset; the default is UTF-8. That first read could be done with StandardCharsets.ISO_8859_1, since decoding as UTF-8 stumbles over invalid multi-byte sequences.
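A rough sketch of that idea follows: decode the first bytes of the file as ISO-8859-1, which accepts any byte, and look for an encoding attribute in the XML declaration. The class name, method name, and regular expression are illustrative assumptions; the sketch ignores byte-order marks and UTF-16 input, and an XML parser that is handed an InputStream rather than a Reader normally performs this detection on its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class XmlDeclarationSniffer {

    // Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the charset named in the declaration, or UTF-8 if none is
    // declared or the declared name is not supported by this JVM.
    public static Charset declaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so this decode cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<?xml version=\"1.0\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(declaredCharset(declared));
        System.out.println(declaredCharset(undeclared));
    }
}
```

The returned Charset could then be passed to an InputStreamReader in place of the hard-coded StandardCharsets.UTF_8.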
Using StandardCharsets instead of the String "UTF-8" does away with
- an UnsupportedEncodingException to handle,
- a magic constant.
The charsets in StandardCharsets are guaranteed to be supported.
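The difference shows up in even a one-line call. The sketch below (class and method names are made up for illustration) encodes the same string both ways: the String-based overload forces a checked exception on the caller, while the StandardCharsets overload does not.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
public class CharsetConstantDemo {

    // Charset given by name: the compiler insists the checked exception is
    // handled, and a typo in the name is only discovered at run time.
    public static byte[] encodeByName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Charset given by constant: no checked exception, no string literal.
    public static byte[] encodeByConstant(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "A\u00A0B";
        System.out.println(Arrays.equals(encodeByName(text), encodeByConstant(text)));
    }
}
```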