HtmlCleaner: handling Spanish characters
I am using the HtmlCleaner library to parse/convert HTML files in Java.
It does not seem able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.
Is there a property I can set in HtmlCleaner to handle this, or is there some other solution? Here's the code I'm using to invoke it:
CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);
HtmlCleaner uses the default character set read from the JVM unless you specify one. On Windows this will typically be Cp1252 (windows-1252), not UTF-8, which is probably where it's going wrong.
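To confirm which default the JVM is actually using on your machine, a quick check along these lines may help. This is only a diagnostic sketch; the class name DefaultCharsetCheck is made up for illustration, and it uses only standard JDK calls.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Returns the name of the charset the JVM falls back to when none is given.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // May be null or differ from the above on some JVM versions.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

If this prints something other than the encoding your HTML file was saved in, the characters will be decoded incorrectly before HtmlCleaner ever sees them.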
You can do one of the following:
- specify
-Dfile.encoding=UTF-8
on your JVM start line
- use the
HtmlCleaner.clean()
overload that accepts a character set:
TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
(this overload takes the charset name as a String, so if you've got Google Guava in the project you can use
Charsets.UTF_8.name()
for the constant)
- use the
HtmlCleaner.clean()
overload that accepts a Reader, passing an InputStreamReader which you've already constructed with the correct character set.
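To illustrate why the charset given to the reader matters, here is a minimal sketch using only the JDK. It decodes the same UTF-8 bytes once as UTF-8 and once as windows-1252. The class name CharsetDemo and the helper readAll are made up for this example; in real code you would hand the correctly constructed InputStreamReader to HtmlCleaner.clean() instead of reading it into a String.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Reads the whole stream as text, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A short Spanish word with accented letters, written with escapes
        // so the encoding of this source file does not affect the demo.
        String original = "\u00D1and\u00FA";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        // Decoding the same bytes as windows-1252 garbles the accented letters.
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), Charset.forName("windows-1252"));

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
    }
}
```

With a file on disk, the same idea would be new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8), assuming the file really is saved as UTF-8.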