Question mark (char 57399) added to HTML element text
Problem description
I've come across a problem that seems really weird to me.
I'm scraping a website using Jsoup:
Elements names = doc.select(".Mod.Thm-inherit").select("h3");
for (Element e : names) {
    System.out.println(e.text());
}
My output is (fantasy hockey team names, changed for simplicity):
Team One ?
Team Two ?
Team Three ?
Team Four ?
Team Five ?
//etc
The actual team names don't have the extra space or question mark. Thinking I could just replace it, I tried:
String str = e.text().replaceAll("\\?", "");
System.out.println(str);
This, however, still prints the question mark at the end. I suspect this means it's a character that Eclipse/Java doesn't recognize. (Note: it doesn't display a �; it really is just the generic ?.)
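A possible explanation for the stubborn ?, sketched under assumptions: if the trailing character is one the console's charset cannot encode, Java substitutes a literal ? only when the text is written out, so the string itself never contains a ? for replaceAll to match. The class name ConsoleSubstitutionDemo, the helper asAsciiConsoleWouldShow, and the use of U+E037 and US-ASCII as stand-ins are illustrative; the actual console encoding in the question is unknown.

```java
import java.nio.charset.StandardCharsets;

public class ConsoleSubstitutionDemo {

    // Mimics what an ASCII-only console does: characters that cannot be
    // encoded are replaced with '?' (byte 63) when the text is written out.
    static String asAsciiConsoleWouldShow(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.US_ASCII);
        return new String(encoded, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Assumption: U+E037 stands in for the unknown trailing character.
        String name = "Team One \uE037";

        // The pattern only matches a real question mark (code 63),
        // so the string comes back unchanged.
        String replaced = name.replaceAll("\\?", "");
        System.out.println(replaced.equals(name));

        // The '?' appears only once the text is encoded for output.
        System.out.println(asAsciiConsoleWouldShow(name));
    }
}
```

If this is what is happening, casting the last character to int (rather than printing it) is the reliable way to see what it really is.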
When I look at the HTML source, though, there are no extra characters:
<script charset="utf-8" type="text/javascript" language="javascript">
<!-- Bunch of HTML -->
<div class="Grid-u-1-2 Pend-xl"><h3 class="My-xl Ta-c Fz-lg"><a href="/hockey/27381/1">Team One</a>
Does anyone know why this is happening?
I was able to work around the issue quickly by taking a substring that drops the last 2 characters, but I'd still like to know why it happens.
Edit 2: Playing around with it more, I found that if I cast the question mark to (int), I get 57399 instead of the regular 63 for ?. So it is definitely some sort of unknown-character issue. I'm just not sure why it's being added or what that character is supposed to represent.
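For reference, 57399 is 0xE037 in hexadecimal, which falls in Unicode's Private Use Area (U+E000 to U+F8FF). Code points in that range have no standard meaning and are often assigned to icon-font glyphs, though nothing in the question confirms that is the case here. A minimal sketch for inspecting and stripping such characters with only the standard library follows; the class name PrivateUseStripper and the method stripPrivateUse are hypothetical, and the sample string is an assumption standing in for the scraped text.

```java
public class PrivateUseStripper {

    // Drops every private-use code point, then trims ordinary whitespace
    // left at either end of the string.
    static String stripPrivateUse(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
         .forEach(sb::appendCodePoint);
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: the scraped text ends with a space plus U+E037.
        String name = "Team One \uE037";
        char last = name.charAt(name.length() - 1);

        System.out.println((int) last);                // numeric value of the character
        System.out.println(Integer.toHexString(last)); // the same value in hex
        System.out.println(Character.getType(last) == Character.PRIVATE_USE);
        System.out.println(stripPrivateUse(name));
    }
}
```

Filtering by character category is less brittle than cutting off a fixed number of characters with substring, since it still works if the extra character is missing or appears elsewhere. Note that trim() only removes ordinary whitespace; if the leftover space were a non-breaking space, it would need separate handling.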
Recommended answer
I think there must be extra h3 elements containing strange characters inside your ".Mod.Thm-inherit" element.
For a complete solution you would need to provide more information, as @Jim Garrison said.
The following code:
String html ="<div class=\"Grid-u-1-2 Pend-xl\"><h3 class=\"My-xl Ta-c Fz-lg\"><a href=\"/hockey/27381/1\">Team One</a>";
Document doc = Jsoup.parse(html);
Elements names = doc.select("h3");
for (Element e : names) {
    System.out.println(e.text());
}
gives me the expected output, Team One, with no strange characters at all.
Hope it helps. Best regards.