Question mark (char 57399) added to HTML element text


Problem description

I've come across a problem that seems really weird to me.

I'm scraping a website using Jsoup:

Elements names = doc.select(".Mod.Thm-inherit").select("h3");

for (Element e : names) {
    System.out.println(e.text());
}

My output is (fantasy hockey team names, changed for simplicity):

Team One ?
Team Two ?
Team Three ?
Team Four ?
Team Five ? 
//etc

Now the actual team names don't have the extra space or question mark. Thinking I could just replace it, I tried:

String str = e.text().replaceAll("\\?", "");
System.out.println(str);

This however still outputs the question mark at the end. I'm thinking that this might mean that it's a character that Eclipse/Java doesn't recognize. (Note: It doesn't display a �, it's really just the generic ?)

When looking at the HTML code, there are no extra characters though:

<script charset="utf-8" type="text/javascript" language="javascript">
<!-- Bunch of HTML -->
<div class="Grid-u-1-2 Pend-xl"><h3 class="My-xl Ta-c Fz-lg"><a href="/hockey/27381/1">Team One</a>

Anyone know why this is happening?

I was quickly able to solve the issue by just doing a substring and removing the last 2 characters, but I'd still like to know why it's happening.
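
A rough sketch of that substring workaround, for illustration only. The helper name dropLast and the sample string are made up; the sample assumes each name ends in a space followed by one stray character (here char 57399, written as \uE037).

```java
class TrimSuffix {
    // Hypothetical helper: drops the last `count` chars,
    // or returns the input unchanged if it is shorter than that.
    static String dropLast(String s, int count) {
        if (s.length() < count) {
            return s;
        }
        return s.substring(0, s.length() - count);
    }

    public static void main(String[] args) {
        // Assumed sample: a team name, a space, and the stray character.
        String raw = "Team One \uE037";
        System.out.println(dropLast(raw, 2)); // prints "Team One"
    }
}
```

This only works while every name carries exactly two unwanted trailing chars; a name without them would lose real letters.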

Edit 2: Playing around with it more, I found that if I cast the question mark to (int), it gives me 57399 instead of the regular 63 for ?. So it's definitely some sort of unknown-character issue. I'm just not sure why it's being added or what that character is supposed to represent.
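
For what it's worth, 57399 is 0xE037 in hex, which lies in Unicode's Private Use Area (U+E000 to U+F8FF). Code points there have no standard meaning, and a console that cannot encode one typically prints a plain ? instead, which would explain why replacing "?" changes nothing. The (int) cast can be extended to dump every code point in a string. This is a minimal sketch: the class and method names are hypothetical, and the sample input assumes the stray character is the one reported above.

```java
class CodePointDump {
    // Hypothetical helper: one line per code point, showing the decimal
    // value, the U+XXXX form, and a marker if it is a private-use character.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb
            .append(cp)
            .append(String.format(" U+%04X", cp))
            .append(Character.getType(cp) == Character.PRIVATE_USE ? " private-use" : "")
            .append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: the two trailing chars of one printed team name.
        System.out.print(describe(" \uE037"));
    }
}
```

Running describe on e.text() for one of the scraped names would show whether the trailing character really is private-use or something else entirely.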

Recommended answer

I think there must be extra h3 elements containing strange characters inside your ".Mod.Thm-inherit" element.

For a complete solution you must provide more information, as @Jim Garrison said.

The following code:

    String html ="<div class=\"Grid-u-1-2 Pend-xl\"><h3 class=\"My-xl Ta-c Fz-lg\"><a href=\"/hockey/27381/1\">Team One</a>";
    Document doc = Jsoup.parse(html);
    Elements names = doc.select("h3");
    for (Element e : names) {
        System.out.println(e.text());
    }

gives me the expected output, Team One, with no strange characters at all.
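
If the live page does turn out to contain characters like char 57399 from the question, one option is to strip them from the extracted text rather than cutting a fixed number of chars. The sketch below assumes the unwanted characters are all Unicode private-use code points; the class and method names are hypothetical, and it works on a plain String so it can be applied to e.text().

```java
import java.util.regex.Pattern;

class PrivateUseStripper {
    // \p{Co} matches code points in the Unicode "private use" general category.
    private static final Pattern PRIVATE_USE = Pattern.compile("\\p{Co}");

    // Hypothetical helper: removes private-use characters,
    // then trims the whitespace they leave behind.
    static String clean(String text) {
        return PRIVATE_USE.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Assumed sample mirroring the output in the question.
        System.out.println(clean("Team One \uE037")); // prints "Team One"
    }
}
```

Unlike the substring approach, this leaves names without the stray character untouched.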

Hope it helps. Best regards.
