jsoup times out, XML parsing gives a white space error, and basic traversal of the page is time-consuming


Question

I would like to write a program that parses an HTML page, selects the useful information, and displays it. I first did this by opening a stream and searching it line by line for the relevant content, but that is a time-consuming process. So I decided to treat the page as XML and query it with XPath instead. I created an XML file on my system and loaded the contents from the stream into it, which produced a white space error. I then tried parsing the stream directly:

doc = (Document) builder.parse(inputStream);

The same error persisted. After asking here, I was advised to use jsoup for HTML parsing. Now, when I run:

Document doc= Jsoup.connect(url).get();

I get "Read timed out". The same program written in Python with a naive strategy (searching with the string find method) displays the contents, and does so quickly. How can I make it run fast in Java?

Full code:

import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Parser {
public static void main(String[] args) {
    Validate.isTrue(true, "usage: supply url to fetch");
    try{
        String url="http://www.spoj.com/ranks/PRIME1/";
        Document doc= Jsoup.connect(url).get();
        Elements es=doc.getElementsByAttributeValue("class","lightrow");
        System.out.println(es.get(0).child(0).text());


    }catch(Exception e){e.printStackTrace();}
}

}

Exception:

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:412)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at Parser.main(Parser.java:12)

Recommended answer

Does your firewall or OS block the request (perhaps it blocks Java's access to the internet)? Are you on a PC or, for example, Android? Is your HTML page a website or a local HTML file? Please post some more code or the exception you get.

Please make sure you are using org.jsoup.nodes.Document and not a DOM Document (org.w3c.dom.Document).
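To illustrate why a strict XML (DOM) parser tends to reject ordinary HTML, here is a minimal sketch using only the JDK. The class name StrictXmlCheck, the helper parsesAsXml, and the sample markup are made up for this illustration; the page from the question may fail for a different well-formedness reason than the unclosed tags shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether markup is well-formed XML.
class StrictXmlCheck {

    /** Returns true if the JDK's DOM parser accepts the markup, false if it rejects it. */
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler throws on fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag is closed: valid as XML.
        String wellFormed = "<html><body><p>first</p></body></html>";
        // Unclosed <br> and <p>: fine for browsers and jsoup, a fatal error for an XML parser.
        String typicalHtml = "<html><body><p>first<br><p>second</body></html>";

        System.out.println(parsesAsXml(wellFormed));  // true
        System.out.println(parsesAsXml(typicalHtml)); // false
    }
}
```

jsoup is built to tolerate this kind of markup, which is why its own Document type is the one to use here.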

Displaying the content

How do you want to display the content? If you simply need a value like this:

...
<div>some value</div>
...

You can do this with jsoup:

Document doc = ... // parse html file or connect to website

final String value = doc.select("div").first().text();

System.out.println(value);

Since the default timeout in the jsoup version used here is 3 seconds (3000 ms), it should be increased for large or slow websites, because loading the data may take some time:

final String url = "http://www.spoj.com/ranks/PRIME1/";
final int timeout = 4000; // milliseconds; use a higher value if needed

Document doc = Jsoup.connect(url).timeout(timeout).get();
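If a single fixed timeout is hard to pick, one option is to retry with a longer timeout after each SocketTimeoutException. The sketch below is an assumption-laden illustration, not part of jsoup: the TimeoutRetry class, the Fetcher interface, and fetchWithRetry are hypothetical names, and the slow server in main is simulated so the example runs without a network.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper: retries a fetch, doubling the timeout after each read timeout.
class TimeoutRetry {

    /** One fetch attempt with the given timeout in milliseconds (illustrative interface). */
    interface Fetcher<T> {
        T fetch(int timeoutMillis) throws IOException;
    }

    /**
     * Calls the fetcher up to maxAttempts times, doubling the timeout after each
     * SocketTimeoutException. Rethrows the last timeout if every attempt fails.
     */
    public static <T> T fetchWithRetry(Fetcher<T> fetcher, int initialTimeoutMillis, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int timeout = initialTimeoutMillis;
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch(timeout);
            } catch (SocketTimeoutException e) {
                last = e;      // remember the failure
                timeout *= 2;  // give the next attempt more time
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // Simulated slow server: only answers when given at least 10 seconds.
        Fetcher<String> slowServer = t -> {
            if (t < 10_000) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "fetched with timeout " + t;
        };
        // Attempts use 4000, 8000, then 16000 ms; the third one succeeds.
        System.out.println(fetchWithRetry(slowServer, 4000, 3)); // fetched with timeout 16000
    }
}
```

With jsoup on the classpath, the fetcher could be written as t -> Jsoup.connect(url).timeout(t).get(), so each retry gives the server more time to respond.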
