jsoup times out, XML parsing gives a white space error, and basic traversal of the page is time consuming
Problem description
I would like to write a program that parses an HTML page, selects the useful information, and displays it. I first did this by opening a stream and searching it line by line for the relevant content, but that is time consuming. I then decided to treat the page as XML and use XPath. To do this I created an XML file on my system and loaded the contents of the stream into it, which gave me a white space error, so I decided to parse the stream directly:
doc = (Document) builder.parse(inputStream);
but the same error persisted. After asking here, it was suggested that I use jsoup for HTML parsing. Now, when I run:
Document doc= Jsoup.connect(url).get();
I get "Read timed out". The same program written in Python, using a naive strategy such as the string find method, displays the contents, and does so quickly. How can I make it run fast in Java?
Full code:
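For comparison, the naive string-search strategy mentioned above might look roughly like this in Java. This is only a sketch: the class name, the `between` helper, and the sample HTML are hypothetical and are not taken from the original program.

```java
class NaiveSearchSketch {
    /**
     * Returns the trimmed text between the first occurrence of {@code open}
     * and the next occurrence of {@code close}, or null if either is missing.
     */
    static String between(String html, String open, String close) {
        int start = html.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for one row of the downloaded page.
        String html = "<tr class=\"lightrow\"><td>1</td><td>someuser</td></tr>";
        System.out.println(between(html, "<td>", "</td>")); // prints 1
    }
}
```

This kind of search is quick to write but breaks as soon as the markup changes, which is the usual reason to prefer an HTML parser.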
import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    public static void main(String[] args) {
        Validate.isTrue(true, "usage: supply url to fetch");
        try {
            String url = "http://www.spoj.com/ranks/PRIME1/";
            Document doc = Jsoup.connect(url).get();
            Elements es = doc.getElementsByAttributeValue("class", "lightrow");
            System.out.println(es.get(0).child(0).text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exception:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:412)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at Parser.main(Parser.java:12)
Recommended answer
Does your firewall or OS block the request (for example, by blocking Java's access to the internet)? Are you on a PC or, say, Android? Is your HTML page a website or a local HTML file? Please post more code or the exception you get.
Please make sure you are using org.jsoup.nodes.Document and not a DOM Document (org.w3c.dom.Document).
Displaying content
How do you want to display the content? If you simply need a value like this:
...
<div>some value</div>
...
you can do this with jsoup:
Document doc = ... // parse html file or connect to website
final String value = doc.select("div").first().text();
System.out.println(value);
Since the default timeout here is 3 seconds (3000 ms), it should be increased for large or slow websites, because loading the data may take some time:
final String url = "http://www.spoj.com/ranks/PRIME1/";
final int timeout = 4000; // milliseconds; use a higher value for slow sites
Document doc = Jsoup.connect(url).timeout(timeout).get();
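The stack trace in the question shows the read timing out inside HttpURLConnection, which jsoup uses underneath. As a rough illustration of what a timeout setting controls, the sketch below sets the connect and read timeouts on a plain connection with the standard library only. The class name, the `open` helper, and the 10-second value are assumptions made for illustration; this is not how jsoup itself is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class TimeoutSketch {
    /**
     * Prepares (but does not yet open) an HTTP connection whose connect and
     * read timeouts are both set to timeoutMillis.
     */
    static HttpURLConnection open(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMillis); // time allowed to establish the connection
            conn.setReadTimeout(timeoutMillis);    // time allowed to wait for data once connected
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // No network traffic happens until the response is actually requested.
        HttpURLConnection conn = open("http://www.spoj.com/ranks/PRIME1/", 10_000);
        System.out.println("read timeout (ms): " + conn.getReadTimeout());
    }
}
```

If a generous timeout still ends in "Read timed out", the cause is more likely the network or the server than the parsing code.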