Using Jsoup, how can I fetch all of the information that resides in each link?


Problem description



     package com.muthu;
     import java.io.IOException;
     import org.jsoup.Jsoup;
     import org.jsoup.helper.Validate;
     import org.jsoup.nodes.Document;
     import org.jsoup.nodes.Element;
     import org.jsoup.select.Elements;
     import org.jsoup.select.NodeVisitor;
     import java.io.BufferedWriter;
     import java.io.File;
     import java.io.FileWriter;
     import java.io.IOException;
     import org.jsoup.nodes.*;
     public class TestingTool 
     {
        public static void main(String[] args) throws IOException
        {
    Validate.isTrue(args.length == 0, "usage: supply url to fetch");
            String url = "http://www.stackoverflow.com/";
            print("Fetching %s...", url);
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");
            System.out.println(doc.text());
            Elements tags=doc.getElementsByTag("div");
            String alls=doc.text();
            System.out.println("\n");
            for (Element link : links)
            {
        print("  %s  ", link.attr("abs:href"), trim(link.text(), 35));
            }
            BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:/tool                 
            /linknames.txt")));        
         for (Element link : links) {
            bw.write("Link: "+ link.text().trim());
        bw.write(System.getProperty("line.separator"));       
       }    
      bw.flush();     
      bw.close();
    }           }
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
        }

Solution

If you connect to a URL, it will only parse the current page. But you can 1) connect to a URL, 2) parse the information you need, 3) select all further links, 4) connect to them, and 5) continue this as long as there are new links.

Considerations:

  • You need a list (or some other collection) where you store the links you have already parsed
  • You have to decide whether you need only links from this page or external ones too (see the same-host sketch after this list)
  • You have to skip pages like "about", "contact", etc.
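
For the second point, here is a minimal sketch of a same-host check (my own illustration, not part of the original answer; the class name HostFilter is hypothetical, and it relies on java.net.URI):

import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: keeps only links that stay on the starting host.
public class HostFilter
{
    private final String startHost;

    public HostFilter(String startUrl) throws URISyntaxException
    {
        this.startHost = new URI(startUrl).getHost(); // e.g. "www.stackoverflow.com"
    }

    public boolean isInternal(String url)
    {
        try {
            String host = new URI(url).getHost();
            return startHost != null && startHost.equalsIgnoreCase(host);
        } catch (URISyntaxException e) {
            return false; // treat malformed links as external and skip them
        }
    }
}

With that in place, the recursive call below would become: if (filter.isInternal(next.absUrl("href"))) visitUrl(next.absUrl("href"));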

Edit:
(Note: you have to add some changes / error-handling code)

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // now it's case-insensitive

    if( !visitedUrls.contains(url) ) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this Url, or cycles between pages will recurse forever

        Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document

        /* ... Select your Data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!

        for( Element next : nextLinks ) // Iterate over all Links
        {
            visitUrl(next.absUrl("href")); // Recursive call for all next Links
        }
    }
}

You have to add more restrictions / checks at the part where the next links are selected (maybe you want to skip / ignore some), and add some error handling.
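
One possible shape for that error handling is sketched below (my own, untested variant of the method above; the 5-second timeout and the skip-on-failure policy are assumptions):

public void visitUrl(String url)
{
    url = url.toLowerCase();

    if( !visitedUrls.contains(url) )
    {
        visitedUrls.add(url);

        Document doc;
        try {
            doc = Jsoup.connect(url).timeout(5000).get(); // timeout(millis) comes from Jsoup's Connection API
        } catch (IOException e) {
            System.err.println("Skipping " + url + ": " + e.getMessage());
            return; // one unreachable or non-HTML link should not abort the whole crawl
        }

        /* ... Select your Data here ... */

        for( Element next : doc.select("a[href]") )
        {
            visitUrl(next.absUrl("href"));
        }
    }
}

Note that the method no longer declares throws IOException, because the exception is handled where the fetch happens.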


Edit 2:

To skip ignored links you can use this:

  1. Create a Set / List / whatever, where you store ignored keywords
  2. Fill it with those keywords
  3. Before you call the visitUrl() method with the new link to parse, check whether this new URL contains any of the ignored keywords. If it contains at least one, it will be skipped.

I modified the example a bit to do so (but it's not tested yet!).

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
Set<String> ignore = new HashSet<>(); // Store all keywords you want to ignore

// ...


/*
 * Add keywords to the ignore list. Each link that contains one of these
 * words will be skipped.
 *
 * Do this e.g. in the constructor, a static block, or an init method.
 */
ignore.add(".twitter.com");

// ...


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Now it's case-insensitive

    if( !visitedUrls.contains(url) ) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this Url, or cycles between pages will recurse forever

        Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document

        /* ... Select your Data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!

        for( Element next : nextLinks ) // Iterate over all Links
        {
            boolean skip = false; // If false: parse the url, if true: skip it
            final String href = next.absUrl("href"); // Select the 'href' attribute -> next link to parse

            for( String s : ignore ) // Iterate over all ignored keywords - maybe there's a better solution for this
            {
                if( href.contains(s) ) // If the url contains an ignored keyword it will be skipped
                {
                    skip = true;
                    break;
                }
            }

            if( !skip )
                visitUrl(next.absUrl("href")); // Recursive call for all next Links
        }
    }
}

Parsing the next link is done by this:

final String href = next.absUrl("href");
/* ... */
visitUrl(next.absUrl("href"));

But possibly you should add some more stop-conditions to this part.
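
A simple stop condition is a depth limit; here is a sketch (the extra depth parameter and the limit of 3 are my assumptions, not part of the original answer):

private static final int MAX_DEPTH = 3; // assumption: stop following links after three levels

public void visitUrl(String url, int depth) throws IOException
{
    if( depth > MAX_DEPTH )
        return; // stop condition: this branch of the crawl ends here

    url = url.toLowerCase();

    if( !visitedUrls.contains(url) )
    {
        visitedUrls.add(url);

        Document doc = Jsoup.connect(url).get();

        /* ... Select your Data here ... */

        for( Element next : doc.select("a[href]") )
        {
            visitUrl(next.absUrl("href"), depth + 1); // each followed link is one level deeper
        }
    }
}

The initial call would then be visitUrl(startUrl, 0).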
