How to "scan" a website (or page) for info, and bring it into my program?


Problem description



Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).

For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?

What would this process even be called? I have no idea where to even begin researching this.

Edit: Okay, I'm running a test with Jsoup (the one posted by BalusC), but I keep getting this error:

Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons.

Solution

Use an HTML parser like Jsoup. It has my preference over the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with verbose Node and NodeList-like classes as in the average Java DOM parser).

Here's a basic kick-off example (just put the latest Jsoup JAR file on the classpath):

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM-like Document.
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        // Grab the question body via a CSS selector and print its text.
        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        // Grab each answerer's profile link and print the display name.
        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.
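
For the product-page scenario from the original question, the approach is exactly the same: fetch the page, then pull out the pieces you want with CSS selectors. The sketch below is only illustrative, not a working scraper for any particular shop: the class name ProductPageExample, the URL, and the selectors (.product-title, .price, #description) are made-up placeholders, because the real class and id names depend entirely on the HTML of the page you are scraping (check the page source or your browser's developer tools to find them).

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProductPageExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical product page; replace with the real URL you want to read.
        String url = "https://www.example.com/some-product-page";

        // Fetch and parse the page. Setting a user agent and timeout is optional,
        // but some servers reject requests sent with the default user agent.
        Document document = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();

        // Placeholder selectors: inspect the page's HTML to find the elements
        // that actually contain the title, price and description.
        String title = document.select(".product-title").text();
        String price = document.select(".price").text();
        String description = document.select("#description").text();

        System.out.println("Title: " + title);
        System.out.println("Price: " + price);
        System.out.println("Description: " + description);
    }

}

Note that Elements#text() simply returns an empty string when a selector matches nothing, so a wrong selector fails quietly rather than with an exception. This kind of extraction is generally called web scraping (or screen scraping), which should give you a term to research further; keep in mind that selectors break whenever the page layout changes, and that some sites forbid scraping in their terms of use.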
