Finding a word in a web page using Java

Question

I am trying to search for a specific word in a specific web page, using Java and Eclipse. If the web page has almost no content, it works fine, but when I try it on a "big" web page, the word is not found.

For example, I am trying to find the word ["InitialChatFriendsList" in the web page https://www.facebook.com; if the word is found, the program should print WIN!!!

Here is the full Java code:

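import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class BR4Qustion {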
    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            URL url = new URL("https://www.facebook.com");  
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String foundWord = "[\"InitialChatFriendsList\"";          
            String sCurrentLine;

            while ((sCurrentLine = br.readLine()) != null) {
                String[] words = sCurrentLine.split(",");
                for (String word : words) {         
                    if (word.equals(foundWord)) {
                        System.out.println("WIN!!!");
                        break;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null)
                    br.close();
            } catch (IOException ex) {
                System.out.println("*** IOException for URL : ");
            }
        }
    }
}


Recommended answer

Problem

Besides some small flaws in your code (you should use try-with-resources and the newer NIO library), it looks fine and does not seem to contain a logical error.
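The try-with-resources style mentioned above could look roughly like the following. This is only a sketch: the class WordFinder and the helper containsWord are hypothetical names, and the helper uses substring matching (String.contains) instead of the split/equals comparison from the question, which is an assumption about what "finding the word" should mean. A string stands in for the page so the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class WordFinder {
    // Hypothetical helper: returns true if any line read from the source contains the word.
    static boolean containsWord(Reader source, String word) {
        // try-with-resources closes the reader automatically, even if an exception is thrown
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(word)) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for page content; a real run could instead pass
        // new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)
        String page = "{\"a\":1,[\"InitialChatFriendsList\",{\"list\":[]}]}";
        if (containsWord(new StringReader(page), "[\"InitialChatFriendsList\"")) {
            System.out.println("WIN!!!");
        }
    }
}
```

Because the helper accepts any Reader, the same method can be fed from a URL stream, a file, or a test string.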

You are facing a different problem here. To read Facebook you first need to log in to your account; otherwise you will only see the start page.

I guess you think it is enough to log in from your browser (for example Google Chrome), but that is not the case. Login information is saved in the local storage of the specific browser you used, for example in its cookies. This is called a session.

As a small experiment, visit Facebook with Google Chrome and log in. Then visit it with Internet Explorer: you will not be logged in there and will see the start page again.

The same happens with your Java code. You are simply reading the start page, because as far as "Java's browser" is concerned you are not logged in. You can check this by dumping the content your BufferedReader is reading:

final URL url = new URL("https://www.facebook.com");
try (final BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
    // Read the whole page
    while (true) {
        final String line = br.readLine();
        if (line == null) {
            break;
        }

        System.out.println(line);
    }
}

Take a look at the output; it will probably be the source of the start page.

After I log in to Facebook via my browser, the website sends me several cookies. Among them, the c_user cookie is definitely relevant for the session: if I delete it and refresh the page, I am no longer logged in.

For this to work, your Java code would need to log in itself by filling in the form and submitting it (or by sending the corresponding POST request), then reading Facebook's response and saving all the cookie information. Doing this by hand would be a huge task, and I would not recommend it. Instead you could use an API that emulates a browser from inside Java, for example HtmlUnit. Alternatively, you could use a library like Selenium, which lets you control your favorite browser directly through its driver interface.
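For the cookie bookkeeping part alone, the JDK's java.net.CookieManager can do the saving for URLConnection-based requests. The sketch below is illustrative only: the class SessionCookies and the method install are hypothetical names, the site and cookie values are placeholders, and the cookie is added by hand to show where a server-set cookie would end up. It does not perform any login.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;

class SessionCookies {
    // Installs a JVM-wide cookie manager so that URLConnection-based requests
    // store cookies from responses and send them back on later requests.
    static CookieManager install() {
        // null store means the default in-memory cookie store is used
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = install();

        // Simulate a cookie that a server might set after a login (placeholder values).
        URI site = URI.create("https://example.com/");
        manager.getCookieStore().add(site, new HttpCookie("session", "placeholder"));

        // List everything currently held in the store.
        List<HttpCookie> cookies = manager.getCookieStore().getCookies();
        for (HttpCookie cookie : cookies) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }
    }
}
```

This covers only the storage of cookies; submitting the login form and handling everything else a real browser does is still the hard part, which is why a browser-emulating library is suggested above.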

The other approach would be to hijack the session: you extract the relevant cookie data from your browser's local files and recreate the same cookies inside your Java application. Without an API, this is also a huge task for a non-expert.
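On the Java side, recreating cookies comes down to sending a Cookie request header. A minimal sketch, assuming the cookie names and values have already been obtained somehow: the class CookieHeader, its methods, and all values are hypothetical placeholders, and locating and decoding the browser's cookie files is not shown.

```java
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieHeader {
    // Joins cookie name/value pairs into a single Cookie request header value,
    // for example "name1=value1; name2=value2".
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Attaches the cookies to a connection; must be called before the connection is used.
    static void apply(URLConnection connection, Map<String, String> cookies) {
        connection.setRequestProperty("Cookie", build(cookies));
    }

    public static void main(String[] args) {
        // Placeholder values only; a LinkedHashMap keeps the insertion order.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("c_user", "PLACEHOLDER_ID");
        cookies.put("other_cookie", "PLACEHOLDER_VALUE");
        System.out.println(build(cookies));
    }
}
```

A site may tie a session to more than its cookies, so copying them is not guaranteed to be enough.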

Most importantly, note that Facebook (like other websites such as Twitter) has a publicly available API (Facebook for Developers) designed to ease interaction with automated software. Java API wrappers are also available, for example Facebook4J. So if you want to get data from sites like Facebook, you should just use those APIs.

Also note that many sites, including Facebook, state in their Terms of Service (TOS) that interaction via automated software that does not use their API is treated as a violation of those terms. It could result in legal consequences.

Excerpt from the Terms of Service:



  1. Safety

  1. You will not collect users' content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our prior permission.


