Reading directly from a URL in Java

Problem description

When I print the contents of http://www.amazon.com/s/ref=sr_pg_3?rh=n%3A172282&page=1, I see different HTML than what's displayed when utilizing the "View Source" feature in my browser (Chrome, in my case, though I don't think the exact browser matters). For example, the div with id "result_10" from the aforementioned URL appears like this in one's browser:

<div id="result_10" class="rsltGrid prod" name="B007I5JT4S">

But when printing the same web page contents with Java's java.net.URL utility, the same div appears like this:

<div class="result product" id="result_10" name="B007I5JT4S">

This is just one of the many differences in identifiers and page structure between the HTML produced by programmatically reading this page and using a browser. I'm not sure if this stems from some sort of URL resolution issue or something entirely different.

How can I acquire the same page content I see in my browser from a Java app?

Here's the function I've been using to read URLs, with "http://www.amazon.com/s/ref=sr_pg_3?rh=n%3A172282&page=1" being the argument in question.

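import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static void printWebPageContents(String url) throws IOException {
    URL specifiedUrl = new URL(url);
    // openStream() issues a GET with the JVM's default request headers.
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(specifiedUrl.openStream()))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
    }
}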

Don't hesitate to let me know if any clarification is needed.

Solution

I wouldn't be surprised if it had to do with your User Agent. I don't know what the default is for URL.openStream, but I doubt it's the same as Chrome.
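
To try this out, here is a minimal sketch that sends a browser-like User-Agent header instead of Java's default (which is typically of the form Java/<version>). The class name UserAgentFetch, the helper openWithUserAgent, and the sample User-Agent string are assumptions made for illustration; substitute the string your own browser sends. The server may also vary its output on other headers or cookies, so this is not guaranteed to reproduce the browser's HTML exactly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {
    // Hypothetical browser-like value (an assumption); replace it with the
    // User-Agent string your own browser actually sends.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Creates a connection object and sets the User-Agent request header.
    // No network traffic happens until the caller reads from the connection.
    static URLConnection openWithUserAgent(String url, String userAgent) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    // Same behavior as the question's printWebPageContents, except that the
    // request carries the browser-like User-Agent header.
    // Assumes the page is UTF-8 encoded.
    static void printWebPageContents(String url) throws IOException {
        URLConnection connection = openWithUserAgent(url, BROWSER_USER_AGENT);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL to fetch as the first command-line argument.
        if (args.length > 0) {
            printWebPageContents(args[0]);
        }
    }
}
```

The only change from the original function is that it goes through openConnection() and setRequestProperty() rather than openStream(), which offers no way to set request headers.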
