Reading directly from a URL in Java
Question
When I print the contents of http://www.amazon.com/s/ref=sr_pg_3?rh=n%3A172282&page=1, I see different HTML than what's displayed when using the "View Source" feature in my browser (Chrome, in my case, though I don't think the exact browser matters). For example, the div with id "result_10" from the aforementioned URL appears like this in my browser:
<div id="result_10" class="rsltGrid prod" name="B007I5JT4S">
But when printing the same web page contents with Java's java.net.URL utility, the same div appears like this:
<div class="result product" id="result_10" name="B007I5JT4S">
This is just one of the many differences in identifiers and page structure between the HTML produced by programmatically reading this page and using a browser. I'm not sure if this stems from some sort of URL resolution issue or something entirely different.
How can I acquire the same page content I see in my browser from a Java app?
Here's the function I've been using to read URLs, with "http://www.amazon.com/s/ref=sr_pg_3?rh=n%3A172282&page=1" being the argument in question.
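import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static void printWebPageContents(String url) throws IOException {
    URL specifiedUrl = new URL(url);
    // try-with-resources closes the reader even if reading throws
    try (BufferedReader in = new BufferedReader(new InputStreamReader(specifiedUrl.openStream()))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
    }
}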
Don't hesitate to let me know if any clarification is needed.
Solution
I wouldn't be surprised if it had to do with your User-Agent. I don't know what the default is for URL.openStream, but I doubt it's the same as Chrome's.
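One way to test the User-Agent theory is to open the connection explicitly and set the header before reading. The following is a minimal sketch, not a guaranteed fix: the class name UserAgentFetch, the helper openWithUserAgent, and the placeholder User-Agent string are all assumptions made for illustration, and the page is assumed to be UTF-8. As far as I know, Java's built-in HTTP handler identifies itself as something like "Java/<version>" by default, which a site may well treat differently from a browser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {
    // Placeholder value (an assumption); substitute the string your own browser sends.
    static final String EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    // Prepares a connection carrying the given User-Agent header.
    // openConnection() does not contact the server; that happens on getInputStream().
    static URLConnection openWithUserAgent(String url, String userAgent) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    // Same behavior as the question's printWebPageContents, but with an explicit User-Agent.
    static void printWebPageContents(String url, String userAgent) throws IOException {
        URLConnection connection = openWithUserAgent(url, userAgent);
        // UTF-8 is assumed here; a real page may declare a different charset.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java UserAgentFetch <url>");
            return;
        }
        printWebPageContents(args[0], EXAMPLE_USER_AGENT);
    }
}
```

To compare like with like, copy the exact User-Agent string your browser sends (visible in its developer tools) into the call. If the HTML still differs, other request headers or cookies may be involved, and content added by JavaScript would not appear in either raw response.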