Pulling HTML from a Webpage in Java

Question

I want to pull the entire HTML source of a web page in Java (or in Python or PHP, if it is easier in those languages). I only want to read the HTML and scan through it with a few methods, not edit or manipulate it in any way, and I would rather not write it to a new file unless there is no other way. Are there any library classes or methods that do this? If not, is there another way to go about it?

Recommended answer

In Java:

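import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://stackoverflow.com");
URLConnection connection = url.openConnection();   // URLConnection is abstract; obtain it from the URL
InputStream stream = connection.getInputStream();  // the page's HTML arrives on this stream
// ... read stream like any file stream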
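
Since the question asks to scan the HTML without writing it to a file, here is a minimal sketch of reading the stream straight into a String held in memory. The class name PageSource and the helpers readAll and fetch are hypothetical names chosen for illustration, and the code assumes the page is UTF-8 encoded, which a real page may not be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class PageSource {

    // Reads the whole stream into a String, joining lines with "\n".
    // Assumption: the content is UTF-8. The caller is responsible for closing the stream.
    static String readAll(InputStream in) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines().collect(Collectors.joining("\n"));
    }

    // Fetches a page and returns its HTML; nothing is written to disk.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        try (InputStream in = url.openConnection().getInputStream()) {
            return readAll(in);
        }
    }
}
```

A call such as PageSource.fetch("http://stackoverflow.com") would then return the source as a String that can be searched with the usual methods, for example contains or indexOf.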

This code is fine for scripting and internal use, but I would argue against using it in production: it doesn't handle timeouts or failed connections.
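
To make the timeout and failure concerns concrete, here is a rough sketch that stays with the JDK's own HttpURLConnection. It sets connect and read timeouts, checks the status code, and reports any failure as an empty Optional. The class name TimeoutFetch, the method fetch, and the choice to return Optional are illustrative assumptions, not part of the original answer, and the code assumes an http or https address and Java 9 or later (for readAllBytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class TimeoutFetch {

    // Returns the page body, or Optional.empty() if the URL is malformed,
    // the connection fails or times out, or the status is not 200.
    // Assumption: the address uses http or https, so the cast below is valid.
    static Optional<String> fetch(String address, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) new URL(address).openConnection();
            connection.setConnectTimeout(timeoutMillis); // limit on establishing the connection
            connection.setReadTimeout(timeoutMillis);    // limit on waiting for data
            if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return Optional.empty();
            }
            try (InputStream in = connection.getInputStream()) {
                // Assumption: the page is UTF-8 encoded.
                return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            // Covers MalformedURLException, SocketTimeoutException, refused connections, etc.
            return Optional.empty();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
```

Swallowing the exception keeps the sketch short; real code would probably log it or rethrow so the caller can tell a timeout from a bad URL.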

For production use I would recommend the HttpClient library. It supports authentication, redirect handling, threading, connection pooling, etc.
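
The answer does not show HttpClient code, and it most likely refers to the Apache HttpClient library. As a stand-in that needs no extra dependency, the sketch below uses the JDK's built-in java.net.http.HttpClient (Java 11 and later), which is a different API. The class name HttpClientFetch, the helper names, and the timeout values are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HttpClientFetch {

    // Builds a GET request with a 10-second overall timeout (an arbitrary choice).
    static HttpRequest buildRequest(String address) {
        return HttpRequest.newBuilder(URI.create(address))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Sends the request and returns the body as a String held in memory.
    static String fetch(String address) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))       // arbitrary connect timeout
                .followRedirects(HttpClient.Redirect.NORMAL) // follow redirects, except https to http
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(address), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

In a longer-running program the client would normally be created once and reused, since it manages its own connection pool.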
