How to extract all URLs from an HTML webpage

Views: 158 Published: 2019/6/7 11:11:07 Tags: Java HTML HTTP

Problem description

I have the response below, which I got by sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.

HTTP/1.1 200 OK
Date: Thu, 25 Apr 2019 06:31:21 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Thu, 11 Apr 2019 11:44:58 GMT
ETag: "59-5863fb73cdcbb"
Accept-Ranges: bytes
Content-Length: 89
Vary: Accept-Encoding
Connection: close
Content-Type: text/html

<html>
	<body>
		<a href="/"> More pages </a>
		<img src="redback.jpg">
	</body>
</html>
Connection closed by foreign host.

I have to write simple Java code that crawls all the URLs present on this page (/k/302.html).
Currently I am able to extract the first URL ("/") using this Java regular expression:

"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"

But I am not able to get the second URL, the one in the <img> tag.

Below is the expanded HTML that I got from the console, which clearly shows "redback.jpg" as a hyperlink.

<span class="html-tag"><img <span class="html-attribute-name">src</span>="<a class="html-attribute-value html-resource-link" target="_blank" href="redback.jpg" rel="noreferrer noopener">redback.jpg</a>"></span>

But if we look at the GET response itself, it does not clearly indicate that "redback.jpg" is a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection with standard HTTP requests and without any external libraries.

What I have tried:

For simple URLs I tried the Java regex

"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"

but I do not see how to handle URLs embedded in other tags, because the HTTP GET response contains no href for them.
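One possible approach, sketched under assumptions: the raw response never wraps redback.jpg in an <a> tag. The source viewer only renders it as a link because src, like href, is an attribute that carries a URL. A crawler can therefore look for both attributes in any tag rather than only href inside <a>. The class and method names below (LinkExtractor, bodyOf, extractUrls) are hypothetical, and the regexes are a simplification that will miss edge cases such as a ">" inside a quoted attribute value or links inside comments and scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, standard library only.
class LinkExtractor {

    // An opening tag: "<", a letter, then everything up to the next ">".
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    // href=... or src=... inside a tag; the value may be double-quoted,
    // single-quoted, or unquoted. The leading \s keeps "data-src" out.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\s(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the part of a raw HTTP response after the first blank line,
    // which separates the headers from the body.
    static String bodyOf(String response) {
        int i = response.indexOf("\r\n\r\n");
        if (i >= 0) {
            return response.substring(i + 4);
        }
        i = response.indexOf("\n\n");
        return i >= 0 ? response.substring(i + 2) : response;
    }

    // Collects every href/src value, in document order, exactly as written
    // (relative URLs are not resolved here).
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Matcher attr = URL_ATTR.matcher(tag.group());
            while (attr.find()) {
                // Exactly one of the three capture groups is non-null.
                for (int g = 1; g <= 3; g++) {
                    if (attr.group(g) != null) {
                        urls.add(attr.group(g));
                    }
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
                + "<html>\n\t<body>\n\t\t<a href=\"/\"> More pages </a>\n"
                + "\t\t<img src=\"redback.jpg\">\n\t</body>\n</html>\n";
        System.out.println(extractUrls(bodyOf(response)));
    }
}
```

Run against the response shown in the question, this prints [/, redback.jpg]. Both values are relative, so a crawler would still need to resolve them against the page's own URL (/k/302.html) before requesting them.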
