How to extract all URLs from an HTML webpage


Problem description

Below is the response I got by sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.

HTTP/1.1 200 OK
Date: Thu, 25 Apr 2019 06:31:21 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Thu, 11 Apr 2019 11:44:58 GMT
ETag: "59-5863fb73cdcbb"
Accept-Ranges: bytes
Content-Length: 89
Vary: Accept-Encoding
Connection: close
Content-Type: text/html

<html>
	<body>
		<a href="/"> More pages </a>
		<img src="redback.jpg">
	</body>
</html>
Connection closed by foreign host.

I have to write simple Java code that crawls all the URLs present on this page (/k/302.html).
Currently I am able to extract the first URL ("/") using this Java regular expression:

"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"





But I am not able to get the second URL, the one in the <img> tag.

Below is the expanded HTML that I got from the console, which clearly shows "redback.jpg" as a hyperlink.

<span class="html-tag"><img <span class="html-attribute-name">src</span>="<a class="html-attribute-value html-resource-link" target="_blank" href="redback.jpg" rel="noreferrer noopener">redback.jpg</a>"></span>





But the GET response itself does not clearly indicate that this is a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection with a standard HTTP request and without any external libraries.



What I have tried:

For simple URLs I tried the Java regex

"<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>"

but I cannot work out how to handle the embedded href tags, because the HTTP GET response does not contain that information.

Recommended answer

You can search for href= to get relative URLs. Basically, if a regex doesn't work, work out the cases it misses and handle them with plain string manipulation.
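One way to apply that advice to the page in the question is to match the attribute rather than the whole tag, so that src= in the <img> tag is picked up alongside href= in the <a> tag. The following is only a rough sketch using the standard java.util.regex package. The class name LinkExtractor and the method extractUrls are made up for illustration, and it assumes the response body has already been separated from the HTTP headers. A regex is not a real HTML parser, so it will also match attribute-like text inside comments or scripts, and names that merely end in "-src" or "-href".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
public class LinkExtractor {

    // Matches href=... or src=... anywhere in the text. The value may be
    // double-quoted (group 1), single-quoted (group 2), or unquoted (group 3).
    private static final Pattern URL_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the raw attribute values in document order, without resolving
    // relative URLs against the page address.
    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g).trim());
                    break;
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // The body of the response shown in the question.
        String body = "<html>\n\t<body>\n\t\t<a href=\"/\"> More pages </a>\n"
                    + "\t\t<img src=\"redback.jpg\">\n\t</body>\n</html>\n";
        System.out.println(extractUrls(body)); // prints [/, redback.jpg]
    }
}
```

The returned values are still relative ("/" and "redback.jpg"), so a crawler would need to resolve them against the address of the page they came from before requesting them.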

