How to extract all URLs from an HTML webpage
Question
I received the response below after sending a GET request (GET /k/302.html HTTP/1.0) to a server over a Java socket connection.
HTTP/1.1 200 OK
Date: Thu, 25 Apr 2019 06:31:21 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Thu, 11 Apr 2019 11:44:58 GMT
ETag: "59-5863fb73cdcbb"
Accept-Ranges: bytes
Content-Length: 89
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
<html>
<body>
<a href="/"> More pages </a>
<img src="redback.jpg">
</body>
</html>
Connection closed by foreign host.
I have to write simple Java code that crawls all the URLs present on this page (/k/302.html).
Currently I am able to extract the first URL ("/") using the Java regular expression "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>".
But I am not able to get the second URL, the one in the <img> tag.
Below is the expanded HTML content that I got from the console, where it clearly shows that "redback.jpg" is a hyperlink.
<span class="html-tag"><img <span class="html-attribute-name">src</span>="<a class="html-attribute-value html-resource-link" target="_blank" href="redback.jpg" rel="noreferrer noopener">redback.jpg</a>"></span>
But the GET response itself does not clearly indicate that "redback.jpg" is a hyperlink. How can I extract such URLs from the response alone? I have to do this in plain Java, using a socket connection with a standard HTTP request and without any external libraries.
What I have tried:
For simple URLs I tried the Java regex "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>", but I do not see how to handle the embedded href, because that information is not present in the HTTP GET response.
Recommended answer
You can search for href= to get relative URLs. Basically, if a regex doesn't work, work out the cases it misses and handle them with plain string manipulation.
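As a rough sketch of that idea, and assuming the goal is to collect the value of every href and src attribute regardless of which tag carries it, the pattern can key on the attribute name instead of on <a. The class name LinkExtractor and the host example.com are hypothetical and used only for illustration. A regex like this is not a full HTML parser, so it can still be fooled by comments, scripts, or unusual markup.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original question or answer.
class LinkExtractor {
    // Matches href=... or src=... with a double-quoted, single-quoted,
    // or unquoted value. The lookbehind skips names such as data-src.
    private static final Pattern LINK_ATTR = Pattern.compile(
        "(?<![\\w-])(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the raw attribute values in document order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the three capture groups is non-null per match.
            String value = m.group(1) != null ? m.group(1)
                         : m.group(2) != null ? m.group(2)
                         : m.group(3);
            links.add(value.trim());
        }
        return links;
    }

    public static void main(String[] args) {
        // Body of the response shown in the question.
        String body = "<html>\n<body>\n<a href=\"/\"> More pages </a>\n"
                    + "<img src=\"redback.jpg\">\n</body>\n</html>\n";
        // Assumed base URL; the real host is not given in the question.
        URI base = URI.create("http://example.com/k/302.html");
        for (String link : extract(body)) {
            // resolve() turns a relative reference into an absolute URL;
            // it throws IllegalArgumentException for malformed values.
            System.out.println(link + " -> " + base.resolve(link));
        }
    }
}
```

For the page in the question this yields "/" and "redback.jpg". Resolving each value against the URL of the page it came from gives absolute URLs that a crawler can request next.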