Using awk, sed or grep to parse URLs from webpage source


Problem description


I am trying to parse the source of a downloaded web page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:

This seems to leave out parts of the URL from some of the page names.

$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3

This gets all of the URLs, but I do not want to include links that are or contain anchors. Also, I want to be able to restrict matches to domain.org/folder/:

$ awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' file.html

Solution

If you are only parsing something like <a> tags, you could just match the href attribute like this:

$ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq

That will ignore anchors and also guarantee that you have unique entries. This does assume that the page is well-formed (X)HTML, but you could pass it through Tidy first.
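The answer above does not cover the question's other requirement, restricting results to a path such as domain.org/folder/. That can be handled by appending one more grep stage to the same pipeline. The sketch below is an illustration, not part of the accepted answer: the sample HTML is made up, and `domain.org/folder/` is the question's own placeholder prefix; `sort -u` is used as shorthand for `sort | uniq`.

```shell
# Sample page to demonstrate on (stands in for the question's file.html).
cat > file.html <<'EOF'
<a href="http://domain.org/folder/a.html">A</a>
<a href="#top">back to top</a>
<a href="http://other.org/x">elsewhere</a>
<a href="http://domain.org/folder/b.html">B</a>
EOF

# Extract href values, drop pure-anchor links (href="#..." never matches,
# because the character class excludes both " and #), de-duplicate,
# then keep only URLs under the wanted prefix.
grep -o -E 'href="([^"#]+)"' file.html \
  | cut -d'"' -f2 \
  | sort -u \
  | grep -E '^(https?://)?domain\.org/folder/'
# Prints:
#   http://domain.org/folder/a.html
#   http://domain.org/folder/b.html
```

Note that the `[^"#]+` class also discards links whose URL merely contains a fragment (e.g. page.html#section) rather than trimming the fragment off, which matches the behavior of the original answer.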

