Using awk, sed or grep to parse URLs from webpage source


Problem description


I am trying to parse the source of a downloaded web page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:

This seems to leave out parts of the URL from some of the page names.

$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3

This gets all of the URLs, but I do not want to include links that are or contain anchors. Also, I want to be able to restrict matches to domain.org/folder/:

$ awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' file.html

Solution

If you are only parsing something like <a> tags, you could just match the href attribute like this:

$ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq

That will ignore anchors and also guarantee that you have unique entries. This does assume that the page is well-formed (X)HTML, but you could pass it through Tidy first.
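The answer above does not cover the question's other requirement, restricting results to a path such as domain.org/folder/. That can be handled by appending one more grep stage to the same pipeline. The sketch below is an illustration, not part of the accepted answer: the sample HTML is made up, and `domain.org/folder/` is the question's own placeholder prefix; `sort -u` is used as shorthand for `sort | uniq`.

```shell
# Sample page to demonstrate on (stands in for the question's file.html).
cat > file.html <<'EOF'
<a href="http://domain.org/folder/a.html">A</a>
<a href="#top">back to top</a>
<a href="http://other.org/x">elsewhere</a>
<a href="http://domain.org/folder/b.html">B</a>
EOF

# Extract href values, drop pure-anchor links (href="#..." never matches,
# because the character class excludes both " and #), de-duplicate,
# then keep only URLs under the wanted prefix.
grep -o -E 'href="([^"#]+)"' file.html \
  | cut -d'"' -f2 \
  | sort -u \
  | grep -E '^(https?://)?domain\.org/folder/'
# Prints:
#   http://domain.org/folder/a.html
#   http://domain.org/folder/b.html
```

Note that the `[^"#]+` class also discards links whose URL merely contains a fragment (e.g. page.html#section) rather than trimming the fragment off, which matches the behavior of the original answer.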

