How to grep for URLs in a blog?


Question

I'm writing a script to grab the URLs from my blog posts and run curl -I over them so I can check that they are still good. However, I am having trouble writing the grep pattern.

<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>

So here I want just http://example.com/fujipol/2004/may/5/16:10:47/400x345.

Or in Markdown, like:

[Example markdown link](https://example.com)

Here I want https://example.com

<http://example.com/?foo=bar>

In this case I need http://example.com/?foo=bar

Recommended answer

I created a file with the links from your examples:

$> cat ./text
<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
[Example markdown link](https://example.com)
<http://example.com/?foo=bar>
<a href="http://people.debian.org/~dilinger/backports/wordpress">http://people.debian.org/~dilinger/backports/wordpress</a>

I grepped it with a regular expression and got all the URLs out of it:

$> grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*" ./text
http://example.com/fujipol/2004/may/5/16:10:47/400x345
https://example.com
http://example.com/?foo=bar
http://people.debian.org/~dilinger/backports/wordpress
http://people.debian.org/~dilinger/backports/wordpress

Done.
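Since the goal in the question is to feed these URLs to curl -I, the extraction step can be wired into a small checker. The following is only a sketch under some assumptions: extract_urls and check_urls are hypothetical helper names, the pattern is the answer's with the unnecessary backslashes dropped, grep -E is used because nothing in the pattern is Perl-specific, and sort -u removes repeats such as the Debian URL that appears twice in the output above (once in the href and once as the link text).

```shell
#!/usr/bin/env bash
# Sketch: extract URLs from post files, drop duplicates, then probe each one.
# extract_urls and check_urls are hypothetical helper names, not from the answer.

# Print every http(s) URL found in the given files, one per line, de-duplicated.
extract_urls() {
  # -o prints only the match, -h hides file names when several files are given.
  grep -ohE 'https?://[^ "()<>]*' "$@" | LC_ALL=C sort -u
}

# Read URLs from stdin and print "<status> <url>" for each one.
check_urls() {
  local url status
  while IFS= read -r url; do
    # -I sends a HEAD request; -w prints only the HTTP status code.
    # curl reports 000 when the request fails before a response arrives.
    status=$(curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$url")
    printf '%s %s\n' "$status" "$url"
  done
}
```

With the file from this answer, a run would look like `extract_urls ./text | check_urls`. The 10-second timeout is an arbitrary choice, and some servers answer HEAD requests differently from GET, so an unexpected status is worth a second look before declaring the link dead.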

http(s?):\/\/[^ \"\(\)\<\>]*

What we've done here is match http followed by an optional s (a URL can start with http:// or https://), then the :// separator with its slashes escaped. Finally, we match a sequence of characters that are not a space, ", (, ), < or >.
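To make the three parts of the pattern easier to see, here is a minimal sketch that annotates each piece and runs it against the lines from the question. It assumes that the backslashes before /, (, ), < and > can be dropped (none of those characters is special at those positions) and that grep -E is enough; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: the answer's pattern with the redundant backslashes removed.
#   https?       "http" followed by an optional "s"
#   ://          the literal separator; "/" needs no escaping in grep
#   [^ "()<>]*   any run of characters up to a space, quote, paren or angle bracket
pattern='https?://[^ "()<>]*'

# Sample input mirroring the lines from the question.
sample='<p><a href="http://example.com/fujipol/2004/may/5/16:10:47/400x345">foobar</a></p>
[Example markdown link](https://example.com)
<http://example.com/?foo=bar>'

# -o prints only the matched text, -E selects extended regular expressions.
urls=$(printf '%s\n' "$sample" | grep -oE "$pattern")
printf '%s\n' "$urls"
```

Keeping the pattern in single quotes avoids having to escape the double quote for the shell, which is where most of the backslashes in the original command come from.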

In the end, the whole problem in tasks like this comes down to deciding how the section we need starts (http(s):// in this case) and how it ends (a space, ", (, ), < or >).

Frankly speaking, this solution is not perfect. The URL standards say much more about which characters a URL can and cannot contain, so you will soon find that the regex used in my answer is not strictly valid. But for the cases you described it works well.
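As an illustration of where the pattern falls short, the sketch below feeds it two inputs that are not among the question's examples: a URL at the end of a sentence and a single-quoted href. Both the sample line and the "strict" variant are assumptions added here for demonstration, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: two inputs the answer's character class does not handle cleanly.
sample="See http://example.com/page. Also <a href='http://example.com/q'>x</a>"

# The answer's character class keeps the trailing "." and the closing "'".
loose=$(printf '%s\n' "$sample" | grep -oE 'https?://[^ "()<>]*')

# Assumed refinement: also stop at a single quote, then trim trailing punctuation.
strict=$(printf '%s\n' "$sample" \
  | grep -oE "https?://[^ \"'()<>]*" \
  | sed -E 's/[.,;:!?]+$//')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Trimming trailing punctuation is itself a heuristic: a URL that legitimately ends in one of those characters would be cut short, so whether to apply it depends on how the posts are written.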
