Unix - parse an HTML file and list all of its resources


Problem description

I have an HTML file and I need to generate a list of all the resources it uses: *.htm, *.html, *.css, *.js, *.jpg

I tried many options with grep and sed, without much success. I am also not sure how to do it in Java.

Here is the sample file content:

--------------------------------

<link rel="StyleSheet" href="css/webworks.css" type="text/css"
media="all" />
<script type="text/javascript" language="JavaScript1.2" src="wwhdata/common/context.js">
</script>
<a class="WebWorks_Breadcrumb_Link" href="Page1.htm#1110364">Job Status</a> &gt;  Jobs tatus</div>
<div class="Indented"><a name="1115395">The <img class="Default"  src="images/Pic.2.jpg" width="26" height="29" style="display: inline;
float: none; left: 0.0; top: 0.0;" alt="" /> icon indicates that the
job is recurring. Hover the mouse over the icon to display the
schedule.</a></div>
<div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page4.htm#1110375', '');"
title="fsafsa" name="1118038">abcde</a></div>
<div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2',   'Page2.htm#1110547', '');"
title="fsafsa" name="1118063">fsafsa</a></div>
<div class="Body_Help_only"><a href="javascript:WWHClickedPopup('HelpSR2', 'Page3.htm#1110472', '');"
title="fsafasb" name="1118082">fsafsa</a></div>

The output should be:

-----------------
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm

Recommended answer

The following should get you some of the way:

% sed -n -E 's/.*(href|src)="([^#"]*).*/\2/p' input.html

The -n means don't print lines by default; the -E means use extended regular expressions (so we can use the vertical bar for alternation); the trailing p on the substitution means print any line on which the substitution succeeded. Together, this finds every line containing href=" or src=", replaces the entire line with the quoted value (up to the closing " or a #, whichever comes first), and prints the result.

On your input, this produces:

css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
javascript:WWHClickedPopup('HelpSR2',   'Page4.htm
javascript:WWHClickedPopup('HelpSR2',   'Page2.htm
javascript:WWHClickedPopup('HelpSR2', 'Page3.htm

Limitations of this simple version:

  • it won't work if there's more than one href or src on a line;
  • it fails to extract the contents of the JavaScript argument;
  • it presumes that the input uses "..." rather than '...' to delimit file names.
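
As a rough illustration of how the first and third limitations might be tackled, here is a minimal sketch that goes outside sed alone. It assumes a grep that supports -o (GNU and BSD grep do, but POSIX does not require it), and list_attr_values is a hypothetical helper name, not something from the original answer. It leaves javascript: links as they are.

```shell
# list_attr_values: hypothetical helper, one attribute value per output line.
# Assumes grep supports -o (a common extension, not required by POSIX).
list_attr_values() {
    # -o prints every match on its own line, so two attributes on one
    # input line yield two output lines. Both quote styles are accepted.
    grep -oE "(href|src)=(\"[^\"]*\"|'[^']*')" "$1" |
        sed -E \
            -e "s/^(href|src)=[\"'](.*)[\"']\$/\\2/" \
            -e 's/#.*$//'
    # First expression: drop the attribute name and surrounding quotes.
    # Second expression: drop a #fragment, if present.
}
```

It would be called as list_attr_values input.html. The pattern is deliberately loose: it would also match attribute names that merely end in href or src, which a real HTML parser would not confuse.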

Each of these could probably be improved by suitable additions to the sed script, though the second would probably be best done by sending the output through another sed script or...

% cat /tmp/t.sed
s/.*(href|src)="([^#"]*).*/\2/
s/javascript.*'//
t x
b
:x
p
% sed -n -E -f /tmp/t.sed /tmp/so.txt
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm
%

That last one's a little bit special! I'll leave you and the manpage to work out the details.
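
For readers who would like a starting point before opening the manpage, the following is one possible commented reading of that script, wrapped in a shell function. The name annotated_extract is hypothetical, and the comments reflect the usual documented behaviour of the t and b commands rather than anything stated in the answer itself.

```shell
# annotated_extract: hypothetical wrapper around the same six sed commands.
annotated_extract() {
    # 1. s/.*(href|src)="([^#"]*).*/\2/  keep only the attribute value, up to any '#'
    # 2. s/javascript.*'//               strip everything from "javascript" to the last quote
    # 3. t x   branch to label x if a substitution succeeded on this line
    # 4. b     otherwise branch to the end of the script (-n means nothing is printed)
    # 5. :x    the label that "t x" jumps to
    # 6. p     print the pattern space
    sed -n -E \
        -e 's/.*(href|src)="([^#"]*).*/\2/' \
        -e "s/javascript.*'//" \
        -e 't x' -e 'b' -e ':x' -e 'p' \
        "$1"
}
```

Calling annotated_extract /tmp/so.txt should behave like the -f /tmp/t.sed invocation, since each -e expression is treated as its own script line.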
