Extract text from … tag or directly from an HTML file


Question

I have an HTML page that contains some filenames that I want to download from a web server. I need to read these filenames to build a list that will be passed to my web application, which downloads the files from the server. These filenames have some extension.

I have dug into this topic but haven't found anything except:

1. Regex can't be used to parse HTML.
2. Use HTML Agility Pack.

Is there no other way to search an HTML file for text that matches a pattern like filename.ext?

Sample HTML that contains a filename:

 <p class=3DMsoNormal style=3D'margin-removed0in;margin-removed0in;margin-bottom=:0in; margin-removed1.5in;margin-removed.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'>
 <![if !supportLists]> 
   <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'>
        <span style=3D'mso-list:Ignore'>1.
               <span style=3D'font:7.0pt "Times New Roman"'>
               </span>
        </span>
   </span>
 <![endif]>
   <span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>13572_PostAccountingReport_2009-06-03.acc
     <o:p> </o:p>
   </span>
</p>

I can't use HTML Agility Pack because I'm not allowed to download or use any application or tool.

Can't this be achieved by some other logic?

Thanks in advance,

Akhil

Recommended answer

The easiest situation would be if you could parse your HTML as XML, because then you could use any of the XML parsers bundled with the .NET Framework. The problem is that HTML is not necessarily well-formed XML. In that case, you would need an HTML parser that can deal with it, such as this one:
http://www.majestic12.co.uk/projects/html_parser.php
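
To illustrate the idea of trying a strict XML parse first and falling back to a lenient HTML parser, here is a minimal sketch. It uses Python's standard library purely for illustration, since the question targets .NET. The names `TextCollector` and `text_nodes` and the two sample strings are hypothetical. In .NET, the first attempt would go through one of the framework's own XML parsers instead.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_nodes(markup):
    """Return text nodes, trying strict XML first, then lenient HTML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Not well-formed XML: fall back to the forgiving HTML parser.
        collector = TextCollector()
        collector.feed(markup)
        collector.close()
        return collector.chunks
    return [t.strip() for t in root.itertext() if t.strip()]


well_formed = "<p><span>13572_PostAccountingReport_2009-06-03.acc</span></p>"
# Unclosed <span> tags and Office conditional markers: not well-formed XML.
messy = "<p><![if !supportLists]><span>1.<![endif]><span>report.acc</p>"

print(text_nodes(well_formed))
print(text_nodes(messy))
```

Either way, the result is a plain list of text fragments, which can then be filtered for anything that looks like filename.ext.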

If the HTML context in which your file names appear takes a very regular form, you could still use Regex. After all, you don't need fully fledged parsing if you only need to find a file name.
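
As a rough sketch of the Regex route, assume the file names always appear as element text and the extensions are known in advance (here `.acc`, taken from the sample in the question). The function name `find_filenames` and the pattern itself are illustrative assumptions, and the code is Python only for brevity. The same pattern should carry over to .NET's `Regex` class with little change.

```python
import re


def find_filenames(html, extensions=("acc",)):
    """Return filenames that appear as element text and end in a known extension."""
    ext_group = "|".join(re.escape(ext) for ext in extensions)
    # '>' and '<' anchor the match to element text, so attribute values
    # such as style='...' are never considered.
    pattern = re.compile(r">\s*([\w.-]+\.(?:%s))\s*<" % ext_group, re.IGNORECASE)
    return pattern.findall(html)


sample = """<span style=3D'font-family:"Times New Roman","serif"'>13572_PostAccountingReport_2009-06-03.acc
  <o:p> </o:p>
</span>"""

print(find_filenames(sample))
```

Anchoring on `>` and `<` keeps the match inside element text, so the numbers in the style attributes are ignored. The trade-off is that a name split across nested tags, or one surrounded by other text in the same element, would be missed.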

—SA

