Extract text between HTML tags


Question

I have many HTML files from which I need to extract text. If it's all on one line, I can do that quite easily, but if the tag wraps around or spans multiple lines I can't figure out how to do this. Here's what I mean:

<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>

I'm not concerned about the <br> text, unless it will help wrap the text around. The area that I want always begins with "MySection" and then is ended with </section>. What I'd like to end up with is something like this:

Some text here  another line here  last line of text.

I'd prefer something like a vbscript or command line option (sed?) but I'm not sure where to begin. Any help?

Solution

Normally you'd use the Internet Explorer COM object for this:

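root = "C:\base\dir"

' The loop below needs a FileSystemObject to enumerate the files.
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie  = CreateObject("InternetExplorer.Application")

For Each f In fso.GetFolder(root).Files
  ie.Navigate "file:///" & f.Path
  ' Wait until the page has finished loading.
  While ie.Busy : WScript.Sleep 100 : Wend

  text = ie.document.getElementById("MySection").innerText

  ' Join the lines with a space so that adjacent words do not run together.
  WScript.Echo Replace(text, vbNewLine, " ")
Next

' Close the hidden browser instance when done.
ie.Quit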

However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, as getElementById("MySection") only returns the opening tag:

>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>

You could use a regular expression instead, though:

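root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")

' Captures everything between the opening and closing section tags.
' [\s\S] also matches line breaks, which "." does not.
Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global  = False
re1.IgnoreCase = True

' Matches runs of <br> tags and whitespace (including line breaks).
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global  = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll

  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' m(0) is the first match; SubMatches(0) is the captured group.
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next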
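
Since the question also asks about a command-line option, here is a rough sed-based sketch of the same idea. It is not part of the original answer, and it rests on several assumptions: a POSIX shell with sed and tr is available (on Windows, for instance through Cygwin or WSL), the opening tag is written exactly as <section id="MySection">, the closing </section> sits on a later line than the opening tag, and no tag is split across lines. The function name extract_section is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print the text of the
# MySection block of one HTML file as a single line.
extract_section() {
  text=$(
    # 1. Keep only the lines from the opening tag through the closing tag.
    sed -n '/<section id="MySection">/,/<\/section>/p' "$1" |
      # 2. Delete every tag that starts and ends on the same line.
      sed 's/<[^>]*>//g' |
      # 3. Turn CR, LF and TAB into spaces and squeeze runs of spaces.
      tr -s '\r\n\t ' '    '
  )
  # 4. Trim the single leading and trailing space left by step 3.
  text=${text# }
  text=${text% }
  printf '%s\n' "$text"
}
```

To process a whole directory, something like for f in /some/dir/*.html; do extract_section "$f"; done should do, where /some/dir is a placeholder. Unlike the regular expression above, this version removes all single-line tags inside the section, not only <br>, and it prints an empty line for files that contain no matching section.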
