Parsing out content from HTML using regex?

Views: 478 | Posted: 2018/6/26 21:48:45 | Tags: html, regex


Question

How can I use a regex to match everything except the data inside a div with a specific style? For example:

<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>

I want the regex to match everything except the data. The best way I can see is to remove the HTML markup and then combine the files with VB (I already have the VB code for that).

I'm using regex because I need to extract the data from several hundred files.
Answer

Your suggested method is probably not a good way to do this. If:

you have access to grep

your version of grep supports perl-compatible regex (PCRE)

this style of div only wraps your data, not other elements

the 'data' div does not contain other divs

Then you can use:
(?s)<div style="float:left; padding-top:5px;">.*?</div>
The important parts of this are:

(?s) which activates DOTALL, which means that . will match newlines

.*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
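The difference between the reluctant `.*?` and a greedy `.*` is easiest to see on a file with more than one data div. The following is a minimal sketch that assumes GNU grep built with PCRE support; the sample file, the `workdir` directory, and the `LAZY`/`GREEDY` variable names are made up for illustration.

```shell
# Hypothetical sample: two "data" divs separated by a divider div.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<div style="float:left; padding-top:5px;">
First item
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Second item
</div>
EOF

LAZY='(?s)<div style="float:left; padding-top:5px;">.*?</div>'
GREEDY='(?s)<div style="float:left; padding-top:5px;">.*</div>'

# With -z every printed match ends in a NUL byte, so counting NULs counts matches.
lazy_count=$(grep -Pzo "$LAZY" "$workdir/sample.html" | tr -cd '\0' | wc -c)
greedy_count=$(grep -Pzo "$GREEDY" "$workdir/sample.html" | tr -cd '\0' | wc -c)

echo "lazy matches:   $lazy_count"
echo "greedy matches: $greedy_count"
```

On this sample the reluctant pattern should report two separate matches, while the greedy one should report a single match that runs from the first data div to the last `</div>` in the file and swallows the divider in between.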

To use this, you'll need to activate a few grep options:
grep -Pzo "$PATTERN" file
For these:

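-P activates PCRE

-z makes grep use NUL instead of \n as the line terminator for both input and output, so a file that contains no NUL bytes is treated as a single line (and each printed match ends in a NUL)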

-o prints only the matching parts
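To see why -z matters, the same pattern can be run with and without it on a div that spans several lines. This is a small sketch assuming GNU grep with PCRE support; the sample file and the `line_mode`/`record_mode` variable names are hypothetical.

```shell
# Hypothetical sample: one data div spread over three lines.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<div style="float:left; padding-top:5px;">
Data to keep
</div>
EOF

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# Without -z grep tests one line at a time, and no single line holds the whole div.
if grep -Po "$PATTERN" "$workdir/sample.html" > /dev/null; then
    line_mode="match"
else
    line_mode="no match"
fi

# With -z the file is one NUL-terminated record, so the pattern can span lines.
# The output is NUL-terminated as well; tr turns that terminator into a newline.
record_mode=$(grep -Pzo "$PATTERN" "$workdir/sample.html" | tr '\0' '\n')

echo "line by line: $line_mode"
printf '%s\n' "$record_mode"
```

Piping through `tr '\0' '\n'` is an addition here, not part of the original command; it keeps NUL bytes out of whatever consumes the output next.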

After this you'll need to strip off the divs. sed is a good tool for this.
sed 's|</\?div[^>]*>||g'
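The `\?` in that expression is a GNU sed extension to basic regular expressions, so it may not work with other sed implementations. As a sketch, here is the same substitution applied to a hypothetical matched div, next to an equivalent written in extended syntax with -E, where `?` needs no backslash:

```shell
# Hypothetical matched div, as grep -o would print it.
matched='<div style="float:left; padding-top:5px;">
Data to keep
</div>'

# The command from the answer (GNU sed, basic regular expressions).
stripped_gnu=$(printf '%s\n' "$matched" | sed 's|</\?div[^>]*>||g')

# The same substitution in extended regular expression syntax.
stripped_ere=$(printf '%s\n' "$matched" | sed -E 's|</?div[^>]*>||g')

printf '%s\n' "$stripped_gnu"
```

Both forms remove the opening and closing div tags and leave the text between them; the lines the tags occupied remain as blank lines.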
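If you put all of your files in one directory you can do the joining at the same time. With several input files grep prefixes each match with its file name, so add -h to suppress that:
grep -hPzo "$PATTERN" *.html | sed 's|</\?div[^>]*>||g' > out.html
Note that out.html itself matches *.html, so move or rename it before rerunning the command.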
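Putting the pieces together, a whole-directory run might look like the sketch below. It assumes GNU grep and GNU sed; the `pages` directory, the two generated files, and the output location are invented for the example, and the `tr` and `grep -v` stages are additions that tidy up NUL bytes and blank lines.

```shell
# Hypothetical directory with two pages to merge.
workdir=$(mktemp -d)
mkdir "$workdir/pages"
for n in 1 2; do
    cat > "$workdir/pages/page$n.html" <<EOF
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data from page $n
</div>
EOF
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -h drops the "file:" prefix grep adds when given several files;
# tr turns grep's NUL terminators into newlines;
# the final grep drops the blank lines left where the tags were.
grep -hPzo "$PATTERN" "$workdir"/pages/*.html \
    | tr '\0' '\n' \
    | sed 's|</\?div[^>]*>||g' \
    | grep -v '^[[:space:]]*$' > "$workdir/out.html"

cat "$workdir/out.html"
```

Writing the result outside the input directory keeps it from being picked up by the `*.html` glob on a later run.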
