Parsing out content from HTML using regex?
How can I use a regex to find everything except the data within a div with a specific style? For example:
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
I want the regex to match everything except the data. The best way I can see is to just remove the HTML markup and combine the files afterwards with VB (I already have the VB code).
I'm using regex because I need to extract the data from several hundred files.
Your suggested method is probably not a good way to do this. If:
- you have access to grep
- your version of grep supports Perl-compatible regex (PCRE)
- this style of div only wraps your data, not other elements
- the 'data' div does not contain other divs
Then you can use:
(?s)<div style="float:left; padding-top:5px;">.*?</div>
The important parts of this are:
- (?s), which activates DOTALL, which means that . will match newlines
- .*?, which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds
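To see why the reluctant quantifier matters, here is a small sketch that runs the pattern and a greedy variant against a scratch copy of the markup from the question. It assumes GNU grep built with PCRE support; the scratch file and variable names are made up for illustration.

```shell
# Sketch only: assumes GNU grep with PCRE (-P) support.
sample=$(mktemp)   # hypothetical scratch file holding the question's markup
cat > "$sample" <<'EOF'
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
EOF

reluctant='(?s)<div style="float:left; padding-top:5px;">.*?</div>'
greedy='(?s)<div style="float:left; padding-top:5px;">.*</div>'

# With -z, grep ends each printed match with NUL; tr turns that into a newline.
reluctant_out=$(grep -Pzo "$reluctant" "$sample" | tr '\0' '\n')
greedy_out=$(grep -Pzo "$greedy" "$sample" | tr '\0' '\n')

echo "reluctant match:"; printf '%s\n' "$reluctant_out"
echo "greedy match:";    printf '%s\n' "$greedy_out"
rm -f "$sample"
```

The reluctant version should stop at the </div> that closes the data div, while the greedy one runs on to the last </div> in the file and drags the following divider div along with it.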
To use this, you'll need to activate a few grep options:
grep -Pzo "$PATTERN" file
For these:
- -P activates PCRE
- -z makes grep use NUL instead of \n as the line terminator, so grep will treat the entire file as a single line
- -o prints only the matching parts
After this you'll need to strip off the divs. sed is a good tool for this:
sed 's|</\?div[^>]*>||g'
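As a rough illustration of the stripping step, the sketch below feeds one match, as grep -o would print it, through sed. It swaps in sed -E (extended regex syntax, where ? needs no backslash, since \? is a GNU extension) and adds a second expression to drop the blank lines left where the tags were. Both changes are my assumptions, not part of the command above, and the variable names are hypothetical.

```shell
# Hypothetical input: a single match as printed by grep -o.
match='<div style="float:left; padding-top:5px;">
Data to keep
</div>'

# First expression removes opening and closing div tags;
# second expression deletes lines that are now empty.
stripped=$(printf '%s\n' "$match" | sed -E -e 's|</?div[^>]*>||g' -e '/^$/d')
printf '%s\n' "$stripped"   # prints: Data to keep
```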
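If you put all of your files in one directory you can do the joining at the same time (-h stops grep from prefixing each match with its file name, which it does when given several files):
grep -Pzoh "$PATTERN" *.html | sed 's|</\?div[^>]*>||g' > out.html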
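Putting the pieces together, here is a self-contained sketch of the whole pipeline on two throwaway files. It assumes GNU grep with PCRE support; the temporary directory, the page1.html/page2.html names and the out.txt name are invented for the example. It also pipes through tr, because with -z grep terminates each match with a NUL byte that would otherwise end up in the output file.

```shell
# Sketch only: assumes GNU grep with PCRE (-P) support.
workdir=$(mktemp -d)   # hypothetical scratch directory
for n in 1 2; do
  cat > "$workdir/page$n.html" <<EOF
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data $n
</div>
EOF
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -h drops the "pageN.html:" prefix grep adds when given several files.
# tr converts the NUL terminators written by -z into newlines.
# sed removes the div tags, then deletes the blank lines they leave behind.
grep -Pzoh "$PATTERN" "$workdir"/*.html \
  | tr '\0' '\n' \
  | sed -E -e 's|</?div[^>]*>||g' -e '/^$/d' > "$workdir/out.txt"

cat "$workdir/out.txt"
```

The result is written to a file that does not end in .html, so a second run will not pick up its own output through the *.html glob.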