Extract Content from <div class=" "> </div> Tag C# RegEx
Problem description
I have the following code:
string tag = "div";
string pattern = string.Format(@"\<{0}.*?\>(?<tegData>.+?)\<\/{0}\>", tag.Trim());
Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture);
MatchCollection matches = regex.Matches(data);
and I need to get the content between the <div class="in"> ... </div> tags:
<div class="in">
<a href="/a/show/7184569" class="mm">ВАЗ 2121</a> <span class="for">за</span> <span class="price">2 700 $</span></span><br/><span class="year">1990 г.</span><br/><div style="margin: 3px 0 3px 0">1.6 л, бензин, КПП механика, с пробегом, белый, литые диски, тонировка, спойлер, ветровики, противотуманки, Движок после капитального ремонта!</div><div>
<span style="display:block; padding: 4px 0 0 0;"><span class="region">Костанай</span><span class="adv-phones">, +7 (777) 4464451</span></span>
<small class="gray air">24 просмотра</small>
<small class="gray air">13 июня</small>
</div>
<div class="selectItem" title="Выбрать" id="fv_sic_7184569">
<a href="#" class="fav-button" id="fav_7184569"> </a> </div>
</div>
How can I do it? My code doesn't work.
Here's a regex that might extract simple div tags:
// <div[^>]*>(.+?)</div>
string tag = "div";
string pattern = string.Format(@"<{0}[^>]*>(?<tegData>.+?)</{0}>", tag.Trim());
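A minimal sketch of how this pattern might be run, assuming the page source is already in a string (called data in the question). The class and method names are hypothetical. RegexOptions.Singleline is an addition that is not part of the pattern above: without it, "." does not match newlines in .NET, so content that spans several lines, like the sample, would not be captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DivRegexSketch
{
    // Returns the captured content of every <tag ...>...</tag> pair the pattern finds.
    // Only suitable for simple, non-nested tags.
    public static List<string> ExtractSimple(string data, string tag)
    {
        // Regex.Escape guards against tag names that contain regex metacharacters.
        string pattern = string.Format(@"<{0}[^>]*>(?<tegData>.+?)</{0}>",
                                       Regex.Escape(tag.Trim()));

        // Singleline lets "." match newlines (assumption: the content is multi-line).
        var regex = new Regex(pattern,
                              RegexOptions.ExplicitCapture | RegexOptions.Singleline);

        var results = new List<string>();
        foreach (Match m in regex.Matches(data))
        {
            // Named groups are still captured under ExplicitCapture.
            results.Add(m.Groups["tegData"].Value);
        }
        return results;
    }
}
```

Calling DivRegexSketch.ExtractSimple(data, "div") returns one string per match; each may still contain leading and trailing whitespace from the markup.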
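However, using regular expressions to parse HTML is almost always inappropriate and breaks easily. Markup languages such as HTML are not regular languages, so a pattern like this one cannot reliably keep track of nested tags. With the sample above, even when "." is allowed to match newlines, the non-greedy match for <div class="in"> ends at the first </div>, which closes an inner div rather than the outer block.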
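To make the nesting problem concrete, here is a small sketch that applies the same pattern to a nested fragment. The class name, method name, and test input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivNestingDemo
{
    // Returns the captured content of the first <div>...</div> match, or null if none.
    public static string FirstMatch(string html)
    {
        var regex = new Regex(@"<div[^>]*>(?<tegData>.+?)</div>",
                              RegexOptions.ExplicitCapture | RegexOptions.Singleline);
        Match m = regex.Match(html);

        // The lazy quantifier stops at the FIRST closing tag it can reach,
        // regardless of which opening tag that closing tag belongs to.
        return m.Success ? m.Groups["tegData"].Value : null;
    }
}
```

For the input <div class="in">A<div>B</div>C</div> this returns A<div>B: the match is cut off at the inner closing tag, and the trailing C is lost.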
That being said, you would be much better off using an XML parser to parse the document or fragment and then extracting what you need, provided the markup is well-formed XML. In fact, a forward-only parser would probably even be faster than a regex.
You should look at the XmlReader class in .NET.
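A rough sketch of what that could look like, assuming the fragment is well-formed XML. Real-world HTML often is not (unclosed tags, stray closing tags, entities such as &nbsp;), and XmlReader throws an XmlException on such input. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DivXmlReaderSketch
{
    // Returns the inner markup of each <div> whose class attribute equals cssClass.
    // Assumes well-formed XML; throws XmlException otherwise.
    public static List<string> ExtractByClass(string xml, string cssClass)
    {
        var results = new List<string>();

        // Fragment conformance allows input without a single root element.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };

        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "div"
                    && reader.GetAttribute("class") == cssClass)
                {
                    // ReadInnerXml returns the content, nested elements included,
                    // and leaves the reader positioned after the matching end tag,
                    // so Read() must not be called again in this branch.
                    results.Add(reader.ReadInnerXml());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return results;
    }
}
```

Because the parser tracks nesting itself, DivXmlReaderSketch.ExtractByClass(xml, "in") returns the whole content of the outer div, inner div elements included. The returned markup is re-serialized by the reader, so details such as the spacing of empty elements may differ from the input.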