How to remove string between two words
Question
I am downloading web pages using the lines of code below:
WebRequest request = WebRequest.Create(strURL);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Then from here I extract the body part as below:
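// Search the downloaded markup (the html string read above) for the body element.
int nBodyStart = html.IndexOf("<body");
int nBodyEnd = html.LastIndexOf("</body>");
// Add 7, the length of "</body>", so the closing tag is included.
String strBody = html.Substring(nBodyStart, (nBodyEnd - nBodyStart + 7));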
Now I want to remove any JavaScript attached in the body part. How can I do that?
My aim is to get only the contents of the web page. But as each page may take a different approach, I am trying to remove any JS tags and then remove any HTML tags using the RegEx below:
Regex.Replace(strBody, @"<[^>]+>|&nbsp;", "").Trim();
But I don't know how to remove the JS between script tags, as the script may be multi-line or single-line.
Thanks.
Recommended answer
To match script tags (including the inside of the pair), use the following:
<script[^>]*>(.*?)</script>
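Because the script may span several lines, the pattern only works on multi-line scripts when `.` is allowed to match newline characters, which in .NET means passing `RegexOptions.Singleline`; `RegexOptions.IgnoreCase` additionally covers `<SCRIPT>`. A minimal sketch of how this might be applied, assuming a hypothetical helper named `ScriptStripper.RemoveScripts` (the class and method names are illustrative, not from the original answer):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Singleline lets '.' match '\n', so multi-line scripts are matched;
    // IgnoreCase also matches <SCRIPT> ... </SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>(.*?)</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the input with every script element
    // (opening tag, contents, closing tag) removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

The lazy quantifier `.*?` stops at the first `</script>`, so two separate script blocks are removed individually rather than as one span that would swallow the content between them.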
To match all HTML tags (but not the inside of the pair), you can use:
</?[a-z][a-z0-9]*[^<>]*>
I just realised you might also want to remove style tags:
<style[^>]*>(.*?)</style>
Full regular expression string here:
<script[^>]*>(.*?)</script>|<style[^>]*>(.*?)</style>|</?[a-z][a-z0-9]*[^<>]*>|<[^>]+>|&nbsp;
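The pieces above can be combined into one text-extraction step. The following is a rough sketch rather than a definitive implementation: the class `HtmlTextExtractor` and the method `ToPlainText` are hypothetical names, and it swaps two details relative to the single combined pattern. Matches are replaced with a space instead of an empty string so that adjacent words are not glued together, and entities such as `&nbsp;` are decoded with `WebUtility.HtmlDecode` instead of being matched by the regex.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Script and style elements are removed together with their contents.
    // Singleline lets '.' span line breaks; IgnoreCase covers upper-case tags.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<script[^>]*>.*?</script>|<style[^>]*>.*?</style>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any remaining tag is removed, but the text between tags is kept.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Runs of whitespace (including the no-break space) collapse to one space.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: reduces an HTML fragment to its visible text.
    public static string ToPlainText(string html)
    {
        string text = ScriptOrStyle.Replace(html, " ");
        text = AnyTag.Replace(text, " ");
        text = WebUtility.HtmlDecode(text); // turns &nbsp;, &amp;, ... into characters
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

With the `strBody` value from the question, the call would be `string content = HtmlTextExtractor.ToPlainText(strBody);`. Regular expressions handle well-formed pages reasonably, but markup such as a `>` inside an attribute value or a `</script>` inside a string literal can defeat them, in which case an HTML parser is the more robust choice.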