Keywords from a Web Page
Question
How can I generate keywords and their counts from a web page using C#?
I have loaded the web page into a string using HtmlAgilityPack, split it into words, and stored the words in an ArrayList.
Now I need to filter the keywords by removing duplicates and recording each keyword's count alongside it.
My code:
//Uses HtmlAgilityPack
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
HtmlNode bodyContent = doc.DocumentNode.SelectSingleNode("/html/body");
if (bodyContent != null)
{
    pmd.Html = stripHtml(bodyContent.InnerHtml.ToString());
}
string wordsOnly = pmd.Html;
string[] arrayWordsOnly = wordsOnly.Split(' ');
char[] spChar = new char[] { '?', '\"', ',', '\'', ';', ':', '.', '(', ')', '!' };
foreach (string word in arrayWordsOnly)
{
    key = word.Trim(spChar).ToLower();
}
protected string stripHtml(string strHtml)
{
    //Strips the HTML tags from strHTML
    Regex objRegExp = new Regex("<(.|\n)+?>");
    string strOutput;
    //Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHtml, "");
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    objRegExp = null;
    return strOutput;
}
Recommended answer
Firstly, you talk about using an ArrayList; this is no longer recommended. You should probably use a List<string> instead.
Assuming you do, something like the following should do the trick:
List<string> uniqueWords = new List<string>();
foreach (string word in arrayWordsOnly)
{
    key = word.Trim(spChar).ToLower();
    if (!uniqueWords.Contains(key))
    {
        uniqueWords.Add(key);
    }
}
If you are determined to use ArrayList, simply replace each occurrence of List<string> with ArrayList.
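The loop above removes duplicates but does not record how often each word occurs, which the question also asks for. One possible way to get both is to key a Dictionary<string, int> on the trimmed, lower-cased word. The following is only a sketch: the class name KeywordCounter and the method name CountWords are made up for illustration, and it assumes the text has already been stripped of HTML tags as in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question or answer.
public static class KeywordCounter
{
    // Same punctuation set the question trims from each word.
    private static readonly char[] SpChar =
        { '?', '"', ',', '\'', ';', ':', '.', '(', ')', '!' };

    // Maps each distinct lower-cased word to its number of occurrences.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();

        // A null separator splits on any whitespace; empty entries
        // produced by repeated spaces or line breaks are dropped.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            string key = word.Trim(SpChar).ToLowerInvariant();
            if (key.Length == 0)
            {
                continue; // the token was punctuation only
            }

            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }

        return counts;
    }
}
```

For example, KeywordCounter.CountWords("The cat saw the dog. The end!") yields a count of 3 for "the" and 1 for each of the other words. A dictionary lookup also avoids the linear scan that List<string>.Contains performs for every word, which may matter on long pages.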