Keywords from a Web Page


Question

How can I generate keywords and their counts from a web page using C#?

I have loaded the web page into a string using HtmlAgilityPack, then split it into words and stored them in an ArrayList.

Now I want to filter the keywords by removing the duplicates, keeping each keyword's count alongside it.

My code:

// Uses HtmlAgilityPack
var webGet = new HtmlWeb();
var doc = webGet.Load(url);

HtmlNode bodyContent = doc.DocumentNode.SelectSingleNode("/html/body");

if (bodyContent != null)
{
    pmd.Html = stripHtml(bodyContent.InnerHtml);
}

string wordsOnly = pmd.Html;

// Split on spaces, then trim surrounding punctuation from each word
string[] arrayWordsOnly = wordsOnly.Split(' ');
char[] spChar = new char[] { '?', '\"', ',', '\'', ';', ':', '.', '(', ')', '!' };

foreach (string word in arrayWordsOnly)
{
    key = word.Trim(spChar).ToLower();
}

protected string stripHtml(string strHtml)
{
    // Strips the HTML tags from strHtml
    Regex objRegExp = new Regex("<(.|\n)+?>");
    // Replace all HTML tag matches with the empty string
    string strOutput = objRegExp.Replace(strHtml, "");
    // Decode the escaped angle brackets left in the text
    strOutput = strOutput.Replace("&lt;", "<");
    strOutput = strOutput.Replace("&gt;", ">");
    return strOutput;
}

Recommended answer

Firstly, you mention using an ArrayList; this is no longer recommended. You should probably use a List<string> instead (see the MSDN page).

Assuming you do this, something like the following should do the trick:
List<string> uniqueWords = new List<string>();
foreach (string word in arrayWordsOnly)
{
   key = word.Trim(spChar).ToLower();
   if (!uniqueWords.Contains(key))
   {
      uniqueWords.Add(key);
   }
}



If you are determined to use ArrayList, simply replace each occurrence of List<string> with ArrayList.
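
The loop in the answer collects the unique words but does not keep the counts the question asks for. One possible way to get both is to use a Dictionary<string, int> keyed by the cleaned word. The following is only a minimal sketch, not part of the original answer: the class name KeywordCounter, the method CountKeywords, and the sample sentence are hypothetical, and it assumes the page text has already been stripped of HTML as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class KeywordCounter
{
    // Same punctuation set as spChar in the question.
    private static readonly char[] SpChar =
        { '?', '"', ',', '\'', ';', ':', '.', '(', ')', '!' };

    // Returns each distinct lower-cased word together with how often it occurs.
    public static Dictionary<string, int> CountKeywords(string text)
    {
        var counts = new Dictionary<string, int>();
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string key = word.Trim(SpChar).ToLowerInvariant();
            if (key.Length == 0)
            {
                continue; // the token was punctuation only
            }

            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }

        return counts;
    }

    public static void Main()
    {
        // Assumed sample input standing in for the stripped page text.
        var counts = CountKeywords("The cat saw the dog. The dog ran!");

        // List the most frequent keywords first.
        foreach (var pair in counts.OrderByDescending(p => p.Value).ThenBy(p => p.Key))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

A dictionary lookup stays fast as the word list grows, whereas List<string>.Contains scans the whole list for every word, so this approach should scale better on long pages.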

