Keyword analysis in PHP


Question


    For a web application I'm building, I need to analyze a website, retrieve and rank its most important keywords, and display them.

    Getting all words, their density and displaying those is relatively simple, but this gives very skewed results (e.g. stopwords ranking very high).

    Basically, my question is: how can I create a keyword analysis tool in PHP that produces a list correctly ordered by word importance?

    Solution

    Recently, I've been working on this myself, and I'll try to explain what I did as best as possible.

    Steps

    1. Filter text
    2. Split into words
    3. Remove 1- and 2-character words and stopwords
    4. Determine word frequency + density
    5. Determine word prominence
    6. Determine word containers

      1. Title
      2. Meta description
      3. URL
      4. Headings
      5. Meta keywords

    7. Calculate keyword value

    1. Filter text

    The first thing you need to do is make sure the encoding is correct, so convert the text to UTF-8:

    $text = iconv($encoding, 'UTF-8', $file); // $encoding is the current encoding, $file holds the raw page contents
    

    After that, you need to strip all HTML tags, punctuation, symbols and numbers. PHP's strip_tags() handles the tags, and a Unicode-aware preg_replace() can remove everything that is not a letter.
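
A minimal sketch of this filtering step, assuming the input is already UTF-8 and the mbstring extension is loaded. The function name filterText is hypothetical, and the lowercasing at the end is an addition (not in the original answer) so that later stopword comparisons match. Note that strip_tags() keeps the text inside <script> and <style> elements, so on real pages those would need to be removed beforehand.

```php
<?php
// Hypothetical helper: reduce an HTML string to lowercase words separated by single spaces.
function filterText(string $html): string
{
    // Put a space in front of every tag so that "<p>a</p><p>b</p>" does not become "ab".
    $text = str_replace('<', ' <', $html);
    $text = strip_tags($text);

    // Turn entities such as &amp; or &#39; back into characters before filtering them.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Replace every run of characters that is neither a letter nor whitespace
    // (punctuation, symbols, digits) with a space.
    $text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);

    // Collapse whitespace runs into single spaces.
    $text = preg_replace('/\s+/u', ' ', $text);

    return mb_strtolower(trim($text), 'UTF-8');
}
```

For example, filterText('<h1>Keyword Analysis</h1><p>PHP 8 is fun!</p>') yields "keyword analysis php is fun".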

    2. Split into words

    $words = mb_split( ' +', $text );
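
One thing to watch for: if $text starts or ends with a space, mb_split() returns empty strings as the first or last element. A hedged alternative, not what the original answer uses, is preg_split() with the PREG_SPLIT_NO_EMPTY flag. The function name splitIntoWords is hypothetical.

```php
<?php
// Hypothetical helper: split on any whitespace run and drop empty pieces.
function splitIntoWords(string $text): array
{
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example on invalid UTF-8 input).
    return $words === false ? [] : $words;
}
```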
    

    3. Remove 1- and 2-character words and stopwords

    Any word consisting of either 1 or 2 characters won't be of any significance, so we remove all of them.

    To remove stopwords, we first need to detect the language. There are a couple of ways we can do this:

    - Checking the Content-Language HTTP header
    - Checking the lang="" or xml:lang="" attribute
    - Checking the Language and Content-Language metadata tags

    If none of those are set, you can use an external API like the AlchemyAPI.
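
The three checks above could be sketched roughly as follows, assuming the DOM extension is available. Both function names are hypothetical, and $headers is assumed to be an array of response header names mapped to their values, however your HTTP client provides them. Only the primary language subtag (e.g. "en" from "en-US") is returned.

```php
<?php
// Hypothetical helper: "en-US, fr" -> "en".
function primaryLanguageTag(string $value): string
{
    $first = trim(explode(',', $value)[0]);
    return strtolower(explode('-', $first)[0]);
}

// Hypothetical helper: returns the declared language code, or null if nothing is declared.
function detectDeclaredLanguage(string $html, array $headers = []): ?string
{
    // 1. Content-Language HTTP header (header names are case-insensitive).
    foreach ($headers as $name => $value) {
        if (strcasecmp((string) $name, 'Content-Language') === 0 && trim($value) !== '') {
            return primaryLanguageTag($value);
        }
    }

    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // 2. lang or xml:lang attribute on the root element.
    $root = $doc->documentElement;
    if ($root !== null) {
        foreach ($root->attributes as $attribute) {
            $attrName = strtolower($attribute->nodeName);
            if (($attrName === 'lang' || $attrName === 'xml:lang') && trim($attribute->value) !== '') {
                return primaryLanguageTag($attribute->value);
            }
        }
    }

    // 3. <meta http-equiv="Content-Language"> or <meta name="language">.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $key = strtolower($meta->getAttribute('http-equiv') ?: $meta->getAttribute('name'));
        $content = trim($meta->getAttribute('content'));
        if (($key === 'content-language' || $key === 'language') && $content !== '') {
            return primaryLanguageTag($content);
        }
    }

    return null; // nothing declared: fall back to an external detection service
}
```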

    You will need a list of stopwords per language, which can be easily found on the web. I've been using this one: http://www.ranks.nl/resources/stopwords.html
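
As an illustration of step 3, here is a small sketch that drops short words and stopwords in one pass. The function name and the $minLength parameter are hypothetical, and the stopword list is assumed to be a plain array of lowercase words loaded from whatever list you use.

```php
<?php
// Hypothetical helper: keep only words of at least $minLength characters that are not stopwords.
function removeShortAndStopwords(array $words, array $stopwords, int $minLength = 3): array
{
    // Flip the list so each stopword check is a key lookup instead of a linear search.
    $stopwordLookup = array_flip($stopwords);

    $keywords = array_filter(
        $words,
        function (string $word) use ($stopwordLookup, $minLength): bool {
            return mb_strlen($word, 'UTF-8') >= $minLength
                && !isset($stopwordLookup[$word]);
        }
    );

    // array_filter() preserves keys; reindex so $keywords is a plain list again.
    return array_values($keywords);
}
```

Keep the unfiltered $words array around as well, since the density and prominence calculations below still refer to it.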

    4. Determine word frequency + density

    To count the number of occurrences per word, use this:

    $uniqueWords = array_unique ($keywords); // $keywords is the $words array after being filtered as mentioned in step 3
    $uniqueWordCounts = array_count_values ( $words );
    

    Now loop through the $uniqueWords array and calculate the density of each word, where $frequency is that word's count in $uniqueWordCounts:

    $density = $frequency / count ($words) * 100;
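
Put together, the frequency and density loop might look like the sketch below. The function name wordDensities is hypothetical. It follows the snippets above in measuring density against the full $words array rather than the filtered $keywords array.

```php
<?php
// Hypothetical helper: map each unique keyword to its density (percentage of all words).
function wordDensities(array $words, array $keywords): array
{
    $total = count($words);
    if ($total === 0) {
        return []; // avoid division by zero on empty input
    }

    $counts = array_count_values($words);
    $densities = [];

    foreach (array_unique($keywords) as $word) {
        $frequency = $counts[$word] ?? 0;
        $densities[$word] = $frequency / $total * 100;
    }

    return $densities;
}
```

With $words = ['php', 'is', 'php', 'tool'] and $keywords = ['php', 'php', 'tool'], "php" gets a density of 50 and "tool" a density of 25.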
    

    5. Determine word prominence

    The word prominence is defined by the position of the words within the text. For example, the second word in the first sentence is probably more important than the 6th word in the 83rd sentence.

    To calculate it, add this code within the same loop from the previous step:

    $keys = array_keys ($words, $word); // $word is the word we're currently at in the loop
    $positionSum = array_sum ($keys) + count ($keys);
    $prominence = (count ($words) - (($positionSum - 1) / count ($keys))) * (100 /   count ($words));
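
The same formula wrapped in a function, as a sketch. The name wordProminence is hypothetical, $words is assumed to be a zero-indexed list, and the guard against empty input is an addition. A word that appears only as the very first word scores 100, and the score drops as its occurrences move towards the end of the text.

```php
<?php
// Hypothetical helper: prominence of $word within the ordered word list $words (0 to 100).
function wordProminence(array $words, string $word): float
{
    $total = count($words);
    $keys = array_keys($words, $word, true); // zero-based positions of every occurrence

    if ($total === 0 || count($keys) === 0) {
        return 0.0; // word not present: avoid division by zero
    }

    // Adding count($keys) converts the zero-based positions to one-based positions.
    $positionSum = array_sum($keys) + count($keys);

    return ($total - (($positionSum - 1) / count($keys))) * (100 / $total);
}
```

For $words = ['php', 'keyword', 'php', 'tool'], this gives 62.5 for "php", 75 for "keyword" and 25 for "tool".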
    

    6. Determine word containers

    A very important part is to determine where a word resides - in the title, description and more.

    First, you need to grab the title, all metadata tags and all headings using something like DOMDocument or PHPQuery (don't try to use regex!). Then you need to check, within the same loop, whether these contain the word.
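
A rough sketch of this step using DOMDocument. Both function names and the container keys are hypothetical. The check is a simple case-insensitive substring match, so "php" would also match "phpunit"; a stricter tool would compare whole words. DOMDocument::loadHTML() guesses the encoding from the markup, so this assumes the page declares its charset.

```php
<?php
// Hypothetical helper: collect the text of each container the answer lists.
function extractContainers(string $html, string $url): array
{
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $containers = ['title' => '', 'description' => '', 'keywords' => '', 'headings' => '', 'url' => $url];

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $containers['title'] = $titles->item(0)->textContent;
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $containers[$name] = $meta->getAttribute('content');
        }
    }

    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $heading) {
            $containers['headings'] .= ' ' . $heading->textContent;
        }
    }

    return $containers;
}

// Hypothetical helper: names of the containers whose text contains $word.
function containersOf(string $word, array $containers): array
{
    $found = [];
    foreach ($containers as $name => $text) {
        if ($text !== '' && mb_stripos($text, $word, 0, 'UTF-8') !== false) {
            $found[] = $name;
        }
    }
    return $found;
}
```

Call extractContainers() once per page, then containersOf() for each word inside the loop; the resulting array is the $containers value used in the final calculation.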

    7. Calculate keyword value

    The last step is to calculate a keyword's value. To do this, you need to weigh each factor - density, prominence and containers. For example:

    $value = (double) ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count ($containers)));
    

    This calculation is far from perfect, but it should give you decent results.
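
To get from individual values to a ranked list, the formula can be applied to every word and the results sorted. The sketch below assumes a hypothetical $factors array that maps each word to its density, prominence and container list from the previous steps; the function names are also hypothetical.

```php
<?php
// Hypothetical helper: the weighting formula from the answer as a function.
function keywordValue(float $density, float $prominence, int $containerCount): float
{
    return (1 + $density) * ($prominence / 10) * (1 + 0.5 * $containerCount);
}

// Hypothetical helper: score every word and return word => value, highest first.
function rankKeywords(array $factors): array
{
    $values = [];
    foreach ($factors as $word => $f) {
        $values[$word] = keywordValue($f['density'], $f['prominence'], count($f['containers']));
    }

    arsort($values); // sort by value, descending, keeping the word keys
    return $values;
}
```

The weights (dividing prominence by 10, adding 0.5 per container) are the ones from the example formula; they are tuning knobs rather than fixed constants.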

    Conclusion

    I haven't mentioned every single detail of what I used in my tool, but I hope it offers a good view into keyword analysis.

    N.B. Yes, this was inspired by today's blog post about answering your own questions!
