Extract Relevant Tags/Keywords from a Text Block

Question

I want an implementation in which the user provides a block of text like:

"Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable."

What I want to do is automatically select relevant keywords and create tags/keywords, hence for the above piece of text, relevant tags should be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing this in PHP, JavaScript, etc.? A head start would be really helpful.

Recommended answer

A very naive method is to remove common stop words from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise, however, so you may want to consider a service like OpenCalais, which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to implementations, but you asked for one, so here is a simple one:

function stopWords($text, $stopwords) {

  // Trim whitespace (including line breaks) from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace every digit and non-word character with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",", $text);

  // Remove whitespace and lowercase the words in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

  // Keep only the terms that are not stop words
  $keywords = array();
  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // Drop the empty strings left between adjacent separators (keys are preserved)
  return array_filter($keywords);
}

// file() returns one array element per line of stop_words.txt
$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this code, along with the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (it is also slow), but it does pull out the more relevant keywords from your text. You would need to fine-tune the stop words as well. Capturing terms like Web 2.0 will be very difficult, so again I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.
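If you already know some of the multi-word or punctuated tags you care about, one partial workaround is to look for them in the raw text before it is split into words. The JavaScript sketch below is only an illustration of that idea, not part of the code above: `KNOWN_PHRASES` and `findKnownPhrases` are hypothetical names, and the two phrases in the list are assumptions taken from the example text.

```javascript
// Hypothetical list of known phrases, each mapped to the tag to emit.
// A real list would have to be curated for your own domain.
const KNOWN_PHRASES = new Map([
  ["web 2.0", "web2.0"],
  ["version control", "version control"],
]);

// Return the tags whose phrase occurs in the text, ignoring case
// and differences in the amount of whitespace between words.
function findKnownPhrases(text, phrases = KNOWN_PHRASES) {
  const haystack = text.toLowerCase().replace(/\s+/g, " ");
  const tags = [];
  for (const [phrase, tag] of phrases) {
    if (haystack.includes(phrase)) {
      tags.push(tag);
    }
  }
  return tags;
}

console.log(findKnownPhrases("Knowledge of Web 2.0 Standards and Version Control Software"));
```

This only finds phrases that are already in the list, and plain substring matching can also hit inside longer words, so it complements the stop word filter rather than replacing a service that actually analyses the text.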

Also, for a client-side implementation you could do pretty much the same thing with JavaScript, and probably much more cleanly (although it could be slow for the client).
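As a rough idea of what that could look like, here is a minimal JavaScript sketch of the same stop word filter. It is an assumption-laden illustration rather than a port of the PHP above: `STOP_WORDS` is a hypothetical, deliberately tiny list, `extractKeywords` is a made-up name, and unlike the PHP version it also removes duplicate words.

```javascript
// Hypothetical, deliberately tiny stop word list; a real one would be much longer.
const STOP_WORDS = new Set([
  "a", "and", "as", "be", "of", "on", "such", "using", "will", "with",
]);

// Split text into lowercase words, then drop stop words and duplicates.
function extractKeywords(text, stopWords = STOP_WORDS) {
  const words = text
    .toLowerCase()
    .split(/[^a-z]+/)                   // digits and punctuation act as separators
    .filter((word) => word.length > 0); // discard empty pieces at the edges

  const keywords = new Set();
  for (const word of words) {
    if (!stopWords.has(word)) {
      keywords.add(word);               // a Set keeps first-seen order and removes repeats
    }
  }
  return [...keywords];
}

console.log(extractKeywords("Comfortable with JSON and JQuery, using PHP 5 on Linux"));
// prints [ 'comfortable', 'json', 'jquery', 'php', 'linux' ]
```

Keeping the stop words in a `Set` makes each lookup a constant-time check instead of a scan over the whole list, which helps with the speed concern when the list is long.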
