Proposed NLP algorithm for text tagging


Problem description

I was looking for an open-source tool that can identify tags for any user post on social media and detect off-topic or spam comments on that post. Even after searching for an entire day, I could not find a suitable tool or library.

So here I propose my own algorithm for tagging user posts that belong to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents).

Initially, when a user makes a post, he tags it himself. Tags can be things like marketing, suggestion, entrepreneurship, MNC, etc. So assume that for some posts I already have the tags and the category they belong to.

Steps:

  1. Perform POS (part-of-speech) tagging on the user post. Two things can be done here:

  • Consider only nouns. I guess nouns may represent the tag for a post more intuitively.

  • Consider both nouns and adjectives. This way we collect a larger number of nouns and adjectives, and the frequency of such words can be used to identify the tag for the post.

For each user-defined tag, we collect the POS words (nouns and adjectives) from the posts that carry that tag. For example, take the user-assigned tag marketing, whose posts contain the POS words SEO and adwords. Suppose that across 10 posts tagged marketing, SEO and adwords occur 5 and 7 times respectively. The next time a post arrives that has no tag but contains the word SEO, we check which tag SEO has occurred under most often; if that is marketing, we predict the marketing tag for this post.
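The frequency lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: POS tagging is taken as already done, so each post arrives as a list of extracted nouns/adjectives, and the names `build_tag_counts` and `predict_tag` as well as the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def build_tag_counts(tagged_posts):
    """Count how often each candidate word appears under each tag.

    tagged_posts: iterable of (tag, words) pairs, where words are the
    nouns/adjectives already extracted from a post (POS tagging not shown).
    """
    counts = defaultdict(Counter)
    for tag, words in tagged_posts:
        counts[tag].update(w.lower() for w in words)
    return counts

def predict_tag(counts, words):
    """Return the tag whose stored counts best match the words, or None."""
    scores = {
        tag: sum(tag_counts[w.lower()] for w in words)
        for tag, tag_counts in counts.items()
    }
    best = max(scores, key=scores.get, default=None)
    # A best score of 0 means no known word matched, so make no prediction.
    return best if best is not None and scores[best] > 0 else None

# Hypothetical training data: (user-assigned tag, extracted words)
training = [
    ("marketing", ["SEO", "adwords", "campaign"]),
    ("marketing", ["adwords", "budget"]),
    ("hiring", ["interview", "resume"]),
]
counts = build_tag_counts(training)
print(predict_tag(counts, ["SEO", "tips"]))  # marketing
```

Summing raw counts favours tags with many posts; normalising by the number of posts per tag would be one way to reduce that bias.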

The next step is to identify spam or off-topic comments on a post. Consider a user post in the Job category that carries the tag marketing. I will look up in the database the 10-15 most frequent POS words (i.e. nouns and adjectives) for marketing.

In parallel, I have the POS tags for the comment. I will check whether the POS words (nouns and adjectives) of the comment include any of the most frequent words belonging to marketing (we could consider 15-20 such words).

If the POS words in the comment do not match any of the most frequent POS words for marketing, the comment can be considered off-topic/spam.
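A minimal sketch of this off-topic check, assuming the per-tag word counts are already available as a `Counter` and the comment has already been reduced to its nouns and adjectives. The helper names, the cutoff `k=15`, and the sample counts are illustrative assumptions.

```python
from collections import Counter

def top_words(tag_counts, k=15):
    """Return the k most frequent words recorded for a tag as a set."""
    return {word for word, _ in tag_counts.most_common(k)}

def is_off_topic(comment_words, tag_counts, k=15):
    """Flag a comment that shares no word with the tag's top-k words."""
    comment = {w.lower() for w in comment_words}
    return comment.isdisjoint(top_words(tag_counts, k))

# Hypothetical counts for the tag "marketing"
marketing_counts = Counter({"seo": 5, "adwords": 7, "campaign": 3})

print(is_off_topic(["cheap", "shoes"], marketing_counts))   # True
print(is_off_topic(["SEO", "question"], marketing_counts))  # False
```

Because one shared word is enough to pass, a check like this will let through spam that happens to mention a frequent word and will flag short on-topic replies that do not.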

Do you have any suggestions to make this algorithm more intuitive?

I guess SVM could help with the classification; any suggestions on that?

Apart from this, is there any machine learning technique that could help the system learn to predict tags and spam (off-topic) comments?

Recommended answer

The main problem, as I see it, is with your feature modeling. While picking out only nouns would help reduce the feature space, it is an extra step with a potentially significant error rate. And do you really care whether you are looking at market/N and not market/V?

Most mainline text classification implementations using naive Bayes classifiers just ignore the POS and simply count each distinct word form as an independent feature. (You could also do brute-force stemming to reduce market, markets, and marketing to a single stem form and thus a single feature. This tends to work in English, but might not be adequate if you are actually working in a different language.)
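To make the word-form-as-feature idea concrete, here is a small multinomial naive Bayes sketch in plain Python. It is an illustration, not a reference implementation: whitespace tokenization, add-one smoothing, the `NaiveBayes` class name, and the training sentences are all assumptions, and no stemming is applied.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over word forms with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of posts
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        # Words never seen in training carry no information and are skipped.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / total)
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Hypothetical training posts
nb = NaiveBayes()
nb.train("jobs", "hiring marketing manager salary")
nb.train("jobs", "developer position salary negotiable")
nb.train("events", "marketing conference tickets on sale")
nb.train("events", "meetup tonight free tickets")

print(nb.classify("salary for developer position"))  # jobs
```

The per-word terms in `log_scores` are what make the result easy to inspect: printing them shows exactly which words pushed a post toward a category.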

A compromise could be to do POS filtering when you train your classifier. Word forms that do not have a noun reading then end up with a zero score in the classifier, so you don't have to do anything to filter them out when you use the resulting classifier.
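A rough sketch of that compromise, with loud assumptions: the `NOUNS` set is a hand-made stand-in for a real POS tagger, the scorer uses simple per-tag counts instead of a full classifier, and all names and data are hypothetical. The point is only that filtering happens in `train`, while `score` works on raw text.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a real POS tagger: a hand-made noun lexicon.
NOUNS = {"seo", "adwords", "campaign", "salary", "interview", "market"}

def train(tagged_posts, keep=lambda w: w in NOUNS):
    """Build per-tag word counts, keeping only words that pass the POS filter."""
    weights = defaultdict(Counter)
    for tag, text in tagged_posts:
        weights[tag].update(w for w in text.lower().split() if keep(w))
    return weights

def score(weights, text):
    """Score raw text per tag; words filtered out in training contribute 0."""
    words = text.lower().split()
    return {tag: sum(c[w] for w in words) for tag, c in weights.items()}

weights = train([
    ("marketing", "great seo campaign and cheap adwords"),
    ("jobs", "salary discussed during interview"),
])

# No POS tagging here: "great" and "cheap" were never stored, so they score 0.
print(score(weights, "great cheap seo"))  # {'marketing': 1, 'jobs': 0}
```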

Empirically, SVM tends to achieve high accuracy, but it comes at the cost of complexity, both in implementation and in behavior. A naive Bayes classifier has the distinct advantage that you can understand precisely how it arrived at a particular conclusion. (Well, most of us mortals cannot claim to have the same grasp of the mathematics behind SVM.) Perhaps a good way to proceed would be to prototype with Bayes and iron out any kinks while learning how the system as a whole behaves, then maybe later consider switching to SVM once the other parts are stable?

The "spam" category is going to be harder than any well-defined content category. It would be tempting to suggest that anything which doesn't fit any of your content categories is off-topic, but if you are going to use the verdict for automatic spam filtering, this is likely to cause some false positives, at least in the early stages. A possible alternative could be to train classifiers for particular spam categories: one for medications, another for running shoes, etc.
