Categorize Businesses with Text Analytics in Python

Views: 517 | Published: 2020/5/4 10:09:49 | Tags: python, machine-learning, artificial-intelligence, text-mining, text-analysis


Question

I'm new to AI and would like to try the exercise below. Could you suggest a way to do it in Python?

Scenario: I have a list of business areas for some companies, for example:

 1. AI
 2. Artificial Intelligence
 3. VR
 4. Virtual reality
 5. Mobile application
 6. Desktop softwares

and I want to categorize them as follows:

 Technology                 ---> Category
 1. AI                      ---> Category Artificial Intelligence
 2. Artificial Intelligence ---> Category Artificial Intelligence
 3. VR                      ---> Category Virtual Reality
 4. Virtual reality         ---> Category Virtual Reality
 5. Mobile application      ---> Category Application
 6. Desktop softwares       ---> Category Application

That is, when I receive text such as AI or Artificial Intelligence, it should recognize AI and Artificial Intelligence as the same thing and place both keywords under the Artificial Intelligence category.

My current approach uses a lookup table, but I want to apply text classification to the technologies/businesses in the input above using Python, so that I can separate the technologies into categories without relying on the lookup table.
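For context, a lookup table of the kind described above might look like the following minimal sketch. The dictionary name `LOOKUP`, the `categorize` helper, and the "Unknown" fallback are hypothetical and only illustrate the limitation: any spelling that is not listed falls through.

```python
# Hypothetical lookup table: every known spelling maps to one category.
LOOKUP = {
    "ai": "Artificial Intelligence",
    "artificial intelligence": "Artificial Intelligence",
    "vr": "Virtual Reality",
    "virtual reality": "Virtual Reality",
    "mobile application": "Application",
    "desktop softwares": "Application",
}


def categorize(text, default="Unknown"):
    """Return the category for text, ignoring case and surrounding whitespace."""
    return LOOKUP.get(text.strip().lower(), default)


print(categorize("AI"))               # Artificial Intelligence
print(categorize("web application"))  # Unknown (not in the table)
```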

Please suggest any relevant approach.

Recommended answer

Here's one approach using sklearn. In the past I would have used LabelBinarizer(), but it won't work in a pipeline because it no longer accepts X, y as inputs.
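As a small illustration of the point about LabelBinarizer(), the sketch below applies it to the labels on their own, outside any pipeline. It assumes the same six category labels as the question; the variable names are only for illustration. Its fit_transform takes just the labels, which is why it does not slot into a Pipeline step that is called with both X and y.

```python
from sklearn.preprocessing import LabelBinarizer

# Labels from the question, one per business description.
y = ['Artificial Intelligence', 'Artificial Intelligence',
     'Virtual Reality', 'Virtual Reality', 'Application', 'Application']

binarizer = LabelBinarizer()
# fit_transform receives only the labels, not (X, y).
encoded = binarizer.fit_transform(y)

print(binarizer.classes_)  # classes in sorted order
print(encoded.shape)       # one row per label, one column per class
print(encoded[0])          # one-hot row for 'Artificial Intelligence'
```

LogisticRegression accepts string labels directly, so no label encoding step is needed for this task.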

If you are new to this, pipelines can be a bit confusing, but essentially they just process the data in steps before passing it to a classifier. Here, I convert X into an n-gram "matrix" (a table) of word and character tokens, and then pass that to a classifier.
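To make the n-gram "matrix" more concrete, here is a minimal sketch that runs the two vectorizers on their own. The two-item `docs` list and the variable names are assumptions chosen for illustration; the vectorizer settings match the ones used in the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["AI", "Virtual Reality"]

# Word-level unigrams and bigrams; binary=True records presence, not counts.
# The default tokenizer lowercases text and keeps tokens of 2+ characters.
word_vec = CountVectorizer(binary=True, analyzer="word", ngram_range=(1, 2))
word_matrix = word_vec.fit_transform(docs).toarray()
# vocabulary_ maps each n-gram to its column index in the matrix.
word_features = sorted(word_vec.vocabulary_, key=word_vec.vocabulary_.get)

# Character-level n-grams of length 2 to 5, which also match partial words.
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 5))
char_matrix = char_vec.fit_transform(docs)
char_features = sorted(char_vec.vocabulary_, key=char_vec.vocabulary_.get)

print(word_features)  # ['ai', 'reality', 'virtual', 'virtual reality']
print(word_matrix)    # one row per document, one column per word n-gram
print(char_matrix.shape)
```

The character n-grams are what let an unseen input such as "web app" share features with "Mobile application".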

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Training inputs: one business description per row.
X = np.array([['AI'],
              ['Artificial Intelligence'],
              ['VR'],
              ['Virtual Reality'],
              ['Mobile application'],
              ['Desktop softwares']])
# Target category for each row of X.
y = np.array(['Artificial Intelligence', 'Artificial Intelligence',
              'Virtual Reality', 'Virtual Reality', 'Application', 'Application'])

pipeline = Pipeline(steps=[
    # Combine word-level and character-level n-gram features into one matrix.
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5)))
    ])),
    # Classify the combined features.
    ('lreg', LogisticRegression())
])

# CountVectorizer expects a 1-D sequence of strings, so flatten X first.
pipeline.fit(X.ravel(), y)
print(pipeline.predict(['web application', 'web app', 'dog', 'super intelligence']))

Prediction:

['Application' 'Application' 'Virtual Reality' 'Artificial Intelligence']
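Note that an unrelated input such as 'dog' still receives one of the three categories, because the classifier always picks its most probable class. One possible way to handle this, sketched below, is to inspect predict_proba and fall back to a placeholder when the top probability is low. The helper name `predict_with_threshold`, the 0.5 threshold, and the "Unknown" label are assumptions for illustration and are not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

X = ['AI', 'Artificial Intelligence', 'VR', 'Virtual Reality',
     'Mobile application', 'Desktop softwares']
y = ['Artificial Intelligence', 'Artificial Intelligence',
     'Virtual Reality', 'Virtual Reality', 'Application', 'Application']

# Same pipeline as in the answer: word and character n-grams, then a classifier.
pipeline = Pipeline(steps=[
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5))),
    ])),
    ('lreg', LogisticRegression()),
])
pipeline.fit(X, y)


def predict_with_threshold(texts, threshold=0.5, fallback='Unknown'):
    """Return the top category per text, or fallback if its probability is low."""
    probs = pipeline.predict_proba(texts)        # shape: (n_texts, n_classes)
    classes = pipeline.named_steps['lreg'].classes_
    top = probs.argmax(axis=1)                   # index of the most probable class
    return [str(classes[i]) if row[i] >= threshold else fallback
            for i, row in zip(top, probs)]


print(predict_with_threshold(['web application', 'web app', 'dog', 'super intelligence']))
```

With only six training examples the probabilities are rough, so any threshold would need tuning on real data.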
