Can I use TfidfVectorizer in scikit-learn for a non-English language? Also, how do I read non-English text in Python?
Question
I have to read a text document in Python that contains both English and non-English (specifically Malayalam) text. Here is what I see:
>>> text_english = 'Today is a good day'
>>> text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
Now, if I write code to extract the first letter using
>>> print(text_english[0])
T
it works as expected, but when I run
>>> print(text_non_english[0])
�
To get the first letter, I have to write the following:
>>> print(text_non_english[0:3])
ആ
Why does this happen? My aim is to extract the words in the text so that I can feed them to the tf-idf transformer. When I build the tf-idf vocabulary from the Malayalam text, it contains two-letter entries that are not real words; they are fragments of full words. What should I do so that the tf-idf transformer uses whole Malayalam words instead of two-letter fragments?
I used the following code for this:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> useful_text_1[1:3]  # contains both English and Malayalam text
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
>>> # Learn vocabulary and idf, return term-document matrix
>>> vect_2 = vectorizer.fit_transform(useful_text_1[1:3])
>>> vectorizer.vocabulary_
Part of the resulting vocabulary looks like this:
ഷമ
സന
സഹ
ർക
ർത
The vocabulary is not correct: it does not contain whole words. How can I fix this?
Recommended answer
Using a dummy tokenizer that simply splits on whitespace actually worked for me:
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
>>> vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)
>>> vect_2 = vectorizer.fit_transform(tn.split())
>>> for x in vectorizer.vocabulary_:
...     print(x)
...
സന്തോഷമാഗ്രഹിക്കാത്തത
ആരാണു
>>>
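For context on the two symptoms in the question, here is a minimal sketch using only the standard library. It assumes Python 3, where a str is a sequence of Unicode code points; the session in the question looks like Python 2, where a plain string literal holds UTF-8 bytes, which would explain why one Malayalam letter needed the slice [0:3]. The second part applies the regular expression that scikit-learn documents as the default token_pattern of TfidfVectorizer, (?u)\b\w\w+\b, directly with re. Python's \w does not match combining marks such as Malayalam vowel signs, so this is a likely (not confirmed) reason for the two-letter fragments. The variable names are illustrative only.

```python
import re
import unicodedata

# Default token_pattern documented for scikit-learn's TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത"

# Part 1: bytes versus characters.
# Each Malayalam code point occupies three bytes in UTF-8, so indexing a
# byte string at [0] yields only a third of the first letter.
encoded = text.encode("utf-8")
bytes_per_first_char = len(text[0].encode("utf-8"))
first_byte_only = encoded[:1].decode("utf-8", errors="replace")
first_three_bytes = encoded[:3].decode("utf-8")

# Part 2: default token pattern versus whitespace splitting.
# Combining marks (Unicode categories starting with "M") are not matched
# by \w, so the pattern breaks a word at every vowel sign or virama.
words = text.split()
regex_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
mark_count = sum(1 for ch in text if unicodedata.category(ch).startswith("M"))

print("bytes in first letter:", bytes_per_first_char)
print("first byte decoded alone:", repr(first_byte_only))
print("first three bytes decoded:", first_three_bytes)
print("whitespace tokens:", words)
print("default-pattern tokens:", regex_tokens)
print("combining marks in text:", mark_count)
```

Splitting on whitespace, as in the answer above, bypasses the default pattern, which is why whole words end up in the vocabulary. In Python 2, decoding the input to unicode before indexing (for example with .decode('utf-8')) should give the same per-character behaviour as the Python 3 str used here.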