Python or Java for text processing (text mining, information retrieval, natural language processing)


Question

I'm soon to start on a new project where I am going to do lots of text processing tasks like searching, categorization/classifying, clustering, and so on.

There's going to be a huge amount of documents that need to be processed; probably millions of documents. After the initial processing, it also has to be able to be updated daily with multiple new documents.

Can I use Python to do this, or is Python too slow? Is it best to use Java?

If possible, I would prefer Python since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's speed. I have used Python for some small scale text processing tasks with only a couple of thousand documents, but I am not sure how well it scales up.

Answer

Both are good. Java has a lot of steam going into text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can literally run the StanfordNLP module on a large corpus after a few minutes of playing with it. But, it has major memory requirements (3 GB or so when I was using it).

NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.
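Libraries like NLTK and Gensim handle tokenization and vectorization for you, but the core idea is simple. As a minimal, stdlib-only sketch (the two-document corpus here is made up for illustration), this is roughly what turning documents into term-frequency vectors over a shared vocabulary looks like:

```python
from collections import Counter

# A hypothetical two-document corpus; in practice NLTK/Gensim
# would handle tokenization and vectorization for real data.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
]

# Build a shared vocabulary, then one term-frequency vector per document.
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab[:3])   # first few vocabulary terms (sorted)
print(vectors[0])  # term frequencies for the first document
```

Each vector has one slot per vocabulary term, which is exactly the representation you would later hand off to numpy or a classifier.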

Python scales up because text processing is a very easily scalable problem. You can use multiprocessing very easily when parsing/tagging/chunking/extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...

I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give it a shot first. They have a very helpful mailing list as well, which I suggest you join.

If you have custom scripts, you might want to check out how well they perform with PyPy.

