Python or Java for text processing (text mining, information retrieval, natural language processing)


Question

I'm soon to start on a new project where I am going to do lots of text processing tasks like searching, categorization/classifying, clustering, and so on.

There's going to be a huge amount of documents that need to be processed; probably millions of documents. After the initial processing, it also has to be able to be updated daily with multiple new documents.

Can I use Python to do this, or is Python too slow? Is it best to use Java?

If possible, I would prefer Python since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's speed. I have used Python for some small scale text processing tasks with only a couple of thousand documents, but I am not sure how well it scales up.

Answer

Both are good. Java has a lot of steam going into text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can literally run the StanfordNLP module on a large corpus after a few minutes of playing with it. But, it has major memory requirements (3 GB or so when I was using it).

NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.
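Libraries like NLTK and Gensim handle tokenization and vectorization for you, but the core idea is simple. As a minimal, stdlib-only sketch (the two-document corpus here is made up for illustration), this is roughly what turning documents into term-frequency vectors over a shared vocabulary looks like:

```python
from collections import Counter

# A hypothetical two-document corpus; in practice NLTK/Gensim
# would handle tokenization and vectorization for real data.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
]

# Build a shared vocabulary, then one term-frequency vector per document.
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab[:3])   # first few vocabulary terms (sorted)
print(vectors[0])  # term frequencies for the first document
```

Each vector has one slot per vocabulary term, which is exactly the representation you would later hand off to numpy or a classifier.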

Python scales up because text processing is a very easily scalable problem. You can use multiprocessing very easily when parsing/tagging/chunking/extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...

I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give it a shot first. They have a very helpful mailing list as well, which I suggest you join.

If you have custom scripts, you might want to check out how well they perform with PyPy.

