Are there APIs for text analysis/mining in Java?

Question

I want to know whether there is an API for text analysis in Java: something that can extract all the words in a text, split it into separate words, expressions, and so on, and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting out with text analysis, so I only need an API to get going. I wrote a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and handle other resources related to the text.

Are there APIs for text analysis in Java?

Text mining: I want to mine the text, and I'm looking for a Java API that provides this.

Recommended answer

For example, you might use some classes from the standard library package java.text, or use java.io.StreamTokenizer (which you can customize to your requirements). But, as you know, text data from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited in that context.
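As a starting point for the word counting the question asks about, here is a minimal sketch using java.text.BreakIterator. The class name WordCounter and the method countWords are made up for illustration; only BreakIterator itself comes from the standard library. It does no spelling correction, so it also shows the limits mentioned above.

```java
import java.text.BreakIterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration; not part of any library.
final class WordCounter {

    /** Counts how often each word occurs, using BreakIterator to find word boundaries. */
    static Map<String, Integer> countWords(String text, Locale locale) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // BreakIterator also returns whitespace and punctuation segments;
            // keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                counts.merge(segment.toLowerCase(locale), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling WordCounter.countWords(pageText, Locale.ENGLISH) on the text of a downloaded page gives a word-to-frequency map; the map's size is the number of distinct words.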

So I'd advise you to use regular expressions (java.util.regex) and build your own tokenizer tailored to your needs.
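One possible shape for such a tokenizer is sketched below. It uses a single pattern with named groups so that each match is also classified. The class RegexTokenizer, the Token type, and the specific patterns (ISO-style dates, years from 1900 to 2099, a currency symbol followed by digits) are assumptions chosen for illustration, not rules from the answer; real crawler data will need patterns tuned to what it actually contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for illustration; the categories and patterns are assumptions.
final class RegexTokenizer {

    /** A matched piece of text together with the category it was recognized as. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + ":" + text;
        }
    }

    // Group names, in the same order as the alternatives in the pattern.
    private static final String[] TYPES = {"DATE", "CURRENCY", "YEAR", "NUMBER", "WORD"};

    // More specific alternatives come first, because the first one that matches wins.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<DATE>\\b\\d{4}-\\d{2}-\\d{2}\\b)"
            + "|(?<CURRENCY>\\p{Sc}\\d+(?:\\.\\d+)?)"
            + "|(?<YEAR>\\b(?:19|20)\\d{2}\\b(?!\\.\\d))"
            + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"
            + "|(?<WORD>\\p{L}+)");

    /** Scans the text left to right and returns every recognized token with its type. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            for (String type : TYPES) {
                if (matcher.group(type) != null) {
                    tokens.add(new Token(type, matcher.group()));
                    break;
                }
            }
        }
        return tokens;
    }
}
```

Characters that match none of the alternatives (whitespace, punctuation) are skipped. Recognizing names, as the question asks, is not something a simple pattern like this can do reliably.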

P.S. Depending on your needs, you could write a state-machine parser to recognize templated parts of raw text. The accompanying picture shows a simple state-machine recognizer; you can build a more advanced parser that recognizes much more complex templates in text.
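To make the state-machine idea concrete in text form, here is a small sketch of a recognizer for one template: an optionally signed decimal number such as -12.5. The template, the class DecimalRecognizer, and its states are assumptions picked for illustration and are not taken from the answer's picture.

```java
// Hypothetical recognizer for illustration; the template (signed decimal) is an assumption.
final class DecimalRecognizer {

    private enum State { START, SIGN, INTEGER, POINT, FRACTION, REJECT }

    /** Returns true if the whole input is an optionally signed decimal number. */
    static boolean matches(String input) {
        State state = State.START;
        for (int i = 0; i < input.length() && state != State.REJECT; i++) {
            state = next(state, input.charAt(i));
        }
        // Only INTEGER and FRACTION are accepting states, so "12." and "-" are rejected.
        return state == State.INTEGER || state == State.FRACTION;
    }

    /** Transition function: the state reached from the given state on character c. */
    private static State next(State state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:
                if (digit) {
                    return State.INTEGER;
                }
                return (c == '+' || c == '-') ? State.SIGN : State.REJECT;
            case SIGN:
                return digit ? State.INTEGER : State.REJECT;
            case INTEGER:
                if (digit) {
                    return State.INTEGER;
                }
                return c == '.' ? State.POINT : State.REJECT;
            case POINT:
            case FRACTION:
                return digit ? State.FRACTION : State.REJECT;
            default:
                return State.REJECT;
        }
    }
}
```

Each new template adds states and transitions to the same structure, which is why this approach scales to more complex templates better than one ever-growing regular expression.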
