Extract words out of a text file


Question

Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, to extract words from a text file? How can I get all the words while skipping special characters, but still keep things like "it's"?

I'm working in Java. Thanks.

Recommended answer

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

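import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// One or more word characters (letters, digits, underscore) or apostrophes
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

// Print each match on its own line
while (m.find()) {
    System.out.println(m.group());
}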

The pattern [\w']+ matches one or more consecutive word characters (letters, digits, underscore) or apostrophes. The example string is therefore printed word by word, with the punctuation and whitespace dropped and "it's" kept intact. Have a look at the Java Pattern class documentation to read more.
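To apply the same idea to a whole file rather than a hard-coded string, something along these lines might work. This is only a sketch: the class name WordExtractor and its two methods are made up for illustration, and the ISO-8859-1 charset is an assumption based on the "-8.txt" suffix of the Gutenberg file, so adjust it to whatever encoding your file actually uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class WordExtractor {
    // Same pattern as in the answer: runs of word characters or apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Returns every match of the pattern in the given text, in order.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Reads the file line by line and collects the words of each line.
    // Assumption: the file is encoded as ISO-8859-1.
    static List<String> extractWordsFromFile(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
            words.addAll(extractWords(line));
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Usage: java WordExtractor path/to/file.txt
            extractWordsFromFile(Paths.get(args[0])).forEach(System.out::println);
        } else {
            // No file given: fall back to the sample sentence from the answer.
            extractWords("Input text, with words, punctuation, etc. Well, it's rather short.")
                    .forEach(System.out::println);
        }
    }
}
```

Two things to keep in mind with this pattern. By default \w only covers ASCII letters, digits and the underscore, so accented words would be split; compiling the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag is one way to widen it. Also, apostrophes used as quotation marks around a word are kept as part of the match, so some post-processing may be needed if that matters.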
