Splitting strings through regular expressions by punctuation and whitespace etc. in Java
Problem description
I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words with
String.split("[\\p{Punct}\\s+]")
But I know I am missing out on some words from the text file. For example, the word "can't" should be divided into two words "can" and "t".
Commas and other punctuation should be completely ignored and considered as whitespace. I have been trying to understand how to form a more precise Regular Expression to do this but I am a novice when it comes to this so I need some help.
What could be a better regex for the purpose I have described?
Recommended answer
You have one small mistake in your regex. Try this:
String[] Res = Text.split("[\\p{Punct}\\s]+");
In [\\p{Punct}\\s]+ the + has been moved from inside the character class to the outside. Inside the class, the + is a literal plus sign rather than a quantifier, so you are also splitting on a + and runs of consecutive split characters are not combined into a single delimiter, which leaves empty strings in the result.
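To make the difference concrete, here is a minimal sketch that runs both patterns on the same line. The class name SplitQuantifierDemo and its two helper methods are made up for illustration; only String.split and the two regexes come from the discussion above.

```java
import java.util.Arrays;

class SplitQuantifierDemo {
    // '+' inside the class: every single punctuation or whitespace
    // character is its own delimiter (hypothetical helper name).
    static String[] splitPlusInside(String line) {
        return line.split("[\\p{Punct}\\s+]");
    }

    // '+' outside the class: a run of delimiter characters counts
    // as one delimiter (hypothetical helper name).
    static String[] splitPlusOutside(String line) {
        return line.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String line = "For example, the word";
        // ", " is two delimiters here, so an empty string appears between them.
        System.out.println(Arrays.toString(splitPlusInside(line)));
        // prints [For, example, , the, word]
        System.out.println(Arrays.toString(splitPlusOutside(line)));
        // prints [For, example, the, word]
    }
}
```

The empty string in the first result is what throws off a word count based on the array length.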
So for this code
String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s : Res) {
    System.out.println(s);
}
I get this result:
10
But
I
know
For
example
the
word
can
t
should
which should match your requirement.
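One caveat when using the array length as the word count: if a line starts with punctuation or whitespace, split returns an empty string as its first element, and an empty line yields a one-element array containing an empty string. A minimal sketch of a per-line counter that skips such tokens follows; the class and method names are hypothetical and not part of the original answer.

```java
class WordCountDemo {
    // Counts the non-empty tokens of one line (hypothetical helper name).
    static int countWords(String line) {
        int count = 0;
        for (String token : line.split("[\\p{Punct}\\s]+")) {
            // Skip the empty token produced by a leading delimiter
            // or by an empty line.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("But I know. For example, the word \"can't\" should")); // prints 10
        System.out.println(countWords("\"can't\" should")); // prints 3
        System.out.println(countWords("")); // prints 0
    }
}
```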
As an alternative you can use
String[] Res = Text.split("\\P{L}+");
\\P{L} matches any Unicode code point that does not have the property "Letter", so digits are treated as separators as well.
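The two patterns can give different results. As a sketch under the assumption that the input may contain digits and a typographic apostrophe (U+2019), which Java's \p{Punct} does not cover by default because it is limited to ASCII punctuation, the following compares them. The class and method names are invented for this example.

```java
class LetterSplitDemo {
    // Split on every run of non-letter code points (hypothetical helper name).
    static String[] splitOnNonLetters(String line) {
        return line.split("\\P{L}+");
    }

    // Split on runs of ASCII punctuation and whitespace (hypothetical helper name).
    static String[] splitOnPunctAndSpace(String line) {
        return line.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        // \u2019 is the typographic apostrophe, written as an escape
        // so the source file stays ASCII.
        String line = "Room 42: can\u2019t";
        // 3 tokens: Room, can, t (the digits are dropped as separators)
        System.out.println(splitOnNonLetters(line).length);
        // 3 tokens: Room, 42, and can + U+2019 + t kept as one token
        System.out.println(splitOnPunctAndSpace(line).length);
    }
}
```

Which behavior is preferable depends on whether numbers should count as words in your text.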