Java regex to filter out non-English text


Question

I found a few references to regex filtering out non-English but none of them is in Java, aside from the fact that they are all referring to somewhat different problems than what I am trying to solve:


  1. Replace all non-English characters with a space.

  2. Create a method that returns true if a string contains any non-English character.

By "English text" I mean not only actual letters and numbers but also punctuation.

So far, what I have been able to come up with for goal #1 is quite simple:

String.replaceAll("\\W", " ")

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... Is there a more efficient way to do this?

Recommended answer


In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

\W is equivalent to [^\w], and \w is equivalent to [a-zA-Z_0-9]. Using \W will replace everything which isn't an ASCII letter, a digit, or an underscore, including spaces, punctuation, tabs, and newline characters. Whether or not that's a problem is really up to you.
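To make the effect of \W concrete, here is a minimal sketch. The class and method names (NonWordDemo, blankOutNonWord) are made up for illustration and are not part of the question; the pattern is the one from the question, compiled with default flags.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class NonWordDemo {
    // With default flags, \W matches any character outside [a-zA-Z_0-9].
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    // Replaces every non-word character with a single space.
    static String blankOutNonWord(String input) {
        return NON_WORD.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The comma, space, exclamation mark, and tab are each replaced;
        // the underscore and the digit are kept.
        System.out.println(blankOutNonWord("Hello, world!\tBye_2"));
        // The accented letter is not in [a-zA-Z_0-9], so it is replaced too.
        System.out.println(blankOutNonWord("caf\u00e9"));
    }
}
```

Note that punctuation is lost along with everything else, which conflicts with the question's definition of "English text".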


By "English text" I mean not only actual letters and numbers but also punctuation.

In that case, you might want to use a negated character class that also leaves common punctuation untouched; something like

[^\w.,;:'"]
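As a rough sketch of how this class behaves in Java, the example below escapes it as a string literal and compares it with a broader variant. The broader variant is an assumption about what counts as "English" (ASCII letters, digits, punctuation, and whitespace), not something stated in the answer; it relies on the POSIX classes \p{Alnum} and \p{Punct}, which java.util.regex defines as US-ASCII only. All class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class KeepPunctuationDemo {
    // The class from the answer as a Java string literal (the backslash
    // and the double quote need escaping). It still matches whitespace
    // and any punctuation not listed, such as '?' or '!'.
    private static final Pattern ANSWER_CLASS =
            Pattern.compile("[^\\w.,;:'\"]");

    // Assumed broader definition of "English": ASCII letters and digits,
    // ASCII punctuation, and whitespace are all kept.
    private static final Pattern NON_ENGLISH =
            Pattern.compile("[^\\p{Alnum}\\p{Punct}\\s]");

    static String withAnswerClass(String input) {
        return ANSWER_CLASS.matcher(input).replaceAll(" ");
    }

    static String withAsciiClasses(String input) {
        return NON_ENGLISH.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(withAnswerClass("Wait, what?"));
        System.out.println(withAsciiClasses("Wait, what?"));
        System.out.println(withAsciiClasses("na\u00efve caf\u00e9!"));
    }
}
```

Which punctuation to keep is a judgment call; listing characters explicitly, as the answer does, gives tighter control than the POSIX classes.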




Create a method that returns true if a string contains any non-English character.

Use a Pattern and a Matcher:

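import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile once and reuse; \W matches any character outside [a-zA-Z_0-9].
Pattern p = Pattern.compile("\\W");

// Returns true as soon as one such character is found.
boolean containsSpecialChars(String string)
{
    Matcher m = p.matcher(string);
    return m.find();
}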
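One caveat with the "\\W" pattern is that a plain space already counts as a match, so ordinary sentences are flagged. The sketch below wraps the method above in a hypothetical class (NonEnglishCheck) and adds a second check, containsNonEnglish, whose pattern is an assumption about what "English text" means (ASCII letters, digits, punctuation, and whitespace) rather than part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only containsSpecialChars comes from the answer.
class NonEnglishCheck {
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    // Assumption: "English" means ASCII letters, digits, punctuation,
    // and whitespace. \p{Alnum} and \p{Punct} are US-ASCII only in Java.
    private static final Pattern NON_ENGLISH =
            Pattern.compile("[^\\p{Alnum}\\p{Punct}\\s]");

    // The answer's method: true if any character is outside [a-zA-Z_0-9].
    static boolean containsSpecialChars(String string) {
        return NON_WORD.matcher(string).find();
    }

    // Variant: true only if a character outside the assumed set appears.
    static boolean containsNonEnglish(String string) {
        return NON_ENGLISH.matcher(string).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSpecialChars("Hello world"));       // true (the space matches)
        System.out.println(containsNonEnglish("Hello, world!"));       // false
        System.out.println(containsNonEnglish("Hello, \u4e16\u754c")); // true
    }
}
```

Because find() stops at the first match, neither method builds a new string, which addresses the efficiency concern raised for goal #2.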
