Split string with "." (dot) while handling abbreviations

Question

I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.

Example input:

Hello.World
This.Is.A.Test
The.S.W.A.T.Team
S.W.A.T.
s.w.a.t.
2001.A.Space.Odyssey

Wanted output:

Hello World
This Is A Test
The SWAT Team
SWAT
swat
2001 A Space Odyssey

Essentially, I'd like to create something that's capable of splitting strings by dots, but at the same time handles abbreviations.

My definition of an abbreviation is something that has at least two characters (casing irrelevant) and two dots, i.e. "A.B." or "a.b.". It shouldn't work with digits, i.e. "1.a.".

I've tried all kinds of things with regex, but it isn't exactly my strong suit, so I'm hoping that someone here has any ideas or pointers that I can use.

Recommended answer

How about first removing the dots that need to disappear with a regex, and then replacing the rest of the dots with spaces? The regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).

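public class Main {
    public static void main(String[] args) {
        String[] data = {
                "Hello.World",
                "This.Is.A.Test",
                "The.S.W.A.T.Team",
                "S.w.a.T.",
                "S.w.a.T.1",
                "2001.A.Space.Odyssey" };

        for (String s : data) {
            // First delete the dots inside abbreviations, then turn
            // every remaining dot into a space.
            System.out.println(s.replaceAll(
                    "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
                    .replace('.', ' '));
        }
    }
}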

Result:

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey
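
If the same replacement is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the class name AbbreviationDots and the method name dotsToSpaces are made up for illustration, and the only change from the loop above is that the expression is compiled once and reused.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the answer.
final class AbbreviationDots {
    // Same expression as in the answer, compiled once so it can be reused.
    private static final Pattern ABBREVIATION_DOT =
            Pattern.compile("(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))");

    /** Removes the dots inside abbreviations, then turns the remaining dots into spaces. */
    static String dotsToSpaces(String input) {
        String joined = ABBREVIATION_DOT.matcher(input).replaceAll("");
        return joined.replace('.', ' ');
    }

    public static void main(String[] args) {
        System.out.println(dotsToSpaces("The.S.W.A.T.Team"));     // The SWAT Team
        System.out.println(dotsToSpaces("2001.A.Space.Odyssey")); // 2001 A Space Odyssey
    }
}
```

Note that a trailing dot, as in "S.w.a.T.", is not removed by the regex and therefore becomes a trailing space. Calling trim() on the result is one way to drop it if that matters.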

In a regex, an unescaped dot has a special meaning (it matches any character), so I needed to escape it. I could do that with \\. but I prefer [.].
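
To see why the escaping matters, here is a small sketch comparing the three spellings. The class DotEscaping and the method mask are hypothetical names used only for this illustration.

```java
// Hypothetical demo class; the names are illustrative only.
final class DotEscaping {
    /** Replaces every match of dotRegex in input with '-'. */
    static String mask(String input, String dotRegex) {
        return input.replaceAll(dotRegex, "-");
    }

    public static void main(String[] args) {
        System.out.println(mask("a.b", "."));   // unescaped dot matches every character: ---
        System.out.println(mask("a.b", "\\.")); // escaped with a backslash: a-b
        System.out.println(mask("a.b", "[.]")); // escaped with a character class: a-b
    }
}
```

Both escaped forms behave the same here; choosing between them is a matter of readability.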

So at the center of the regex we have a dot literal. This dot is surrounded by (?<=...) and (?=...). These are the parts of the look-around mechanism called look-behind and look-ahead.


  • Since a dot that needs to be removed has, before it, a dot (or the start of the data, ^) followed by a character that is both non-whitespace (\\S) and non-digit (\\D), I can test for that using (?<=(^|[.])[\\S&&\\D])[.].

  • A dot that needs to be removed is also followed by a non-whitespace, non-digit character and then another dot (or the end of the data, $), which can be written as [.](?=[\\S&&\\D]([.]|$)).
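
The two conditions can be checked separately to see which dots each one selects. This is a sketch only: LookaroundDemo, matchPositions, BEHIND and AHEAD are hypothetical names, and the method simply reports the index of each matching dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class LookaroundDemo {
    static final String BEHIND = "(?<=(^|[.])[\\S&&\\D])"; // look-behind from the answer
    static final String AHEAD = "(?=[\\S&&\\D]([.]|$))";   // look-ahead from the answer

    /** Returns the start index of every match of regex in input. */
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String s = "The.S.W.A.T.Team"; // dots are at indexes 3, 5, 7, 9 and 11
        System.out.println(matchPositions(BEHIND + "[.]", s));         // [5, 7, 9, 11]
        System.out.println(matchPositions("[.]" + AHEAD, s));          // [3, 5, 7, 9]
        System.out.println(matchPositions(BEHIND + "[.]" + AHEAD, s)); // [5, 7, 9]
    }
}
```

Only the dots that satisfy both conditions (indexes 5, 7 and 9) are removed, which leaves "The.SWAT.Team" before the remaining dots are turned into spaces.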

Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+, can be replaced with [a-zA-Z] for English letters only, or with \\p{IsAlphabetic} for all letters in Unicode.
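
A rough way to compare the three character classes is to build the expression around each of them and run it on inputs that contain punctuation or a non-English letter. The class LetterClassVariants and the two-argument dotsToSpaces below are hypothetical, written only for this comparison.

```java
// Hypothetical demo class; the names are illustrative only.
final class LetterClassVariants {
    /** Builds the answer's expression around the given character class and applies it. */
    static String dotsToSpaces(String input, String charClass) {
        String regex = "(?<=(^|[.])" + charClass + ")[.](?=" + charClass + "([.]|$))";
        return input.replaceAll(regex, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        String original = "[\\S&&\\D]";
        String english = "[a-zA-Z]";
        String unicode = "\\p{IsAlphabetic}";

        // '#' counts as an abbreviation character only for the original class.
        System.out.println(dotsToSpaces("a.#.b", original)); // a#b
        System.out.println(dotsToSpaces("a.#.b", english));  // a # b
        System.out.println(dotsToSpaces("a.#.b", unicode));  // a # b

        // \u00C9 is the letter E with an acute accent; [a-zA-Z] does not cover it.
        System.out.println(dotsToSpaces("\u00C9.U.A", english));
        System.out.println(dotsToSpaces("\u00C9.U.A", unicode));
    }
}
```

With [a-zA-Z] the accented letter is not treated as part of the abbreviation, so only the dot between U and A is removed; with \\p{IsAlphabetic} both dots are removed.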
