Matching Unicode Dashes in Java Regular Expressions?


Question

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em dash, the en dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should match any of the Unicode dashes or the ASCII dash when it is surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual characters, I used System.out to print some debug information. The output is as follows: each character is followed by the result of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)

It looks to me like that dash is code point 8211, which should be matched by the regex, but it isn't! What's going on here?

Recommended answer

You're mixing up decimal (8211) and hexadecimal (0x8211) notation.

\x and \u both expect a hexadecimal number. The dash in your sample is decimal 8211, which is hexadecimal 2013 (the en dash), so you need \u2013 to match it, not \u8211. Likewise the em dash (decimal 8212) is \u2014, and the normal hyphen (decimal 45) is \x2D, not \x45, which is the letter 'E'.
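As a rough sketch of the corrected pattern, the decimal values from the question can be converted to hexadecimal and used in the escapes. The class name DashSplitHex and the helper splitTitle are made up for this illustration, and the set of dashes (hyphen-minus, en dash, em dash, horizontal bar) is an assumption about which characters the asker wants to treat as separators.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are not from the original question.
public class DashSplitHex {
    // Hex escapes: x2D = hyphen-minus (decimal 45), U+2013 = en dash (8211),
    // U+2014 = em dash (8212), U+2015 = horizontal bar (8213).
    static final Pattern SEPARATOR =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015)\\s");

    // Splits a title on any of the dashes above when surrounded by whitespace.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // The sample input from the question, with the en dash written as an escape.
        String title = "Study Summary (1 of 10) \u2013 Competition";
        System.out.println(Arrays.toString(splitTitle(title)));

        // Converting the decimal value from the debug output to hex.
        System.out.println(Integer.toHexString(8211)); // prints 2013
    }
}
```

Integer.toHexString is a quick way to turn the decimal values from the debug output into the hex digits the regex escapes expect.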

But why not simply use the Unicode property "Dash punctuation" (general category Pd)?

As a Java string: "\\s\\p{Pd}\\s"
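A minimal sketch of the property-based pattern in use might look like the following. The class name DashSplitProperty, the helper splitTitle, and the sample strings are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are not from the original answer.
public class DashSplitProperty {
    // \p{Pd} matches any character in the Unicode "Dash punctuation" category,
    // which includes the ASCII hyphen-minus, the en dash and the em dash.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Splits a title on any dash that has whitespace on both sides.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        String[] samples = {
            "foo - bar",       // ASCII hyphen-minus
            "foo \u2013 bar",  // en dash
            "foo \u2014 bar"   // em dash
        };
        for (String sample : samples) {
            String[] parts = splitTitle(sample);
            System.out.println(parts[0] + " | " + parts[1]); // prints "foo | bar" each time
        }
    }
}
```

Compared with listing code points by hand, the property avoids the decimal/hex mistake entirely, at the cost of also matching any other dash in the category that may appear in the input.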
