Matching Unicode Dashes in Java Regular Expressions?

Views: 65 Published: 2021/5/18 20:27:29 Tags: java regex unicode character-properties

Problem description

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should capture any of the Unicode dashes or the ASCII dash when it is surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows: each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
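
For reference, a dump in this format could be produced by a loop along the following lines. This is only a sketch: the class name CodePointDump and the method dump are hypothetical and do not come from the original post. Note that (int) applied to a char yields the UTF-16 code unit, which equals the Unicode code point only for characters in the Basic Multilingual Plane (which includes every character in the sample input).

```java
class CodePointDump {
    // Hypothetical helper: appends each char followed by its numeric value in parentheses.
    // (int) c is the UTF-16 code unit; for BMP characters this equals the code point.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(c).append('(').append((int) c).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The dash in this literal is the en dash from the sample input.
        System.out.println(dump("Study Summary (1 of 10) – Competition"));
    }
}
```

For text that may contain characters outside the Basic Multilingual Plane, String.codePoints() would be the safer way to iterate.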

It looks to me like that dash is code point 8211, which should be matched by the regex, but it isn't! What's going on here?
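
One likely explanation, offered here as a sketch and not as the accepted answer from the original thread: in java.util.regex.Pattern, the \xhh and \uhhhh escapes take hexadecimal digits, whereas 45 and 8211 through 8214 are decimal values. As written, \x45 matches the letter 'E' and \u8211 matches a CJK character. The decimal values correspond to hex 2D and 2013 through 2016. The example below shows the same pattern with hexadecimal escapes, plus an alternative using the Unicode dash-punctuation category \p{Pd}. The class name DashSplitDemo, the constant names, and the helper splitTitle are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DashSplitDemo {
    // Same alternatives as the original pattern, rewritten in hexadecimal:
    // decimal 45 is hex 2D (hyphen-minus), decimal 8211..8214 is hex 2013..2016.
    // Hex 2016 is the double vertical line, not a dash; it is kept only to mirror the original list.
    static final Pattern HEX_SEPARATOR =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015|\\u2016)\\s");

    // Alternative: the Pd property matches any character in the Unicode
    // Dash_Punctuation category, including the ASCII hyphen-minus.
    static final Pattern CATEGORY_SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical helper that splits a title on the given separator pattern.
    static String[] splitTitle(Pattern separator, String title) {
        return separator.split(title);
    }

    public static void main(String[] args) {
        String title = "Study Summary (1 of 10) \u2013 Competition";
        System.out.println(Arrays.toString(splitTitle(HEX_SEPARATOR, title)));
        System.out.println(Arrays.toString(splitTitle(CATEGORY_SEPARATOR, title)));
    }
}
```

The \p{Pd} form avoids maintaining a hand-written list of dashes, at the cost of also matching dash-like characters the original list did not include.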
