Why does String.replaceAll() in Java require 4 backslashes "\\\\" in the regex to actually replace "\"?

Problem description


I recently noticed that String.replaceAll(regex, replacement) behaves very strangely when it comes to the escape character "\" (backslash).

For example, consider a string holding a file path, String text = "E:\\dummypath", in which we want to replace the "\\" with "/".

text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.

If we want to implement the same functionality with replaceAll(), we need to write it as text.replaceAll("\\\\","/").
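The three calls described above can be reproduced with a small program. This is only a sketch: the class name BackslashReplaceDemo and its helper methods are made up for illustration, while the string and the patterns are the ones from the question.

```java
import java.util.regex.PatternSyntaxException;

public class BackslashReplaceDemo {
    // Literal replacement: the target "\\" is the one-character string \
    // and is treated as a plain character sequence, not a regex.
    public static String withReplace(String text) {
        return text.replace("\\", "/");
    }

    // Regex replacement: "\\\\" is the two-character string \\ at runtime,
    // which the regex engine reads as one escaped (literal) backslash.
    public static String withReplaceAll(String text) {
        return text.replaceAll("\\\\", "/");
    }

    // Returns true if the one-character pattern \ is rejected by the regex compiler.
    public static boolean singleBackslashFails(String text) {
        try {
            text.replaceAll("\\", "/");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "E:\\dummypath"; // runtime value: E:\dummypath
        System.out.println(withReplace(text));          // E:/dummypath
        System.out.println(withReplaceAll(text));       // E:/dummypath
        System.out.println(singleBackslashFails(text)); // true
    }
}
```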

One notable difference is that replaceAll() takes a regex as its first argument, whereas replace() takes a plain character sequence!

But text.replaceAll("\n","/") works exactly the same as its character-sequence equivalent, text.replace("\n","/").

Digging deeper: even weirder behavior can be observed when we try some other inputs.

Let's assign text = "Hello\nWorld\n".

Now, text.replaceAll("\n","/"), text.replaceAll("\\n","/") and text.replaceAll("\\\n","/") all give the same output: Hello/World/

I feel Java has really messed up regexes in the best possible way! No other language seems to have these playful behaviors with regexes. Is there any specific reason why Java ended up like this?

Solution

@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character both in Java string literals and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
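To make the two layers concrete, here is a minimal sketch (the class EscapeLayersDemo and its method names are hypothetical). It shows how many characters actually reach the regex engine, and it uses the standard library helpers Pattern.quote and Matcher.quoteReplacement, which are not mentioned in the question, as one way to let the library do the regex-level escaping. The replacement argument of replaceAll() also treats backslash specially, which is why the second helper is used for the reverse direction.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeLayersDemo {
    // Layer 1 (compiler): the literal "\\\\" becomes the two-character string \\ .
    // Layer 2 (regex engine): those two characters mean one literal backslash.
    public static int charsSeenByRegexEngine() {
        return "\\\\".length();
    }

    // Pattern.quote wraps the text so the regex engine matches it literally,
    // leaving only the string-literal layer of escaping in the source.
    public static String toForwardSlashes(String path) {
        return path.replaceAll(Pattern.quote("\\"), "/");
    }

    // Matcher.quoteReplacement escapes backslashes and dollar signs
    // so they are taken literally in the replacement string.
    public static String toBackslashes(String path) {
        return path.replaceAll("/", Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        System.out.println(charsSeenByRegexEngine());          // 2
        System.out.println(toForwardSlashes("E:\\dummypath")); // E:/dummypath
        System.out.println(toBackslashes("E:/dummypath"));     // E:\dummypath
    }
}
```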

But why is it like that?

It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. The awkwardness of double escaping didn't become apparent in Java until regex support was added in the form of the Pattern class ... in Java 1.4.

So how do other languages manage to avoid this?

They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. C# and Python provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)


Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?

The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.

The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.

The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However, a backslash followed by any non-alphabetic character simply stands for that character. So, a backslash followed by a newline character ... means the same thing as a newline.
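The three cases can be checked side by side. In this sketch (the class name NewlinePatternDemo is made up for illustration), the pattern lengths show what each literal looks like at the String level before the regex engine interprets it.

```java
public class NewlinePatternDemo {
    // Applies the three patterns from the question to the same input.
    public static String[] replaceNewlines(String text) {
        return new String[] {
            text.replaceAll("\n", "/"),    // String level: a newline character
            text.replaceAll("\\n", "/"),   // String level: backslash + 'n'
            text.replaceAll("\\\n", "/")   // String level: backslash + newline character
        };
    }

    public static void main(String[] args) {
        // Lengths of the three patterns as the regex engine receives them.
        System.out.println("\n".length());    // 1
        System.out.println("\\n".length());   // 2
        System.out.println("\\\n".length());  // 2

        for (String result : replaceNewlines("Hello\nWorld\n")) {
            System.out.println(result);       // Hello/World/ each time
        }
    }
}
```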
