RegEx in Java: how to deal with newline

Question

I am currently trying to learn how to use regular expressions so please bear with my simple question. For example, say I have an input file containing a bunch of links separated by a newline:

www.foo.com/Archives/monkeys.htm
Description of Monkey's website.

www.foo.com/Archives/pigs.txt
Description of Pig's website.

www.foo.com/Archives/kitty.txt
Description of Kitty's website.

www.foo.com/Archives/apple.htm
Description of Apple's website.

If I wanted to get one website along with its description, this regex seems to work on a testing tool: .*www.*\\s.*Pig.*

However, when I try running it within my code it doesn't seem to work. Is this expression correct? I tried replacing "\s" with "\n" and it doesn't seem to work still.

Recommended answer

The lines are probably separated by \r\n in your file. Both \r (carriage return) and \n (linefeed) are considered line-separator characters in Java regexes, and the . metacharacter won't match either of them. \s will match those characters, so it consumes the \r, but that leaves .* to match the \n, which fails. Your tester probably used just \n to separate the lines, which was consumed by \s.
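The failure described above can be reproduced with a small sketch. The class name CrlfDemo and the helper found are made up for illustration, and the sample strings are assumed stand-ins for one entry of the input file, once with Unix line endings and once with Windows line endings:

```java
import java.util.regex.Pattern;

class CrlfDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Assumed sample data: the same entry with \n and with \r\n.
        String unix = "www.foo.com/Archives/pigs.txt\nDescription of Pig's website.";
        String windows = "www.foo.com/Archives/pigs.txt\r\nDescription of Pig's website.";

        // The regex from the question, written as a Java string literal.
        String original = ".*www.*\\s.*Pig.*";

        // \s consumes the single \n, so the second .* can reach "Pig".
        System.out.println(found(original, unix));     // true

        // \s consumes only \r; the following .* cannot step over \n.
        System.out.println(found(original, windows));  // false

        // Letting the separator part match more than one character fixes it.
        System.out.println(found(".*www.*\\s+.*Pig.*", windows));       // true
        System.out.println(found(".*www.*[\\r\\n]+.*Pig.*", windows));  // true
    }
}
```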

If I'm right, changing the \s to \s+ or [\r\n]+ should get it to work. That's probably all you need to do in this case, but sometimes you have to match exactly one line separator, or at least keep track of how many you're matching. In that case you need a regex that matches exactly one of any of the three most common line separator types: \r\n (Windows/DOS), \n (Unix/Linux/OSX) and \r (older Macs). Either of these will do:

\r\n|[\r\n]

\r\n|\n|\r
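As a rough illustration of why matching exactly one separator matters, the sketch below counts separators in text with mixed line endings. The class name LineSeparatorCount, the constant ONE_SEPARATOR and the method countSeparators are hypothetical names chosen for this example. Note that \r\n is listed first in the alternation so that a Windows line ending counts as one separator rather than two:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorCount {
    // Matches exactly one line separator: \r\n, or a lone \r or \n.
    static final Pattern ONE_SEPARATOR = Pattern.compile("\\r\\n|[\\r\\n]");

    // Hypothetical helper: counts how many line separators the text contains.
    static int countSeparators(String text) {
        Matcher m = ONE_SEPARATOR.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One separator of each style.
        System.out.println(countSeparators("one\r\ntwo\nthree\rfour")); // 3

        // Two Windows separators in a row keep the empty line between them,
        // which [\r\n]+ would have swallowed.
        System.out.println(ONE_SEPARATOR.split("a\r\n\r\nb").length);   // 3
    }
}
```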


Update: As of Java 8 we have another option, \R. It matches any line separator, including not just \r\n, but several others as defined by the Unicode standard. It's equivalent to this:

\r\n|[\n\x0B\x0C\r\u0085\u2028\u2029]

Here's how you would use it:

(?im)^.*www.*\R.*Pig.*$

The i option makes it case-insensitive, and the m puts it in multiline mode, allowing ^ and $ to match at line boundaries.
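A minimal sketch of that regex in use, assuming Java 8 or later (for \R) and assuming the file contents have already been read into a single string. The class name PigEntryFinder and the method findPigEntry are invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PigEntryFinder {
    // The regex from the answer as a Java string literal: (?im)^.*www.*\R.*Pig.*$
    static final Pattern PIG_ENTRY = Pattern.compile("(?im)^.*www.*\\R.*Pig.*$");

    // Hypothetical helper: returns the first URL line plus description line
    // whose description mentions "Pig", or null if there is none.
    static String findPigEntry(String text) {
        Matcher m = PIG_ENTRY.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample input with Windows line endings.
        String text =
            "www.foo.com/Archives/monkeys.htm\r\nDescription of Monkey's website.\r\n\r\n"
          + "www.foo.com/Archives/pigs.txt\r\nDescription of Pig's website.\r\n\r\n"
          + "www.foo.com/Archives/kitty.txt\r\nDescription of Kitty's website.\r\n";

        // Prints the pigs.txt URL line followed by its description line.
        System.out.println(findPigEntry(text));
    }
}
```

The returned match keeps whatever line separator the input used between the two lines, so it may need normalizing before further processing.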
