(Java) RegEx to get the URLs from CSS?
Question
I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried using CSSParser ( http://cssparser.sourceforge.net/ ), but it silently drops many of the rules when it parses.)
So I'm just using regex. I'd like a regex that gets me just the URLs and is robust enough to deal with real CSS from the wild:
background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url ( test4/ test4.gif );
background: url( " test5/test5.gif" );
You get the idea. This is in Java's regex implementation (not my favorite).
Recommended answer
The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.
First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (but NOT JavaScript mistakes, since JavaScript is a programming language, not a markup language).
You could start with the background(-image)? token to lock down the first part. How to proceed from there? Very difficult...
You always have a colon, so you can add it to the constant part of the token. Then, judging from your examples (not from the CSS specs), comes a variable number of whitespace characters followed by the url token. A variable number of whitespace characters is [\s]*, and this becomes part of our regex.
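To make the prefix concrete, here is a minimal Java sketch of the part described so far: the property name, the colon, optional whitespace, then url. In java.util.regex, \s is the whitespace class (\w matches word characters instead). The class name, the method name and the sample inputs are made up for illustration, and putting [\s]* right after the colon is my reading of the description above, not a regex taken from the answer.

```java
import java.util.regex.Pattern;

class WhitespaceTokenDemo {
    // Hypothetical prefix: "background" or "background-image", a colon,
    // any amount of whitespace (\s), then the literal "url".
    static final Pattern PREFIX = Pattern.compile("background(-image)?:[\\s]*url");

    // True if the declaration begins with the prefix (lookingAt anchors at the start only).
    static boolean startsWithPrefix(String css) {
        return PREFIX.matcher(css).lookingAt();
    }

    public static void main(String[] args) {
        String[] samples = {
            "background: url(a.gif);",
            "background-image:url(a.gif);",
            "background:\t  url(a.gif);",
            "color: red;"
        };
        for (String s : samples) {
            System.out.println(startsWithPrefix(s) + " <- " + s);
        }
    }
}
```

Note that backslashes have to be doubled inside a Java string literal, which is why [\s]* is written as [\\s]* in the source.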
I tried this with RegexBuddy:
background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);
Unfortunately, it captures the whitespace inside the URLs:
Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15
Matched text: background: url ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1:
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2: test4/ test4.gif
Backreference 2 offset: 138
Backreference 2 length: 18
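For reference, the same regex can be tried in plain Java roughly like this. It is only a sketch: the class and method names are invented, the input is the five sample lines from the question, and named groups such as (?<url>...) need Java 7 or later. The brackets in the output are there to make stray whitespace visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstRegexDemo {
    // The regex from above, with backslashes doubled for a Java string literal.
    static final Pattern CSS_URL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)\\);");

    // Returns the raw, untrimmed contents of the "url" group for every match.
    static List<String> rawUrls(String css) {
        List<String> found = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            found.add(m.group("url"));
        }
        return found;
    }

    public static void main(String[] args) {
        String css = String.join("\n",
                "background-image: url('test/test.gif');",
                "background: url(\"test2/test2.gif\");",
                "background-image: url(test3/test3.gif);",
                "background: url ( test4/ test4.gif );",
                "background: url( \" test5/test5.gif\" );");
        for (String raw : rawUrls(css)) {
            System.out.println("[" + raw + "]");
        }
    }
}
```

The group still contains the quotes and any whitespace before the closing parenthesis, so the values need cleaning up before use.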
So, when you get the URL with this regex, you must trim the string. I couldn't exclude whitespace from the url group because of example 4, which has to match as a URL with a whitespace inside it. That URL is probably not correct in this example anyway, unless you really have a %20test4.gif file.
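A possible cleanup step in Java might look like the sketch below. Trimming is what is suggested above; removing the surrounding quotes and trimming again inside them are my own additions (assumptions), based on the quotes that show up in the captured group. The class and method names are hypothetical. Whitespace in the middle of the value, as in example 4, is left alone.

```java
class UrlCleanup {
    // Trims the captured text, then removes one pair of matching quotes if present.
    // Assumption: whitespace just inside the quotes is not meant to be part of the URL.
    static String clean(String raw) {
        String s = raw.trim();
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                s = s.substring(1, s.length() - 1).trim();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(clean("'test/test.gif'"));
        System.out.println(clean(" test4/ test4.gif "));
        System.out.println(clean("\" test5/test5.gif\" "));
    }
}
```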
I prefer the following version of the regex:
background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;
It tolerates more whitespace.
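A small Java sketch of what this version buys you, using a made-up declaration with a space before the semicolon (the class name, method name and input are hypothetical). The stricter pattern, with \); at the end, rejects that input, while this version accepts it. Because [^\)]* is greedy, the url group still swallows whitespace before the closing parenthesis, so the captured value should still be trimmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreferredRegexDemo {
    // Stricter variant: ")" must be followed immediately by ";".
    static final Pattern STRICT = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)\\);");
    // Preferred variant: whitespace is also allowed between ")" and ";".
    static final Pattern PREFERRED = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    // Returns the raw "url" group of the first match, or null if nothing matches.
    static String firstUrl(Pattern p, String css) {
        Matcher m = p.matcher(css);
        return m.find() ? m.group("url") : null;
    }

    public static void main(String[] args) {
        String css = "background: url( a.gif ) ;";
        System.out.println("strict:    " + firstUrl(STRICT, css));
        System.out.println("preferred: [" + firstUrl(PREFERRED, css) + "]");
    }
}
```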