Complete Java regex for YouTube video IDs


Question

I need to parse several pages to get all of the YouTube video IDs they contain.

I found many regular expressions on the web, but the Java ones are not complete: they either return garbage in addition to the IDs, or they miss some IDs.

The only one I found that seems to be complete is hosted here, but it is written in JavaScript and PHP. Unfortunately, I could not translate either version into Java.

Can somebody help me rewrite this PHP regex, or the JavaScript one that follows it, in Java?

'~
    https?://         # Required scheme. Either http or https.
    (?:[0-9A-Z-]+\.)? # Optional subdomain.
    (?:               # Group host alternatives.
      youtu\.be/      # Either youtu.be,
    | youtube\.com    # or youtube.com followed by
      \S*             # Allow anything up to VIDEO_ID,
      [^\w\-\s]       # but char before ID is non-ID char.
    )                 # End host alternatives.
    ([\w\-]{11})      # $1: VIDEO_ID is exactly 11 chars.
    (?=[^\w\-]|$)     # Assert next char is non-ID or EOS.
    (?!               # Assert URL is not pre-linked.
      [?=&+%\w]*      # Allow URL (query) remainder.
      (?:             # Group pre-linked alternatives.
        [\'"][^<>]*>  # Either inside a start tag,
      | </a>          # or inside <a> element text contents.
      )               # End recognized pre-linked alts.
    )                 # End negative lookahead assertion.
    [?=&+%\w]*        # Consume any URL (query) remainder.
    ~ix'



/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube\.com\S*[^\w\-\s])([\w\-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:['"][^<>]*>|<\/a>))[?=&+%\w]*/ig;


Recommended answer

First of all, you need to insert an extra backslash \ for each backslash in the original regex. Otherwise Java treats every backslash as the start of a string-literal escape sequence, which is not what you want here. The double quote inside the character class also has to be written as \" so that it does not end the string literal.

https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*
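To illustrate the doubling rule on a small fragment of the expression, here is a minimal sketch. The class name `BackslashDemo` and the constant `HOST_REGEX` are made up for this illustration; only `Pattern.matches` is a real `java.util.regex` call.

```java
import java.util.regex.Pattern;

public class BackslashDemo {
    // Hypothetical constant for illustration. In Java source, "\\." is the
    // two-character string \. at run time, which the regex engine then
    // reads as "a literal dot".
    public static final String HOST_REGEX = "youtu\\.be";

    public static void main(String[] args) {
        System.out.println(HOST_REGEX);                                // prints youtu\.be
        System.out.println(HOST_REGEX.length());                       // prints 9
        System.out.println(Pattern.matches(HOST_REGEX, "youtu.be"));   // prints true
        System.out.println(Pattern.matches(HOST_REGEX, "youtuXbe"));   // prints false
    }
}
```

With a single backslash (`"youtu\.be"`) the file would not compile at all, because `\.` is not a valid escape sequence in a Java string literal.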

Next, when you compile the pattern, pass the Pattern.CASE_INSENSITIVE flag, which takes the place of the i flag in the original expression. Here is an example:

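import java.util.regex.Matcher;
import java.util.regex.Pattern;

String pattern = "https?:\\/\\/(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";

// CASE_INSENSITIVE stands in for the "i" flag of the PHP/JavaScript versions.
Pattern compiledPattern = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(link); // link: the text to search
while (matcher.find()) {
    // group() is the whole matched URL; group(1) is the 11-character video ID.
    System.out.println(matcher.group());
}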
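Since the question asks for the IDs rather than the full URLs, the same pattern can be wrapped in a small helper that collects capture group 1. This is only a sketch: the class `YoutubeIdExtractor`, the method `extractIds`, and the sample URLs (with made-up 11-character IDs) are assumptions for illustration, not part of the original answer. The forward slashes are left unescaped here because `/` has no special meaning in Java regex syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YoutubeIdExtractor {
    // Same expression as above, split across lines for readability.
    private static final Pattern YOUTUBE_URL = Pattern.compile(
        "https?://(?:[0-9A-Z-]+\\.)?"                           // scheme, optional subdomain
        + "(?:youtu\\.be/|youtube\\.com\\S*[^\\w\\-\\s])"       // host alternatives
        + "([\\w\\-]{11})(?=[^\\w\\-]|$)"                       // group 1: the video ID
        + "(?![?=&+%\\w]*(?:['\"][^<>]*>|</a>))"                // skip pre-linked URLs
        + "[?=&+%\\w]*",                                        // rest of the query string
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every video ID found in the text, in order.
    public static List<String> extractIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = YOUTUBE_URL.matcher(text);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample input with made-up IDs, for illustration only.
        String page = "Watch https://www.youtube.com/watch?v=abcdefghijk"
                    + " and http://youtu.be/ABC-123_xyz today.";
        System.out.println(extractIds(page)); // prints [abcdefghijk, ABC-123_xyz]
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every page, which matters when many pages are parsed in a loop.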
