Determining if a URI is valid using Perl regex


Problem description

For an application I'm developing, I need a Perl script that loops through a massive CSV file and ensures that every line contains a valid URI. I asked a question earlier about parsing a CSV file, and I have started using Text::CSV to make my life a lot easier. Now I need to ensure that the URI is valid.

Due to the nature of my application, URIs do not need to take the full form of

protocol://username:password@domain.extension/request?vars=values

Rather, I am only interested in the request portion of this. For a general website, that would be anything after the .com, .edu, etc.

I currently have the following Perl script:

if($_ !~ /^(?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*$/i){
    print "Invalid URL format";
    exit;
} else {
    /* stuff */
}

The regex should be fairly straightforward. The request may contain any of a small set of characters ([a-z0-9-._~!$&'()*+,;=:/?@]), or it may contain a percent sign (%) followed by two hexadecimal digits. Either of these patterns may be repeated indefinitely.

When I run this script I get the following error:

Number found where operator expected at ./301rules.pl line 58, near "%[0"
        (Missing operator before 0?)
Bareword found where operator expected at ./301rules.pl line 58, near "9A"
        (Missing operator before A?)
Bareword found where operator expected at ./301rules.pl line 58, near "$/i"
        (Missing operator before i?)
syntax error at ./301rules.pl line 58, near "%[0"

It's fairly obvious that something in my regex needs to be escaped, but I'm unsure what. I tried escaping every possible symbol to create the following regex:

if($_ !~ /^(?:[a-z0-9\-\.\_\~\!\$\&\'\(\)\*\+\,\;\=\:\/\?\@]|%[0-9A-F]{2})*$/i){

However, when I did this it allowed every string to pass the test, even strings that I know are invalid, such as te%st or é

So does anyone have experience with Perl regex and know what I need to escape and what I should not escape? With 19 different symbols I don't feel like trying all 2^19 = 524288 possibilities.

EDIT - voting to close. I found out that the issue was actually in the code immediately above this check, although I don't entirely understand why yet.

I had:

if( $_ == "" ){
    next;
}
/* regex conditional from above */

For whatever reason, it kept evaluating to true and going to the next iteration despite there clearly being data stored in $_. I'll figure out why this was, but for now the regex works fine with everything escaped.
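A likely explanation for the behaviour described above: in Perl, == compares its operands as numbers, while eq compares them as strings. A string with no leading number is converted to 0, and so is the empty string, so a numeric comparison of most request strings against "" comes out true. The following is a minimal sketch of that difference; the helper names is_empty_numeric and is_empty_string are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the original check, which used the numeric
# operator "==". Both operands are converted to numbers before comparing.
sub is_empty_numeric {
    my ($s) = @_;
    no warnings 'numeric';    # silence "isn't numeric" warnings for the demo
    return $s == "" ? 1 : 0;
}

# Hypothetical helper: uses the string operator "eq", which is true only
# when the value really is the empty string.
sub is_empty_string {
    my ($s) = @_;
    return $s eq "" ? 1 : 0;
}

# Sample values standing in for rows read from the CSV file.
for my $row ('', 'index.html', '42') {
    printf "%-14s ==: %d  eq: %d\n",
        "'$row'", is_empty_numeric($row), is_empty_string($row);
}
```

Under this reading, replacing == with eq in the emptiness test (or testing length($_) == 0) would let non-empty rows reach the regex check.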

Recommended answer

I don't know how you arrived at your first regex, but I'll try to help you fix it. You only have to escape the characters that have special meaning in a regex. In your regex, they are: -, ., $, (, ), *, and /, so the regex should look like:

if($_ !~ /^(?:[a-z0-9\-\._~!\$&'\(\)\*+,;=:\/?@]|%[0-9A-F]{2})*$/i){
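As an alternative sketch (not the answerer's version), the same check can be written with a different regex delimiter so that / needs no escaping. Inside a bracketed character class most metacharacters are already literal; in this sketch only $ and @ are escaped, to keep Perl from treating them as the start of a variable, and the hyphen is moved to the end of the class so it cannot be read as a range. The names $REQUEST_RE and is_valid_request are hypothetical, and the use of \A and \z instead of ^ and $ is an assumption made here so that a trailing newline is not silently accepted.

```perl
use strict;
use warnings;

# Allowed: the characters from the question's set, or a percent sign
# followed by exactly two hexadecimal digits, repeated any number of times.
# qr{...} uses braces as delimiters, so "/" is an ordinary character here.
my $REQUEST_RE = qr{\A(?:[a-z0-9._~!\$&'()*+,;=:/?\@-]|%[0-9a-f]{2})*\z}i;

# Hypothetical helper: returns 1 if the request portion is well formed.
sub is_valid_request {
    my ($request) = @_;
    return $request =~ $REQUEST_RE ? 1 : 0;
}

# Sample inputs, including the two invalid strings from the question.
for my $candidate ('/path/page?x=1&y=%20', 'te%st', "\x{e9}") {
    print is_valid_request($candidate) ? "valid\n" : "Invalid URL format\n";
}
```

Note that the empty string also passes, because the group may repeat zero times; a separate emptiness check is still needed if blank rows should be skipped.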

I don't know exactly what ?: is trying to achieve there, but the first character class that follows it (the expression between the first [ ]) has no quantifier; maybe it should be followed by a *, a +, or a ?. Also, I think the | sign is meant to act as an OR between your first character class and the second character class preceded by a %. As it looks right now, it applies only between the first character class and the % sign. It probably should be like |(%[0-9A-F]{2}))*$
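For reference, (?:...) is Perl's non-capturing group, and a quantifier placed after its closing parenthesis applies to the whole alternation inside it, so each repetition matches either one class character or one complete percent escape. The sketch below uses a deliberately simplified character class ([ab]) and hypothetical helper names to show that the grouped and capturing forms accept the same strings and differ only in whether the last repetition is saved in $1.

```perl
use strict;
use warnings;

# Non-capturing group: "*" repeats the whole alternation, so each step is
# either a single "a"/"b" or a "%" followed by two hexadecimal digits.
sub matches_grouped {
    my ($s) = @_;
    return $s =~ /\A(?:[ab]|%[0-9A-F]{2})*\z/i ? 1 : 0;
}

# Capturing group: same matching behaviour, but $1 holds the text matched
# by the final repetition of the group.
sub last_captured {
    my ($s) = @_;
    return $s =~ /\A([ab]|%[0-9A-F]{2})*\z/i ? $1 : undef;
}

print matches_grouped('a%20b'), "\n";    # whole string is accepted
print matches_grouped('a%2'),   "\n";    # incomplete escape is rejected
print last_captured('a%20b'),   "\n";    # last repetition matched "b"
```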
