Groovy Regex Illegal Characters

    Question

    I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.

    The code that isn't compiling is this:

    def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

    What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:

    [ERROR] BUILD ERROR
    [INFO] ------------------------------------------------------------------------
    [INFO] line 23:26: unexpected char: 0x0

    The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.

    Thanks!

    Update: The code compiles, but it's not filtering as I'd expected it to. In regexpal I put the regex:

    [\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]

    and the test data:

    name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
    <doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field 
    name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
    <field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field  
     name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field 
    name='class1'>TS</field><field name='freq'>S</field><field 
    name='class2'>616.079</field><field name='text'>Subcellular Localization of the 
    Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast 
    Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
    <field name='year'>2005</field><field name='lang'>E</field><field 
    name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field   
    name='page'>89-97</field><field name='shm'>3146.757500</field><field 
    name='pubc'>47</field><field name='cs'>1</field><field 
    

    It's a grab from a file with one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy replaces even the '<' and '>' characters, so it basically annihilates the entire document.

    The code snippet:

    def List parseFile(File file){
        println "reading File name: ${file.name}"
        def lineCount = 0
        List data = new ArrayList()
    
        file.eachLine {
            String input ->
            lineCount ++
            String line = input
            if(input =~ illegalChars){
                line = input.replaceAll(illegalChars, " ")
            }
            Map document = new HashMap()
            elementNames.each(){
                token ->
                def val = getValue(line, token)
                if(val != null){
                    if(token.equals("ISSUE")){
                        List entries = val.split(";")
                        document.putAt("year",entries.getAt(0).trim())
                        if(entries.size() > 1){
                            document.putAt("volume", entries.getAt(1).trim())
                        }
                        if(entries.size() > 2){
                            document.putAt("issue", entries.getAt(2).trim())
                        }
                    } else {
                        document.putAt(token, val)
                    }
                }
            }
            data.add(document)
        }
    
        println "done"
        return data
    }
    

    I don't see any reason that the two should behave differently; am I missing something?

    Again, thanks!

    Solution

    OK here's my finding:

    >>> print "XYZ".replaceAll(
           /[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
           "-"
        )
    
    ---
    
    >>> print "X\0YZ".replaceAll(
           /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
           "-"
        )
    
    X-YZ
    
    >>> print "X\0YZ".replaceAll(
           "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
           "-"
        )
    
    X-YZ
    

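    In other words, my \\uNNNN answer within /pattern/ is WRONG. In a slashy string the doubled backslash reaches the regex engine as a literal backslash, so \\u0000-\\u0008 is read as a backslash, the letter u, some digits, and the range 0-\ (U+0030 to U+005C). That range includes the digits, <, > and all capital letters.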
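
    To make the difference visible, here is a minimal sketch that feeds the same character class to replaceAll once as a slashy string and once as a double-quoted string. The variable names and the '<XYZ>' sample are made up for illustration; the pattern text is the one from the transcript above.

```groovy
// Slashy string: a doubled backslash stays doubled, so the regex engine sees
// a literal backslash, the letter u, and plain digits inside the class.
def slashy = /[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/

// Double-quoted string: each doubled backslash collapses to one, so the regex
// engine receives real backslash-u escapes and resolves them to code points.
def quoted = "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"

// Same source text, different string contents.
assert slashy.count('\\') == 16
assert quoted.count('\\') == 8

def markup = '<XYZ>'                           // hypothetical sample input
def mangled = markup.replaceAll(slashy, '-')   // accidental 0-\ range matches everything here
def intact  = markup.replaceAll(quoted, '-')   // no control characters, nothing to replace

println mangled   // -----
println intact    // <XYZ>
```

    Every character of the sample falls between '0' and the backslash, which is why the slashy version wipes out angle brackets and capital letters while the quoted version leaves the markup alone.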

    The \\uNNNN form only works in a double-quoted "pattern" string, not in a slashy /pattern/ string.
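
    Applied to the original problem, one way to build the filter is to compile the pattern from a double-quoted string and reuse it for every line. This is only a sketch under assumptions: the closure name clean and the sample line are hypothetical, the ranges are copied from the question, and the replacement is a single space as in the question's parseFile snippet.

```groovy
import java.util.regex.Pattern

// The ~ operator on a String yields a java.util.regex.Pattern. Doubling the
// backslashes in a double-quoted string hands the regex engine real escapes.
Pattern illegalChars = ~"[\\u0000-\\u0008\\u000B-\\u000C\\u000E-\\u001F\\u007F-\\u009F]"

// Hypothetical helper: replace each illegal character with one space.
def clean = { String input -> illegalChars.matcher(input).replaceAll(' ') }

// Build a sample line containing NUL (0x00) and the unit separator (0x1F).
def nul = Character.toString((char) 0)
def unitSep = Character.toString((char) 0x1F)
def line = "<field name='lang'>E" + nul + "</field>" + unitSep

def cleaned = clean(line)
println cleaned

// Markup survives; only the two control characters were replaced.
assert cleaned == "<field name='lang'>E </field> "
// Tab is outside the listed ranges and is left untouched.
assert clean("a\tb") == "a\tb"
```

    Compiling the pattern once also makes the separate `input =~ illegalChars` check in parseFile unnecessary, since replaceAll returns the input unchanged when nothing matches.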

    I will edit my official answer based on comments to this "answer".
