Ruby 1.9: Regular Expressions with unknown input encoding


Problem description

Is there an accepted way to deal with regular expressions in Ruby 1.9 for which the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/

x.match(re) 
=> #<MatchData "<p>bar</p>" 1:"bar">

y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
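The failure above can be reproduced and inspected with a small sketch. The helper name `safe_match` is hypothetical and not part of the original question; it only shows that the regexp literal is US-ASCII while the string is UTF-16LE, and that the mismatch surfaces as an exception rather than as a failed match. Note that this sketch returns nil both for "no match" and for "incompatible encodings", which real code would probably want to distinguish.

```ruby
# Hypothetical helper (not from the original post): returns the MatchData,
# or nil when the regexp and the string have incompatible encodings.
def safe_match(str, regex)
  str.match(regex)
rescue Encoding::CompatibilityError
  nil
end

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = %r{<p>(.*)</p>}       # %r{} avoids having to escape the slash in </p>

puts re.encoding           # encoding of the ASCII-only regexp literal
puts y.encoding            # encoding of the transcoded string

puts safe_match(x, re).inspect  # a MatchData for the ASCII-compatible string
puts safe_match(y, re).inspect  # nil, because the match raised
```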

My current approach is to use UTF-8 internally and re-encode (a copy of) the input if necessary:

if y.respond_to?(:encode)  # Ruby 1.8 compatibility
  if y.encoding.name != 'UTF-8'
    y = y.encode('UTF-8')
  end
end

y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">

However, this feels a little awkward to me, and I wanted to ask if there's a better way to do it.

Solution

As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding), options)
end

# Inside code looping through lines of input.
# The variables 'regex' and 'last_encoding' should be initialized beforehand,
# so that they persist across iterations.
if line.respond_to?(:encoding)  # Ruby 1.8 compatibility
  if line.encoding != last_encoding
    # Regexp::FIXEDENCODING (bit value 16) is the option bit that //u also sets.
    regex = get_regex('<p>(.*)</p>', line.encoding, Regexp::FIXEDENCODING)
    last_encoding = line.encoding
  end
end
line.match(regex)

In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
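One possible way to soften the pathological case is to keep one compiled regex per encoding instead of only the most recent one. The following is a minimal sketch of that idea, not part of the original answer: the names `PATTERN`, `REGEX_CACHE`, and `match_line` are hypothetical, and it assumes Ruby 1.9 or later (it drops the Ruby 1.8 guard).

```ruby
# Hypothetical names throughout; a sketch of caching one Regexp per encoding.
PATTERN = '<p>(.*)</p>'

# The default block runs only the first time an encoding is looked up,
# so each encoding is transcoded and compiled at most once.
REGEX_CACHE = Hash.new do |cache, encoding|
  cache[encoding] = Regexp.new(PATTERN.encode(encoding), Regexp::FIXEDENCODING)
end

def match_line(line)
  line.match(REGEX_CACHE[line.encoding])
end

lines = [
  "foo<p>bar</p>baz",
  "foo<p>qux</p>baz".encode('UTF-16LE'),
  "foo<p>bar</p>baz".encode('UTF-16BE')
]

lines.each do |line|
  m = match_line(line)
  # The capture comes back in the line's own encoding; transcode for display.
  puts m[1].encode('UTF-8') if m
end
```

With this variant, input that alternates between a handful of encodings pays the transcoding cost once per encoding rather than once per change, at the price of holding a few extra Regexp objects in memory.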
