Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

Question

I have a service that uploads data to our database via XML feeds provided by customers. These XML files often claim to be UTF-8 encoded, but they clearly contain quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.

The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

require 'nokogiri'

data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
data = []  # assumed: collects one hash per <job> element
doc.xpath(".//job").each do |node|
  # Map each child element's normalized name (as a symbol) to its text content
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
  end
  data.push(hash)
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"

Here are all the helpful suggestions I've tried, all of which failed.

  1. Use Coder

Coder.clean!(data_string, "UTF-8")

  2. Force Encoding

    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
    

  3. Convert to UTF-16 and back to UTF-8

    data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
    data_string.encode!('UTF-8', 'UTF-16')
    

  4. Use valid_encoding?

    data_string.chars.select{|i| i.valid_encoding?}.join
    

    No characters are removed, and the "invalid byte sequence" error still occurs.

  5. Specify encoding on opening the file

    I actually wrote a function that tries every possible encoding until it can open the file without errors and convert it to UTF-8 (@file_encodings is an array of every possible file encoding):

    @file_encodings.each do |enc|
      print "#{enc}..."
      conv_str = "r:#{enc}:utf-8"
      begin
        data_file = File.open(fname, conv_str)
        data_string = data_file.read
      rescue
        data_file = nil
        data_string = ""
      end
      data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")

      unless data_string.blank?
        print "\n#{enc} detected!\n"
        return data_string
      end
    end
    

  6. Use Regexp to remove non-printables:

    data_string.gsub!(/[^[:print:]]/,"")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

    (I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)

For all of the above, the results are the same: either "invalid byte sequence" errors occur, or the file is cut off partway through after only 4400 rows have been read.

So why does the Linux "tr" command work perfectly, while NONE of these suggestions can do the job in Ruby on Rails?

What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. I added each one I found to a character class and then removed it with gsub!, like this (the control characters won't print here, but you get the idea):

    data_string.gsub!(/[Crazy Control Characters]/,"")
    

But the purist in me insists there should be a more elegant, general solution.

Recommended answer

Ruby 2.1 has a new method called String#scrub, which is exactly what you need.

If the string is invalid byte sequence then replace invalid bytes with given replacement character, else returns self. If block is given, replace invalid bytes with returned value of the block.
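The three behaviors described there can be seen in a minimal sketch. The sample string and variable names are made up for illustration and assume Ruby 2.1 or later:

```ruby
# A sample string holding one byte (0xFF) that can never appear in valid UTF-8.
dirty = "caf\xC3\xA9 \xFF ok".dup.force_encoding(Encoding::UTF_8)

dirty.valid_encoding?   # false, because of the stray 0xFF byte

# No argument: each invalid sequence becomes U+FFFD (the replacement character).
with_marker = dirty.scrub

# String argument: invalid sequences are replaced by that string ("" deletes them).
deleted = dirty.scrub("")

# Block form: the block receives the invalid bytes and returns the replacement.
hex_tagged = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts with_marker   # café � ok
puts deleted       # café  ok
puts hex_tagged    # café <ff> ok
```

Passing an empty string is probably the closest match to what the question wants, since it drops the bad bytes and leaves everything else untouched.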

Check the documentation for more information.
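Applied to the feed-import case from the question, one possible shape is sketched below. The helper name clean_utf8 is hypothetical, and stripping control characters afterwards is an assumption about what the XML parser objects to, not something the answer states:

```ruby
# Hypothetical helper: the name and the exact cleanup rules are illustrative only.
def clean_utf8(raw)
  # Label the bytes as UTF-8 without transcoding, then drop invalid sequences.
  text = raw.dup.force_encoding(Encoding::UTF_8).scrub("")
  # Remove control characters, but keep tab, newline and carriage return.
  text.gsub(/[[:cntrl:]&&[^\t\n\r]]/, "")
end

# Assumed usage with the question's variables: read raw bytes, clean, then parse.
#   data_string = clean_utf8(File.binread(data_source))
#   doc = Nokogiri::XML.parse(data_string)

# Made-up sample: a lone 0xE9 byte (Latin-1 "é") and a NUL inside a UTF-8 feed.
sample = "<job><title>Dev\xE9loper\x00</title></job>\n".b
cleaned = clean_utf8(sample)
puts cleaned
```

Reading the file as raw bytes first avoids any encoding check at read time, so the cleanup happens in a single, predictable place before the parser sees the data.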
