Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

Views: 249 · Published: 2020/7/13 6:01:32 · Tags: ruby-on-rails ruby encoding utf-8 encode

Problem description

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.
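For comparison, here is a minimal sketch of what that command appears to do, expressed in Ruby. GNU `tr` works on raw bytes rather than decoded characters, and with `-cd` it deletes every byte that is not in the given set, so in the C locale only printable ASCII (0x20-0x7E) survives. Note that this also drops newlines and every non-ASCII character, valid or not. The helper name `keep_printable_ascii` is hypothetical, not part of any library.

```ruby
# Hypothetical helper: keep only printable ASCII bytes (0x20-0x7E),
# approximating `tr -cd '^[:print:]'` in the C locale.
def keep_printable_ascii(raw)
  # String#b returns a binary (ASCII-8BIT) copy, so no byte can be "invalid".
  # A leading ^ in String#delete's argument negates the set, so everything
  # outside 0x20-0x7E is deleted. What remains is pure ASCII and therefore
  # valid UTF-8.
  raw.b.delete("^\x20-\x7E").force_encoding(Encoding::UTF_8)
end

# Sample input: a valid two-byte character, a stray 0xFF byte,
# a control character, and a trailing newline.
dirty = "caf\xC3\xA9 \xFF<job>\x01ok</job>\n".b
clean = keep_printable_ascii(dirty)
puts clean
```

Because the work is done on a binary copy, no "invalid byte sequence" error can be raised during the cleanup itself. The cost is that legitimate accented or non-Latin text is lost along with the bad bytes.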

The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
    data.push(newrow)
  end
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"

Here are all the helpful suggestions I've tried but all have failed.

Use Coder

Coder.clean!(data_string, "UTF-8")

Force Encoding

data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

Convert to UTF-16 and back to UTF-8

data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')

Use valid_encoding?

data_string.chars.select{|i| i.valid_encoding?}.join

No characters are removed; generates "invalid byte sequence" errors.
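On Ruby 2.1 or newer, `String#scrub` is the built-in tool for dropping invalid byte sequences while leaving valid multibyte characters alone. The sketch below assumes such a Ruby version and uses a made-up sample string. It only addresses invalid bytes: characters that are valid UTF-8 but not permitted in XML would still be present afterwards.

```ruby
# Assumes Ruby 2.1+, where String#scrub exists.
# Sample data (made up): "café" encoded correctly, then a stray 0xFF byte.
raw = "caf\xC3\xA9\xFF ok".dup.force_encoding(Encoding::UTF_8)

puts raw.valid_encoding?   # the 0xFF byte makes the string invalid

# scrub("") replaces each invalid byte sequence with the empty string
# and returns a new string; the original is not modified.
clean = raw.scrub("")

puts clean.valid_encoding?
puts clean
```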

Specify encoding on opening file

I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):

@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")

  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end

Use Regexp to remove non-printables:

data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)

For all of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.

So why does the Linux "tr" command work perfectly, and yet NONE of these suggestions can do the job in Ruby on Rails?

What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):

data_string.gsub!(/[Crazy Control Characters]/,"")
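If the offending characters turn out to be the C0 control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed, and carriage return), the hand-built character class can be written generically. This is a sketch under that assumption; the constant and method names are hypothetical. It expects a string that is already valid UTF-8, since `gsub` itself raises on invalid byte sequences.

```ruby
# Control characters outside the XML 1.0 Char production:
# 0x00-0x08, 0x0B, 0x0C, and 0x0E-0x1F. Tab (0x09), LF (0x0A), and
# CR (0x0D) are legal in XML and are kept.
XML_ILLEGAL_CONTROL = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Hypothetical helper: assumes `text` is already valid UTF-8.
def strip_xml_illegal_controls(text)
  text.gsub(XML_ILLEGAL_CONTROL, "")
end

sample  = "a\x00b\x08c\tline\n\x1Fend"
cleaned = strip_xml_illegal_controls(sample)
puts cleaned.inspect
```

Unlike the `tr` approach, this keeps newlines and all non-ASCII text, so accented characters in the feed would survive the cleanup.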

But the purist in me insists there should be a more elegant, general solution.
