Ruby: How can I detect/intelligently guess the delimiter used in a CSV file?
Question
I need to figure out which delimiter (comma, space or semicolon) is used in a CSV file in my Ruby project. I know there is a Sniffer class in Python's csv module that can be used to guess a given file's delimiter. Is there anything similar in Ruby? Any kind of help or idea is greatly appreciated.
Recommended answer
It looks like the Python implementation just checks a few dialects: excel or excel_tab. So a simple implementation of something that just checks for "," or "\t" is:
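# Candidate delimiters, written as they appear between quoted fields.
COMMON_DELIMITERS = ['","', "\"\t\""]

def sniff(path)
  # Read only the first line; the block form closes the file afterwards.
  first_line = File.open(path) { |f| f.gets }
  return nil unless first_line
  snif = {}
  COMMON_DELIMITERS.each { |delim| snif[delim] = first_line.count(delim) }
  # Sort the candidates by count, highest first.
  snif = snif.sort { |a, b| b[1] <=> a[1] }
  snif.size > 0 ? snif[0][0] : nil
end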
Note: that would return the full delimiter it finds, e.g. ",", so to get , you could change snif[0][0] to snif[0][0][1].
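As a quick check of the indexing mentioned in the note, here is a minimal sketch. The variable names are only illustrative; the two strings simply mirror the entries of COMMON_DELIMITERS.

```ruby
# Illustrative names; these mirror the two entries in COMMON_DELIMITERS.
quoted_comma = '","'
quoted_tab   = "\"\t\""

# Index 1 is the middle character, i.e. the bare delimiter without the quotes.
bare_comma = quoted_comma[1]
bare_tab   = quoted_tab[1]

puts bare_comma.inspect
puts bare_tab.inspect
```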
Also, I'm using count(delim) because it is a little faster, but count treats its argument as a set of characters rather than a substring. If you added a delimiter made of two (or more) characters of the same type, like --, it would count each occurrence twice (or more) when weighing the candidates, so in that case it may be better to use scan(delim).length.
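To make the difference concrete, the sketch below compares the two calls on sample lines and then shows one possible variant of the sniffer that counts whole occurrences with scan. The sniff_line helper, the CANDIDATES list (which adds the semicolon and space mentioned in the question) and the sample strings are assumptions for illustration, not part of the original answer.

```ruby
# String#count treats its argument as a set of characters,
# so "--" counts every single hyphen in the line.
dashed = "a--b--c"
hyphen_chars = dashed.count("--")
# String#scan finds non-overlapping occurrences of the whole substring.
dash_pairs = dashed.scan("--").length

# The same effect with the quoted comma: count tallies every quote and comma.
quoted = '"a","b","c"'
quote_and_comma_chars = quoted.count('","')
quoted_commas = quoted.scan('","').length

# Hypothetical candidate list of bare delimiters (not from the original answer).
CANDIDATES = [",", ";", "\t", " "].freeze

# Returns the candidate that occurs most often in the line,
# or nil when none of the candidates occur at all.
def sniff_line(line)
  counts = CANDIDATES.to_h { |delim| [delim, line.scan(delim).length] }
  best, hits = counts.max_by { |_delim, n| n }
  hits.zero? ? nil : best
end

puts sniff_line("name;age;city").inspect
```

Counting on a single line is only a rough heuristic: a line with many spaces inside its fields, for example, could be guessed as space-delimited even when it is not.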