How do you dynamically identify unknown delimiters in a data file?

Problem description

I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:

apples | bananas | oranges | grapes

Data file two looks like this:

quarter, dime, nickel, penny

Data file three looks like this:

horse cow pig chicken goat

(the change in the number of columns is also intentional)

The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:

def count_chars(s):
    # Count how often each candidate separator appears in s.
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
with open(infile, 'r') as f:
    records = f.read()
print(count_chars(records))

It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.

But I can't think of a better way to do this.

Any suggestions?

Solution

If you're using Python, I'd suggest just calling re.split on each line with a character class that contains all the valid expected separators:

>>> import re
>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.
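As a rough illustration, here is the same pattern applied to the sample lines from the question, plus one made-up line whose data contains a space. The helper name `split_fields` and the "New York" line are assumptions for the sake of the example, not part of the original question.

```python
import re

# Same character class as in the answer: space, comma, pipe, semicolon, double quote.
SEPARATORS = re.compile(r'[ ,|;"]+')

def split_fields(line):
    """Split one line on any run of the expected separator characters."""
    # Strip the trailing newline (and edge whitespace) so no empty field
    # appears at the start or end of the result.
    return SEPARATORS.split(line.strip())

if __name__ == "__main__":
    print(split_fields("apples | bananas | oranges | grapes\n"))
    print(split_fields("quarter, dime, nickel, penny\n"))
    print(split_fields("horse cow pig chicken goat\n"))
    # Hypothetical line: the space inside "New York" is treated as a
    # separator too, so the city name is broken into two fields.
    print(split_fields("New York | Boston\n"))
```

The last call shows the caveat: because a space is one of the accepted separators, any field that legitimately contains a space gets split apart.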

If you must identify the separator, your best bet is to count every candidate character except the space. If the non-space candidates barely occur, the separator is probably a space; otherwise, it is the candidate with the highest count.
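One way this counting heuristic could be sketched is shown below. The function name `guess_delimiter` and the `min_per_line` threshold (what counts as "almost no occurrences") are assumptions chosen for illustration; the candidate list is the question's `valid_seps` without the space.

```python
from collections import Counter

# Candidate separators from the question's valid_seps list, minus the space.
NON_SPACE_SEPS = ['|', ',', ';', '\t']

def guess_delimiter(text, min_per_line=1.0):
    """Guess the delimiter of text by counting non-space candidates.

    min_per_line is a hypothetical threshold: if the most common candidate
    appears fewer times than this per non-empty line on average, the data
    is assumed to be space separated.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None  # nothing to inspect

    counts = Counter(c for c in text if c in NON_SPACE_SEPS)
    if not counts:
        return ' '  # no non-space candidate at all

    best, n = counts.most_common(1)[0]
    # Too few occurrences to be a real separator: fall back to space.
    if n / len(lines) < min_per_line:
        return ' '
    return best

if __name__ == "__main__":
    print(repr(guess_delimiter("apples | bananas | oranges | grapes\n")))
    print(repr(guess_delimiter("quarter, dime, nickel, penny\n")))
    print(repr(guess_delimiter("horse cow pig chicken goat\n")))
```

The threshold is the fragile part: a stricter value misreads short records, a looser one misreads space-separated data that happens to contain a few commas, so it would need tuning against real files.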

Unfortunately, there's really no way to be sure. You may have space-separated data that is full of commas, or |-separated data that is full of semicolons, so this approach will not always work.
