Split string with variable number of occurrences using an application language (Batch script preferably)


Problem description

I have a text file containing colon separated lines such as the following:

OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46

Note that not all lines have the same number of terms. I want every line to look like the first one, with exactly seven terms. Where a line runs over, each extra record should start a new line. A new record begins at the delimiter O&-, where & can only be Z or K. So the expected output for the above is:

OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes
OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2
OK-09:Helen:Rick:ID No:00000005:female:my notes3
OZ-10:Jane:James:ID No:00000034:female:my notes23
OK-09:Mary:Jane:ID No:00000023:female:my notes46

Can someone suggest a way of doing this using a text editing tool, regex, or maybe an application language such as (preferably) Batch script, Java or Python?

UPDATE

I tried using python and the regex code provided in the answer:

import csv
import re

with open('form.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', row[29])
        print(matches)

But if a cell contains multiple entries like:

OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes

It returns only the first one of them.

Recommended answer

Here is a regex based solution in Python which seems to work well:

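import re

with open('form.csv', 'r') as file:
    # Join the lines with a single space so that the last record of one line
    # and the first record of the next stay separated by the " O[KZ]" boundary.
    inp = ' '.join(line.strip() for line in file if line.strip())

matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', inp)
print(matches)

For the sample input above, this prints (wrapped here for readability):

['OK-10:Jason:Jones:ID No:00000000:male:my notes',
 'OK-10:Mike:James:ID No:00000001:male:my notes',
 'OZ-09:John:Rick:ID No:00000002:male:my notes',
 'OK-08:Michael:Knight:ID No:00000004:male:my notes2',
 'OK-09:Helen:Rick:ID No:00000005:female:my notes3',
 'OZ-10:Jane:James:ID No:00000034:female:my notes23',
 'OK-09:Mary:Jane:ID No:00000023:female:my notes46']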

Here is a brief summary of how the regex pattern works:

O[KZ]-\d+:      match the first OK/OZ-number term
(?:[^:]+:){5}   then match the next five : terms
.*?(?= O[KZ]|$) finally match the remaining seventh term
                until seeing either OK/OZ or the end of the input
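
The breakdown above can also be written directly into the pattern. The following is a minimal sketch, not part of the original answer: it uses Python's `re.VERBOSE` flag so each part carries its own comment, and runs the pattern on one of the sample lines from the question. The names `RECORD`, `line` and `records` are chosen here for illustration only.

```python
import re

# Same pattern as above, written in verbose mode so each part can be commented.
# In verbose mode unescaped whitespace is ignored, so the literal space
# in the lookahead is written as "[ ]".
RECORD = re.compile(r"""
    O[KZ]-\d+:          # first term: OK- or OZ- followed by digits
    (?:[^:]+:){5}       # the next five colon-terminated terms
    .*?                 # seventh term, matched lazily
    (?=[ ]O[KZ]|$)      # stop before " OK"/" OZ" or at the end of the input
""", re.VERBOSE)

line = ("OK-10:Mike:James:ID No:00000001:male:my notes "
        "OZ-09:John:Rick:ID No:00000002:male:my notes")

# findall returns the whole match because the pattern has no capturing groups.
records = RECORD.findall(line)
for record in records:
    print(record)
```

Because the lookahead does not consume the boundary, the next search starts right at the following record, which is why a single line can yield several matches.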

The output my script generates is a list, which you can then write back out to a text file for later import into MySQL. Note that we read the entire file into a single string variable at the beginning, so that one findall call covers every record in the file.
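
Writing the list back out could look roughly like the sketch below. It is an illustration under stated assumptions rather than the answerer's code: the helper names `split_records` and `rewrite_file` and the output file name `form_split.txt` are hypothetical, and the input is assumed to be plain text in which every record starts with OK- or OZ- and records on one line are separated by a space.

```python
import re
from pathlib import Path

# The pattern from the answer, compiled once for reuse.
RECORD = re.compile(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)')

def split_records(text):
    """Return every seven-term record found in text, one string per record."""
    # Collapse line breaks to single spaces so adjacent records stay separated.
    flat = ' '.join(line.strip() for line in text.splitlines() if line.strip())
    return RECORD.findall(flat)

def rewrite_file(src, dst):
    """Read src, split it into records, and write one record per line to dst."""
    records = split_records(Path(src).read_text())
    Path(dst).write_text('\n'.join(records) + '\n')
    return len(records)

if __name__ == '__main__':
    # Both file names are placeholders for this sketch.
    if Path('form.csv').exists():
        print(rewrite_file('form.csv', 'form_split.txt'), 'records written')
```

The resulting file has one colon-separated record per line, which is the layout the question asks for.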
