Split string with variable number of occurrences using an application language (Batch script preferably)
Problem description
I have a text file containing colon-separated lines such as the following:
OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46
Note carefully that not all lines have the same number of terms. I want each line to look like the first one, namely with seven terms only. For lines that run over, a new line should be formed. The new-line delimiter is O&-, where & can only be Z or K. So the expected output from the above is:
OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes
OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2
OK-09:Helen:Rick:ID No:00000005:female:my notes3
OZ-10:Jane:James:ID No:00000034:female:my notes23
OK-09:Mary:Jane:ID No:00000023:female:my notes46
Can someone suggest a way of doing this using a text editing tool, a regex, or an application language such as (preferably) Batch script, Java or Python?
UPDATE
I tried using Python and the regex code provided in the answer:
import csv
import re

with open('form.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', row[29])
        print(matches)
But if a cell contains multiple entries like:
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
it returns only the first one of them.
Recommended answer
Here is a regex-based solution in Python that seems to work well:
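import re

with open('form.csv', 'r') as file:
    # Join lines with a space (not an empty string) so that the lookahead
    # " O[KZ]" also separates records that sit on different lines.
    inp = file.read().replace('\n', ' ').strip()

matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', inp)
print(matches)

For the sample input in the question, this prints:
['OK-10:Jason:Jones:ID No:00000000:male:my notes',
 'OK-10:Mike:James:ID No:00000001:male:my notes',
 'OZ-09:John:Rick:ID No:00000002:male:my notes',
 'OK-08:Michael:Knight:ID No:00000004:male:my notes2',
 'OK-09:Helen:Rick:ID No:00000005:female:my notes3',
 'OZ-10:Jane:James:ID No:00000034:female:my notes23',
 'OK-09:Mary:Jane:ID No:00000023:female:my notes46']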
Here is a brief summary of how the regex pattern works:
O[KZ]-\d+: match the first OK/OZ-number term
(?:[^:]+:){5} then match the next five : terms
.*?(?= O[KZ]|$) finally match the remaining sixth term
until seeing either OK/OZ or the end of the input
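The breakdown above can also be written directly into the pattern. The following is only a sketch: it uses Python's re.VERBOSE flag so that each part carries its own comment, and the names RECORD and line are made up for illustration. The space in the lookahead is written as [ ] because verbose mode ignores bare whitespace.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
RECORD = re.compile(r"""
    O[KZ]-\d+:          # first term: OK- or OZ-, digits, then a colon
    (?:[^:]+:){5}       # the next five colon-terminated terms
    .*?                 # the last term, matched lazily
    (?=[ ]O[KZ]|$)      # stop before " OK" / " OZ" or at the end of the input
""", re.VERBOSE)

# A line holding two records, taken from the question.
line = ("OK-10:Mike:James:ID No:00000001:male:my notes "
        "OZ-09:John:Rick:ID No:00000002:male:my notes")

for record in RECORD.findall(line):
    print(record)
```

Each record found in the line is printed on its own line, which is the shape the question asks for.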
The output my script generates is a list, which you may then write back out to a text file, to later import into MySQL. Note that we read the entire file into a single string at the beginning, so that one re.findall call can scan all of the records in a single pass.
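Writing the list back out could look roughly like the sketch below. The helper names split_records and write_records are hypothetical and not part of the original answer. The sketch assumes the input is plain text with records separated by a single space or a line break.

```python
import re

PATTERN = re.compile(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)')

def split_records(text):
    """Return every seven-term record found in text (hypothetical helper)."""
    # Join the physical lines with one space so the " O[KZ]" lookahead
    # also separates records that were on different lines.
    flat = ' '.join(part.strip() for part in text.splitlines() if part.strip())
    return PATTERN.findall(flat)

def write_records(records, out_path):
    """Write one record per line to out_path (hypothetical helper)."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for record in records:
            out.write(record + '\n')
```

With these helpers, something like write_records(split_records(open('form.csv').read()), 'form_split.txt') would produce a file with one record per line. The output name form_split.txt is made up for this example.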