Parse values from a block of text based on specific keys
Problem description
I'm parsing some text from a source outside my control, that is not in a very convenient format. I have lines like this:
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
I want to split the line by keys like this:
Problem_Category = "Human Endeavors"
Problem_Subcategory = "Space Exploration"
Problem_Type = "Failure to Launch"
Software_Version = "9.8.77.omni.3"
Problem_Details = "Issue with signal barrier chamber."
The keys will always be in the same order, and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what can be used as a delimiter to parse this, since colons and spaces can appear in the values as well. How can I parse this text?
Recommended answer
If your block of text is this string:
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
then
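import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']
text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'
# The capture group makes re.split keep the matched key names in its result.
pat = r'({}):'.format('|'.join(names))
# Note: the third positional argument of re.split is maxsplit, not flags,
# so re.MULTILINE must not be passed there (and is not needed here).
# Drop the empty string before the first key, then pair key, value, key, value, ...
parts = re.split(pat, text)[1:]
data = dict(zip(parts[::2], parts[1::2]))
print(data)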
yields the dict
{'Problem Category': ' Human Endeavors ',
'Problem Details': ' Issue with signal barrier chamber.',
'Problem Subcategory': ' Space Exploration',
'Problem Type': ' Failure to Launch',
'Software Version': ' 9.8.77.omni.3'}
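Note that the values above keep their surrounding whitespace, and the keys contain spaces rather than the underscores shown in the question. A minimal sketch of one way to tidy both, assuming the same key names and text as above (the helper name parse_fields is hypothetical, not part of the original answer):

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']

def parse_fields(text, names):
    """Split text on the known key names and return a cleaned-up dict."""
    # re.escape guards against key names that contain regex metacharacters.
    pat = r'({}):'.format('|'.join(re.escape(n) for n in names))
    # Discard whatever precedes the first key (an empty string here).
    parts = re.split(pat, text)[1:]
    # Pair each key with the value that follows it; strip the whitespace
    # around values and replace spaces in keys with underscores.
    return {key.replace(' ', '_'): value.strip()
            for key, value in zip(parts[::2], parts[1::2])}

text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')
fields = parse_fields(text, names)
print(fields['Problem_Category'])
```

This still relies on the key names never appearing, followed by a colon, inside a value; if that can happen, the split will cut the value short.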
So you could assign
text = df_dict['NOTE_DETAILS'][0]
...
df_dict['NOTE_DETAILS'][0] = data
and then you could access the subcategories with dict indexing:
df_dict['NOTE_DETAILS'][0]['Problem Category']
Caution, though. Deeply nested dicts/DataFrames of lists of dicts are usually a bad design. As the Zen of Python says, "Flat is better than nested."
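As a rough illustration of the flatter alternative, the parsed fields can be merged into the record itself as top-level keys instead of being stored as a nested dict. This is only a sketch: the record layout, the 'id' field and the helper name flatten_record are assumptions made for the example, not something given in the question.

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
pat = re.compile(r'({}):'.format('|'.join(re.escape(n) for n in names)))

def flatten_record(record, note_key='NOTE_DETAILS'):
    """Return a copy of record with the note's fields as top-level keys."""
    # Keep every field except the raw note text.
    flat = {k: v for k, v in record.items() if k != note_key}
    # Split the note on the known keys and pair key, value, key, value, ...
    parts = pat.split(record[note_key])[1:]
    for key, value in zip(parts[::2], parts[1::2]):
        flat[key.replace(' ', '_')] = value.strip()
    return flat

# Hypothetical record; only the 'NOTE_DETAILS' name comes from the answer.
record = {
    'id': 1,
    'NOTE_DETAILS': 'Problem Category: Human Endeavors Problem Type: Failure to Launch',
}
flat = flatten_record(record)
print(flat)
```

Each parsed field then sits next to the record's other fields, so a list of such records maps directly onto one column per key.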