Parse values from a block of text based on specific keys


Question

I'm parsing some text from a source outside my control, and it is not in a very convenient format. I have lines like this:

Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.

I want to split the line by keys like this:

Problem_Category = "Human Endeavors"
Problem_Subcategory = "Space Exploration"
Problem_Type = "Failure to Launch"
Software_Version = "9.8.77.omni.3"
Problem_Details = "Issue with signal barrier chamber."

The keys will always be in the same order and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what can be used as a delimiter to parse this, since colons and spaces can appear in the values as well. How can I parse this text?

Recommended answer

If your block of text is this string:

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

then

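import re

# Keys, in the order they appear in the text
names = ['Problem Category', 'Problem Subcategory', 'Problem Type', 'Software Version', 'Problem Details']

text = 'Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.'

# The capturing group makes re.split keep the matched keys in its output
pat = r'({}):'.format('|'.join(names))
# The third positional argument of re.split is maxsplit, so flags must be passed by keyword.
# [1:] drops the leading empty string; the zip/iter idiom pairs each key with the value after it.
data = dict(zip(*[iter(re.split(pat, text, flags=re.MULTILINE)[1:])]*2))
print(data)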

yields the dict

{'Problem Category': ' Human Endeavors ',
 'Problem Details': ' Issue with signal barrier chamber.',
 'Problem Subcategory': ' Space Exploration',
 'Problem Type': ' Failure to Launch',
 'Software Version': ' 9.8.77.omni.3'}
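
The values above keep their surrounding whitespace, and the keys contain spaces rather than the underscores asked for in the question. A minimal sketch of one way to tidy both, assuming the same list of key names; `parse_note` is a hypothetical helper name, and `re.escape` is added only as a precaution in case a key ever contains regex metacharacters:

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']

def parse_note(text, names=names):
    """Hypothetical helper: split text on the known keys and tidy the result."""
    # Escape each key so characters such as '.' or '(' are matched literally.
    pat = r'({}):'.format('|'.join(re.escape(n) for n in names))
    # With a capturing group, re.split returns
    # [prefix, key1, value1, key2, value2, ...]; [1:] drops the prefix.
    parts = re.split(pat, text)[1:]
    # parts[::2] are the keys, parts[1::2] the values that follow them.
    return {key.replace(' ', '_'): value.strip()
            for key, value in zip(parts[::2], parts[1::2])}

text = ('Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
        'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
        'Problem Details: Issue with signal barrier chamber.')
data = parse_note(text)
print(data['Problem_Subcategory'])
```

Slicing with `[::2]` and `[1::2]` does the same pairing as the `zip(*[iter(...)]*2)` idiom, but may be easier to read.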

So you could assign

text = df_dict['NOTE_DETAILS'][0]
...
df_dict['NOTE_DETAILS'][0] = data

and then you could access the subcategories with dict indexing:

df_dict['NOTE_DETAILS'][0]['Problem Category']

Caution, though. Deeply nested dicts/DataFrames of lists of dicts are usually a bad design. As the Zen of Python says, "Flat is better than nested."
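
One way to keep things flat, sketched here under assumptions: parse each note into its own record so that every key becomes a top-level field instead of a dict stored inside a cell. The `notes` list and the `note_id` field are made up for illustration and are not part of the question.

```python
import re

names = ['Problem Category', 'Problem Subcategory', 'Problem Type',
         'Software Version', 'Problem Details']
pat = re.compile(r'({}):'.format('|'.join(re.escape(n) for n in names)))

# Hypothetical input: one raw note string per record.
notes = [
    'Problem Category: Human Endeavors Problem Subcategory: Space Exploration'
    'Problem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3'
    'Problem Details: Issue with signal barrier chamber.',
]

rows = []
for note_id, text in enumerate(notes):
    # [prefix, key1, value1, ...] -> drop the prefix, then pair keys with values.
    parts = pat.split(text)[1:]
    row = {'note_id': note_id}  # hypothetical identifier for the record
    row.update((key, value.strip())
               for key, value in zip(parts[::2], parts[1::2]))
    rows.append(row)

# Each row is a flat dict; a list like this can be passed to pandas.DataFrame(rows)
# to get one column per key.
print(rows[0]['Problem Type'])
```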
