Parsing a (modified) RIS file with Python
Problem description
I have a bunch of (modified) RIS files. A toy example looks like this:
Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015
In brief, each record starts with a Record # line and ends with two blank lines. The task is to parse the file and extract the tags and fields.
Pasted below is my current code (adapted from here):
import re


class RIS:
    """ RIS file structure """

    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            self.current_record = {tag: field}
        elif tag == "YR":
            self.records.append(self.current_record)
            self.current_record = None
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if not tag in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)


def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)


if __name__ == "__main__":
    main()
The current code doesn't work because it doesn't recognize the start tag (e.g., Record #1 of 2), and it doesn't know where a record stops. In the current version of the code I use ID as the start tag and YR as the stop tag. However, the code exits with the error:
ValueError: Record #1 of 2
Any suggestions on how to properly adapt the code are very welcome.
Recommended answer
You just need to add a check that skips the Record #x of 2 lines, so that they never reach the tag regex and trigger the ValueError.
import re


class RIS:
    """ RIS file structure """

    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            if "#" in line:
                continue
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            self.current_record = {tag: field}
        elif tag == "YR":
            self.records.append(self.current_record)
            self.current_record = None
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if not tag in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)


def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)


if __name__ == "__main__":
    main()
The added code, placed at the top of the second loop in parse():
        for line in lines:
            if "#" in line:
                continue
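One caveat with the check above: it skips every line that contains a "#", so a field such as a title mentioning C# would be dropped silently. A stricter test could match the header layout itself. The following is only a sketch; the helper name is_record_header and the exact pattern are assumptions based on the toy example, not part of the original code.

```python
import re

# Assumed header layout: "Record #<n> of <total>", as in the toy example.
HEADER = re.compile(r"^Record #\d+ of \d+$")


def is_record_header(line):
    """Return True only for 'Record #x of y' lines (hypothetical helper)."""
    return HEADER.match(line.strip()) is not None


print(is_record_header("Record #1 of 2"))          # header line
print(is_record_header("TI: Learning C# online"))  # ordinary field containing '#'
```

Inside parse(), the condition if "#" in line could then become if is_record_header(line), leaving the rest of the loop unchanged.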
The output of the modified script is:
[{'AU': ['Uedo N', 'Kasiser R'],
  'ID': 'CN-01160769',
  'SO': 'United European Gastroenterology Journal',
  'TI': 'Development of an E-learning system'},
 {'AU': ['Krogh LQ'],
  'ID': 'CN-01070265',
  'SO': 'Resuscitation',
  'TI': 'E-learning in pediatric basic life support'}]
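Note that YR is missing from the output: process_field treats YR purely as a stop marker and never stores its value, and a record without a YR line would never be appended. If the year is needed, one possible alternative is to use the Record # line itself as the record boundary and store every field. The sketch below assumes the layout of the toy example; the function name parse_records and the constants are made up for illustration.

```python
import re

HEADER = re.compile(r"^Record #\d+ of \d+$")    # assumed header layout
FIELD = re.compile(r"^([A-Z][A-Z0-9]): (.*)$")  # same tag pattern as the original
LIST_TAGS = {"AU", "AD"}                        # tags that may repeat in a record


def parse_records(lines):
    """Return one dict per 'Record #x of y' block, keeping all fields."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data
        if HEADER.match(line):
            current = {}  # a header line starts a new record
            records.append(current)
            continue
        match = FIELD.match(line)
        if match is None or current is None:
            raise ValueError(line)  # unknown line, or a field before any header
        tag, field = match.groups()
        if tag in LIST_TAGS:
            current.setdefault(tag, []).append(field)
        elif tag in current:
            raise ValueError("Duplicate tag: %s" % tag)
        else:
            current[tag] = field
    return records


sample = """\
Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015
"""

records = parse_records(sample.splitlines())
print(len(records), records[0]["YR"])
```

Because a new record begins whenever a header line is seen, this version does not depend on ID being the first tag or YR being the last one.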