Parsing a (modified) RIS file with Python

Question

I have a bunch of (modified) RIS files. A toy example looks like this:

Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015

In brief, each record starts with a "Record #" line and ends with two blank lines. The task is to parse the file and extract the tags and fields.

Pasted below is my current code (adapted from here):

import re

class RIS:
    """ RIS file structure """
    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            self.current_record = {tag: field}
        elif tag == "YR":
            self.records.append(self.current_record)
            self.current_record = None
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if not tag in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)

def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)

if __name__ == "__main__":
    main()

The current code doesn't work because it doesn't recognize the line that starts a record (e.g., Record #1 of 2), and it also doesn't know where a record ends. In the current version of the code I use ID as the start tag and YR as the stop tag. However, the code exits with the error:

ValueError: Record #1 of 2

Any suggestions on how to properly adapt the code are very welcome.

Recommended answer

You just need to add a check that skips the Record #x of 2 lines.

import re

class RIS:
    """ RIS file structure """
    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            if "#" in line:
                continue
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            self.current_record = {tag: field}
        elif tag == "YR":
            self.records.append(self.current_record)
            self.current_record = None
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if not tag in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)

def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)

if __name__ == "__main__":
    main()

The added code:

if "#" in line:
    continue
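
One caveat: the "#" in line test skips every line that contains a "#", so a field such as "TI: C# for beginners" would be dropped silently. A stricter check is sketched below. The helper name is_record_header is hypothetical, and the header shape "Record #<n> of <total>" is an assumption taken from the toy example.

```python
import re

# Assumed header shape, based on the toy example: "Record #<n> of <total>".
RECORD_HEADER = re.compile(r"^Record #\d+ of \d+$")


def is_record_header(line):
    """Return True only for lines shaped like 'Record #1 of 2'."""
    return RECORD_HEADER.match(line.strip()) is not None
```

Inside parse(), the condition would then read if is_record_header(line): continue, which leaves field values containing "#" untouched.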

The output is:

[{'AU': ['Uedo N', 'Kasiser R'],
  'ID': 'CN-01160769',
  'SO': 'United European Gastroenterology Journal',
  'TI': 'Development of an E-learning system'},
 {'AU': ['Krogh LQ'],
  'ID': 'CN-01070265',
  'SO': 'Resuscitation',
  'TI': 'E-learning in pediatric basic life support'}]
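
Note that YR is missing from this output: process_field() treats YR as the stop marker and never stores its value, and a record that lacks a YR line would never be appended. If the year is needed, one option is to let the "Record #" line open a new record instead. The following is only a sketch under that assumption; parse_records, LIST_TAGS and the header pattern are illustrative names, not part of the original code.

```python
import re

# Assumed header shape, based on the toy example: "Record #<n> of <total>".
HEADER = re.compile(r"^Record #\d+ of \d+$")
FIELD = re.compile(r"^([A-Z][A-Z0-9]): (.*)$")
LIST_TAGS = {"AU", "AD"}  # tags allowed to repeat within one record


def parse_records(lines):
    """Return a list of dicts, one per record, from an iterable of lines."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data; headers delimit records
        if HEADER.match(line):
            current = {}  # a header line opens a new record
            records.append(current)
            continue
        match = FIELD.match(line)
        if match is None or current is None:
            raise ValueError(f"Unexpected line: {line!r}")
        tag, value = match.groups()
        if tag in LIST_TAGS:
            current.setdefault(tag, []).append(value)
        elif tag in current:
            raise ValueError(f"Duplicate tag: {tag}")
        else:
            current[tag] = value  # YR is stored like any other tag
    return records


sample = """\
Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015
"""

records = parse_records(sample.splitlines())
```

Because the function accepts any iterable of lines, an open file object can be passed in place of sample.splitlines().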
