How to extract data from an asn1 data file and load it into a dataframe?


Question

My ultimate goal is to load metadata received from PubMed into a pyspark dataframe. So far, I have managed to download the data I want from the PubMed database using a shell script. The downloaded data is in asn1 format. Here is an example of a data entry:

Pubmed-entry ::= {
  pmid 31782536,
  medent {
    em std {
      year 2019,
      month 11,
      day 30,
      hour 6,
      minute 0
    },
    cit {
      title {
        name "Impact of CYP2C19 genotype and drug interactions on voriconazole
 plasma concentrations: a spain pharmacogenetic-pharmacokinetic prospective
 multicenter study."
      },
      authors {
        names std {
          {
            name ml "Blanco Dorado S",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
 of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
 University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
          },
          {
            name ml "Maronas O",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain."
          },
          {
            name ml "Latorre-Pellicer A",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain."
          },
          {
            name ml "Rodriguez Jato T",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Lopez-Vizcaino A",
            affil str "Pharmacy Department, University Hospital Lucus Augusti
 (HULA). Lugo, Spain."
          },
          {
            name ml "Gomez Marquez A",
            affil str "Pharmacy Department, University Hospital Ourense
 (CHUO). Ourense, Spain."
          },
          {
            name ml "Bardan Garcia B",
            affil str "Pharmacy Department, University Hospital Ferrol (CHUF).
 A Coruna, Spain."
          },
          {
            name ml "Belles Medall D",
            affil str "Pharmacy Department, General University Hospital
 Castellon (GVA). Castellon, Spain."
          },
          {
            name ml "Barbeito Castineiras G",
            affil str "Microbiology Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Perez Del Molino Bernal ML",
            affil str "Microbiology Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Campos-Toimil M",
            affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
 Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Otero Espinar F",
            affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
 Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Blanco Hortas A",
            affil str "Epidemiology Unit. Fundacion Instituto de Investigacion
 Sanitaria de Santiago de Compostela (FIDIS), University Hospital Lucus
 Augusti (HULA), Spain."
          },
          {
            name ml "Duran Pineiro G",
            affil str "Clinical Pharmacology Group, University Clinical
 Hospital, Health Research Institute of Santiago de Compostela (IDIS).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Zarra Ferro I",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain."
          },
          {
            name ml "Carracedo A",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain.; Galician Foundation of Genomic Medicine,
 Health Research Institute of Santiago de Compostela (IDIS), SERGAS, Santiago
 de Compostela, Spain."
          },
          {
            name ml "Lamas MJ",
            affil str "Clinical Pharmacology Group, University Clinical
 Hospital, Health Research Institute of Santiago de Compostela (IDIS).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Fernandez-Ferreiro A",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
 of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
 University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
          }
        }
      },
      from journal {
        title {
          iso-jta "Pharmacotherapy",
          ml-jta "Pharmacotherapy",
          issn "1875-9114",
          name "Pharmacotherapy"
        },
        imp {
          date std {
            year 2019,
            month 11,
            day 29
          },
          language "eng",
          pubstatus aheadofprint,
          history {
            {
              pubstatus other,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus pubmed,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus medline,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            }
          }
        }
      },
      ids {
        pubmed 31782536,
        doi "10.1002/phar.2351",
        other {
          db "ELocationID doi",
          tag str "10.1002/phar.2351"
        }
      }
    },
    abstract "BACKGROUND: Voriconazole, a first-line agent for the treatment
 of invasive fungal infections, is mainly metabolized by cytochrome P450 (CYP)
 2C19. A significant portion of patients fail to achieve therapeutic
 voriconazole trough concentrations, with a consequently increased risk of
 therapeutic failure. OBJECTIVE: To show the association between
 subtherapeutic voriconazole concentrations and factors affecting voriconazole
 pharmacokinetics: CYP2C19 genotype and drug-drug interactions. METHODS:
 Adults receiving voriconazole for antifungal treatment or prophylaxis were
 included in a multicenter prospective study conducted in Spain. The
 prevalence of subtherapeutic voriconazole troughs were analyzed in the rapid
 metabolizer and ultra-rapid metabolizer patients (RMs and UMs, respectively),
 and compared with the rest of the patients. The relationship between
 voriconazole concentration, CYP2C19 phenotype, adverse events (AEs), and
 drug-drug interactions was also assessed. RESULTS: In this study 78 patients
 were included with a wide variability in voriconazole plasma levels with only
 44.8% of patients attaining trough concentrations within the therapeutic
 range of 1 and 5.5 microg/ml. The allele frequency of *17 variant was found
 to be 29.5%. Compared with patients with other phenotypes, RMs and UMs had a
 lower voriconazole plasma concentration (RM/UM: 1.85+/-0.24 microg/ml versus
 other phenotypes: 2.36+/-0.26 microg/ml, ). Adverse events were more common
 in patients with higher voriconazole concentrations (p<0.05). No association
 between voriconazole trough concentration and other factors (age, weight,
 route of administration, and concomitant administration of enzyme inducer,
 enzyme inhibitor, glucocorticoids, or proton pump inhibitors) was found.
 CONCLUSION: These results suggest the potential clinical utility of using
 CYP2C19 genotype-guided voriconazole dosing to achieve concentrations in the
 therapeutic range in the early course of therapy. Larger studies are needed
 to confirm the impact of pharmacogenetics on voriconazole pharmacokinetics.",
    pmid 31782536,
    pub-type {
      "Journal Article"
    },
    status publisher
  }
}

This is where I am stuck. I do not know how to extract the information from asn1 and get it into a pyspark dataframe. Could anyone suggest a way of doing this?

Recommended answer

The above data is definitely in an "ASN.1 format". This format is called ASN.1 Value Notation and is used to represent ASN.1 values textually. (This format pre-dates the standardization of the JSON encoding rules. Today, one could use JSON for the same purpose, with some differences in the way the JSON would be processed compared to the ASN.1 value notation).

The ASN.1 schema that YaFred posted above contains a few errors, as YaFred himself noted. The notation you posted yourself also seems to contain a few errors. I have looked at the whole set of NCBI's ASN.1 files and noticed that they contain several errors. Because of this, they cannot be handled by a standard-conforming ASN.1 tool (such as the ASN.1 playground) unless they are fixed. Some of those errors are easy to fix, but fixing others requires knowing what the author of those files intended. This state of affairs is probably due to the fact that the NCBI project uses its own ASN.1 toolkit, which perhaps uses ASN.1 in some non-standard way.

I would imagine that in the NCBI toolkit there should be some means for you to decode the above value notation, so if I were you I would look into that toolkit. I am unable to give you a better suggestion because I don't know the NCBI toolkit.
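
If the NCBI toolkit is not an option, one workaround (not the NCBI toolkit, and not a standard-conforming ASN.1 decoder) is to treat the value notation as plain text and parse it by hand. The following is only a minimal sketch under several assumptions: that every entry has the shape `TypeName ::= { ... }`, that line breaks inside quoted strings are wrapping artifacts that can be dropped, that `""` inside a string is an escaped quote, and that choice selectors such as `std`, `ml` and `str` can be ignored. The helper names (`tokenize`, `parse_value`, `parse_entries`, `to_row`) are made up for this illustration, and `SAMPLE` is an abridged version of the entry in the question.

```python
import re

# Tokens: quoted strings (may span lines), "::=", braces, commas, bare words.
TOKEN = re.compile(r'"(?:[^"]|"")*"|::=|[{},]|[^\s{},"]+')


def tokenize(text):
    return TOKEN.findall(text)


def parse_value(tokens, pos):
    """Parse a quoted string or a { ... } block starting at tokens[pos]."""
    tok = tokens[pos]
    if tok.startswith('"'):
        # Assumption: newlines inside strings are wrapping artifacts.
        return tok[1:-1].replace('""', '"').replace("\n", ""), pos + 1
    if tok != "{":
        raise ValueError(f"unexpected token {tok!r}")
    pos += 1
    items = []
    while tokens[pos] != "}":
        # Collect bare words: a field name, optional selectors, or a scalar.
        words = []
        while tokens[pos] not in ("{", "}", ",") and not tokens[pos].startswith('"'):
            words.append(tokens[pos])
            pos += 1
        if tokens[pos] == "{" or tokens[pos].startswith('"'):
            value, pos = parse_value(tokens, pos)
        else:
            # No string or block follows, so the last word is the value.
            last = words.pop()
            value = int(last) if last.isdigit() else last
        # First word is the field name; selectors (std, ml, ...) are ignored.
        items.append((words[0] if words else None, value))
        if tokens[pos] == ",":
            pos += 1
    pos += 1  # skip the closing brace
    if items and all(key is None for key, _ in items):
        return [value for _, value in items], pos  # unnamed items: a list
    return dict(items), pos


def parse_entries(text):
    """Parse a file of concatenated 'TypeName ::= { ... }' entries."""
    tokens = tokenize(text)
    pos, entries = 0, []
    while pos < len(tokens):
        if tokens[pos + 1] != "::=":
            raise ValueError("expected 'TypeName ::= { ... }'")
        value, pos = parse_value(tokens, pos + 2)
        entries.append(value)
    return entries


def to_row(entry):
    """Flatten one parsed entry into a flat dict (one dataframe row)."""
    cit = entry["medent"]["cit"]
    return {
        "pmid": entry["pmid"],
        "title": cit["title"]["name"],
        "journal": cit["from"]["title"]["name"],
        "authors": [author["name"] for author in cit["authors"]["names"]],
        "abstract": entry["medent"].get("abstract"),
    }


SAMPLE = '''Pubmed-entry ::= {
  pmid 31782536,
  medent {
    cit {
      title {
        name "Impact of CYP2C19 genotype and drug interactions on voriconazole
 plasma concentrations."
      },
      authors {
        names std {
          { name ml "Blanco Dorado S" },
          { name ml "Maronas O" }
        }
      },
      from journal {
        title { name "Pharmacotherapy" }
      }
    },
    abstract "BACKGROUND: ...",
    status publisher
  }
}
'''

rows = [to_row(entry) for entry in parse_entries(SAMPLE)]
print(rows)
```

The resulting `rows` is a plain list of dictionaries, so it could then be handed to something like `spark.createDataFrame(rows)` (assuming an existing `SparkSession` named `spark`) or to `pandas.DataFrame(rows)`. Entries that lack one of the fields used in `to_row` would raise a `KeyError`, so real data would likely need more defensive lookups.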
