Pyparsing:提取可变长度,可变内容,可变空白子字符串 [英] Pyparsing: extract variable length, variable content, variable whitespace substring

查看:100
本文介绍了Pyparsing:提取可变长度,可变内容,可变空白子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要从前列腺切除术最终诊断记录的平面文件中提取格里森分数.这些分数始终带有单词Gleason和两个数字,这些数字加起来便是另一个数字.在过去的二十多年中,人类一直在打字.包括各种空白约定和修饰符.以下是到目前为止我的Backus-Naur表格和两个示例记录.仅针对前列腺切除术,我们正在研究一千多例.

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another number. Humans typed these in over two decades. Various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases.

我之所以使用pyparsing是因为我正在学习python,并且对自己非常有限的正则表达式编写经验并不怀念.

I am using pyparsing because I'm learning python, and have no fond memories of my very limited exposure to regex writing.

我的问题:如何在不解析最终诊断中可能存在或可能不存在的所有其他可选数据的情况下剔除这些格里森分数?

My question: how can I pluck out these Gleason grades without parsing every single other optional piece of data that may or may not be in these final diagnoses?

num = Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>


01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.                                                      

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR                  
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              
                                   ADDITIONAL FINDINGS:  HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,   
                                      ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE    
                                      CARCINOMA                                                            

                                   PATHOLOGIC STAGE:  pT3b N1 MX                                           

                               B.  LYMPH NODES, RIGHT PELVIC, EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).         

                               C.  LYMPH NODES, LEFT PELVIC, EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                               PATHOLOGIC STAGE (pTNM):  pT2c NX.                                       

完全公开:我是一名从事研究的医师;这是我第一次使用python.我已经阅读了Lutz的《学习Python》,Shaw的《艰难的学习Python》,并研究了各种问题.我在该论坛,pyparsing Wiki上审查了许多与pyparsing相关的问题,并且我购买并阅读了McGuire先生的"Pyparsing入门".也许我是在问一个问题,何时才应该真正告诉我我站在沮丧的死亡螺旋,当您必须编写解析器时,这种螺旋非常普遍"(McGuire,17岁)?我不知道.到目前为止,我很高兴能够从事实际上可能是一个真正的项目.

Full disclosure: I'm a physician doing research; this is my first real work with python. I have read Lutz's Learning Python, Shaw's Learning Python the Hard Way, and worked through various problem sets. I have reviewed numerous pyparsing related questions on this forum, the pyparsing wiki, and I bought and read Mr McGuire's Getting Started with Pyparsing. Perhaps I am asking a question when I should really be told I am standing at "The death spiral of frustation that is so common when you have to write parsers" (McGuire, 17)? I don't know. So far I'm just happy to be working on what may actually be a real project.

推荐答案

以下是提取患者数据和所有匹配的Gleason数据的示例.

Here is a sample to pull out the patient data and any matching Gleason data.

from pyparsing import *
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE:  3 + 3 = 6' == gleason

patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11  S11-4444 20/111-22-3333' == patientData

partMatch = patientData("patientData") | gleason("gleason")

lastPatientData = None
for match in partMatch.searchString(data):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print "bad!"
            continue
        print "{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                        lastPatientData.patientData, match.gleason
                        )

打印:

01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)

这篇关于Pyparsing:提取可变长度,可变内容,可变空白子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆