How can I extract structured text from free unstructured text?


Problem description

Experts!

The CT reports of a hospital (example below) were written as free, unstructured text. We need to extract data from these reports into a structured data table, for example:

Hemorrhage = Yes / No
Hydrocephalus = Yes / No
etc.







My question is: if you have ever tried something similar, what approach should I use?

Technique: Axial images through the brain were acquired from skull base to the vertex with 5 mm
slice thickness. Images were reviewed in brain, subdural and bone window settings.
Findings: There are bilateral areas of low attenuation in periventricular and subcortical white matter,
nonspecific but most compatible with microvascular changes. Cortical sulci and basilar cisterns are
normal in size and configuration. There is disproportionate ventriculomegaly involving lateral and
third ventricles primarily. There is no evidence of obstructing mass lesion. There is no intra or
extraaxial fluid collection. There is no parenchymal hemorrhage or mass lesion. There is no
evidence of acute transcortical infarction. There is no transtentorial herniation or midline shift. There
are bilateral cavernous internal carotid and vertebral arterial calcifications.
Visualized paranasal sinuses are normal. Visualized mastoid air cells and orbits are normal. Patient
is status post bilateral cataract removal surgery. Soft tissues of the scalp are normal. There is no
evidence of osseous fracture or aggressive appearing osseous lesion.
Impression:
Hydrocephalus without evidence of obstructing mass lesion. Acute hydrocephalus cannot be
excluded since there are no prior studies available for comparison. Extensive chronic white matter
changes may mask transependymal CSF edema. Correlate with short-term followup to exclude
acute hydrocephalus. Correlate with clinical symptoms to exclude normal pressure hydrocephalus.







IMPORTANT NOTE

[1] This is dummy data, NOT true patient data.
[2] This project is for research / training purposes, NOT primary care.

Recommended answer

At first glance I'd suggest something akin to map/reduce: start from the classic "word count" example, then feed its output into a "what do these words imply?" step.
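As a rough illustration of the "word count" idea, here is a minimal single-machine sketch in Python. The function names `map_report` and `reduce_counts` are made up for this example, and the two sample sentences are taken from the dummy report in the question; a real map/reduce job would distribute the map step across workers.

```python
import re
from collections import Counter
from functools import reduce

def map_report(text):
    """Map step: count lower-cased word tokens in one report."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def reduce_counts(counters):
    """Reduce step: merge the per-report counts into one total."""
    return reduce(lambda a, b: a + b, counters, Counter())

# Two sentences from the dummy report, treated as separate "reports".
reports = [
    "There is no parenchymal hemorrhage or mass lesion.",
    "Hydrocephalus without evidence of obstructing mass lesion.",
]

totals = reduce_counts(map(map_report, reports))
```

The resulting counts only show which terms occur; deciding what those terms imply is a separate step.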

It gets more complex when you have to decide whether a phrase is positive or negative. For example, if the consultant writes "The patient had no obvious signs of concussion", is that Concussion = Yes or Concussion = No?

You will need to parse the text into sentences or phrases, then run a process (I'd suggest a parallel one) that takes an indicator phrase or word and looks for it in each sentence. You also need a process to find "negations".
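The sentence splitting, indicator lookup, and negation check could be sketched roughly as follows. This is a deliberately naive illustration, not a validated clinical method: the `INDICATORS` vocabulary, the `NEGATION_CUES` list, and the rule "a cue word anywhere before the phrase in the same sentence means negated" are all assumptions made for the example. Established rule-based approaches such as NegEx treat negation scope far more carefully.

```python
import re

# Hypothetical indicator vocabulary: output column -> phrases to look for.
INDICATORS = {
    "hemorrhage": ["hemorrhage"],
    "hydrocephalus": ["hydrocephalus"],
}

# Assumed negation cues; a real list would be longer and reviewed by clinicians.
NEGATION_CUES = re.compile(r"\b(no|not|without)\b")

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_mentions(text):
    """Return one record per indicator phrase found, with a crude negation flag."""
    mentions = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for column, phrases in INDICATORS.items():
            for phrase in phrases:
                pos = lowered.find(phrase)
                if pos == -1:
                    continue
                # Naive rule: negated if a cue appears before the phrase.
                negated = bool(NEGATION_CUES.search(lowered[:pos]))
                mentions.append({
                    "column": column,
                    "phrase": phrase,
                    "sentence": sentence,
                    "negated": negated,
                })
    return mentions

sample = (
    "There is no parenchymal hemorrhage or mass lesion. "
    "Hydrocephalus without evidence of obstructing mass lesion."
)
mentions = find_mentions(sample)
```

Each sentence is processed independently, so the loop over sentences is the natural place to parallelize if the volume of reports calls for it.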

Store these as pending records, then display the text with the phrases highlighted and get a clinician to approve or alter each thing it has found.
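One possible shape for such pending records, with a helper that marks the matched phrase for display, is sketched below. The `PendingFinding` class, its field names, the `[[ ]]` highlight markers, and the `review` helper are all hypothetical choices for this example; a real system would store the records in a database and render the highlight in a user interface.

```python
from dataclasses import dataclass

@dataclass
class PendingFinding:
    column: str              # e.g. "hemorrhage"
    phrase: str              # phrase the extractor matched
    sentence: str            # sentence the phrase was found in
    value: str               # extractor's proposal: "Yes" or "No"
    status: str = "pending"  # becomes "approved" or "altered" after review

def highlight(finding, left="[[", right="]]"):
    """Return the sentence with the first match of the phrase wrapped in markers."""
    start = finding.sentence.lower().find(finding.phrase.lower())
    if start == -1:
        return finding.sentence
    end = start + len(finding.phrase)
    s = finding.sentence
    return s[:start] + left + s[start:end] + right + s[end:]

def review(finding, clinician_value):
    """Record the clinician's decision on a pending finding."""
    finding.status = "approved" if clinician_value == finding.value else "altered"
    finding.value = clinician_value
    return finding

finding = PendingFinding(
    column="hemorrhage",
    phrase="hemorrhage",
    sentence="There is no parenchymal hemorrhage or mass lesion.",
    value="No",
)
shown = highlight(finding)
review(finding, "No")
```

Keeping the extractor's proposal and the clinician's decision in the same record makes it easy to measure later how often the automatic step was overruled.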

You will also find that clinicians have a small, well-defined vocabulary, so once a phrase has been decoded and checked by a clinician, it can be found again in other files and processed accordingly.
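Reusing clinician-approved phrases on new reports might look like the sketch below. The `approved` dictionary and the `apply_approved` function are invented for this example, and plain substring matching is an assumption that would need tightening in practice (for instance, to respect word and sentence boundaries).

```python
# Hypothetical store of phrases a clinician has already decoded:
# exact lower-cased phrase -> (output column, approved value).
approved = {
    "no parenchymal hemorrhage": ("hemorrhage", "No"),
    "hydrocephalus without evidence of obstructing mass lesion": ("hydrocephalus", "Yes"),
}

def apply_approved(text, approved_phrases):
    """Fill in columns for every previously approved phrase found in the text."""
    lowered = text.lower()
    results = {}
    for phrase, (column, value) in approved_phrases.items():
        if phrase in lowered:
            results[column] = value
    return results

new_report = "Findings: There is no parenchymal hemorrhage or mass lesion."
row = apply_approved(new_report, approved)
```

Phrases that match nothing in the approved store would still go through the pending-review step.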


There are legal (liability) as well as scientific reasons why one should not attempt to automate the extraction of summary data from clinical reports like this one. Such automatically extracted data, used by medical staff without the training to appreciate the subtleties involved, could lead to negligent, or even fatal, patient care.

What you propose goes far beyond "extracting structured data": you are proposing to extract clinically significant "meaning" from complex data, a task that is at the frontier of Artificial Intelligence.

The correct strategy would be to have a form for the clinician (probably a neurologist, in this case) to fill out in which they give estimated percentages of probability for presence/absence of discrete pathologies.
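The kind of form described above could be backed by a record like the following sketch. The `CTReportForm` class, its field names, and the identifier `dummy-001` are hypothetical and chosen for illustration only; the 80 percent figure is a placeholder, not a reading of the sample report.

```python
from dataclasses import dataclass, field

@dataclass
class CTReportForm:
    report_id: str
    # pathology name -> clinician's estimated probability of presence, in percent
    probabilities: dict = field(default_factory=dict)

    def set_probability(self, pathology, percent):
        """Store an estimate, rejecting values outside 0..100."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.probabilities[pathology] = percent

form = CTReportForm(report_id="dummy-001")
form.set_probability("hydrocephalus", 80)
```

Because the clinician enters the estimate directly, the tentative wording of the report never has to be interpreted by software.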

Words like "Impression," and "Correlate with," are there for a reason: to qualify the assertions that follow as tentative, and to indicate that the findings/observations need to be interpreted after further specific investigations.

I suggest you re-design your project.

