Extracting a person's age from unstructured text in Python

Problem description

I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

These are some of the patterns I have identified in the dataset. There are other patterns as well, but I have not run into them yet and I am not sure how to find them. I wrote the following code, which works pretty well but is inefficient, so it will take too much time to run on the whole dataset.

import re
import pandas as pd

#The name variables, souptext, filing_year and df_sec_parser are assumed to be defined earlier in the script (not shown)
#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip()  + r" \(",  #raw string: the parenthesis is escaped for the regex
" " + last_name.lower().strip()  + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters, starting right after the match
        #(using the match end stays correct when the pattern contains an escape such as "\(")
        age_instance_start = age_biography_instance.end()
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

I have a few questions:

  • Is there a more efficient way to extract this information?
  • Should I use a regex instead?
  • My text documents are very long and I have lots of them. Can I do one search for all the items at once?
  • What would be a strategy to detect other patterns in the dataset?

Some sentences extracted from the dataset:

  • "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
  • "George F. Rubin(14)(15) Age 68 Trustee since: 1997."
  • "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006"
  • "Mr. Lovallo, 47, was appointed Treasurer in 2011."
  • "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
  • "Mr. Botein, age 43, has been a member of our Board since our formation."

Solution

Since your text has to be processed, and not only pattern matched, the correct approach is to use one of the many NLP tools available out there.

Your aim is to use Named Entity Recognition (NER) which is usually done based on Machine Learning Models. The NER activity attempts to recognize a determined set of Entity Types in text. Examples are: Locations, Dates, Organizations and Person names.

While not 100% precise, this is much more precise than simple pattern matching (especially for English), since it relies on information beyond surface patterns, such as Part of Speech (POS) tags, Dependency Parsing, etc.

Take a look at the results I obtained for the phrases you provided by using the Allen NLP Online Tool (with the fine-grained NER model):

  • "Mr Bond, 67, is an engineer in the UK":

  • "Amanda B. Bynes, 34, is an actress"

  • "Peter Parker (45) will be our next administrator"

  • "Mr. Dylan is 46 years old."

  • "Steve Jones, Age: 32,"

Notice that this last one is wrong. As I said, not 100%, but easy to use.

The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.

The best thing: you can integrate it into your Python code:

pip install allennlp

And:

from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then, look at the resulting dict for "Date" Entities.
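As a rough sketch of that last step: the helper below collects every span carrying a given label from a tagged sentence. It assumes the prediction dict exposes parallel "words" and "tags" lists and that the tags follow the BIOUL scheme (for example "U-DATE" for a single-token date); both are assumptions about the model output, so check them against what al.predict actually returns. The function name spans_with_label and the sample dict are hypothetical and hand-written for illustration, not real model output.

```python
def spans_with_label(words, tags, wanted="DATE"):
    """Collect the text of every BIOUL-tagged span whose label equals `wanted`."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        # "U-DATE" -> prefix "U", label "DATE"; a plain "O" tag gives an empty label
        prefix, _, label = tag.partition("-")
        if label != wanted:
            current = []
            continue
        if prefix == "U":      # single-token span
            spans.append(word)
            current = []
        elif prefix == "B":    # first token of a multi-token span
            current = [word]
        elif prefix == "I":    # token inside a span
            current.append(word)
        elif prefix == "L":    # last token of a multi-token span
            current.append(word)
            spans.append(" ".join(current))
            current = []
    return spans


# Hand-written stand-in for a prediction dict (assumed shape, not real model output)
result = {
    "words": ["Mr", "Bond", ",", "67", ",", "is", "an", "engineer"],
    "tags": ["O", "U-PERSON", "O", "U-DATE", "O", "O", "O", "O"],
}
print(spans_with_label(result["words"], result["tags"]))
```

With a real predictor you would pass the "words" and "tags" entries of the returned dict instead of the sample.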

The same goes for spaCy:

!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}

(However, I have had some bad predictions with it, although it is considered better.)
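Whichever tool produces the entities, they still need to be narrowed down to plausible ages. Here is a minimal sketch of that filtering step, working on (text, label) pairs like the ones built above. It assumes ages come back labelled "DATE" or "CARDINAL", which may not hold for every model or sentence, and it reuses the 18 to 100 range and the filing-year subtraction from the question. The name ages_from_entities and the sample list are hypothetical.

```python
import re

# Assumption: an age is labelled as one of these entity types
AGE_LABELS = {"DATE", "CARDINAL"}

# A two-digit number, optionally preceded by "age"/"age:" or followed by "years old"
AGE_PATTERN = re.compile(r"(?:age[:\s]*)?(\d{2})(?:\s*years?\s*old)?", re.IGNORECASE)


def ages_from_entities(entities, filing_year, low=18, high=100):
    """Return (age, estimated_birth_year) pairs for entities that look like an age."""
    found = []
    for text, label in entities:
        if label not in AGE_LABELS:
            continue
        match = AGE_PATTERN.fullmatch(text.strip())
        if match is None:
            continue  # e.g. a four-digit year or a percentage
        age = int(match.group(1))
        if low < age < high:
            found.append((age, filing_year - age))
    return found


# Hand-written sample entities (not real model output)
sample = [("Mr Bond", "PERSON"), ("67", "DATE"), ("2010", "DATE")]
print(ages_from_entities(sample, filing_year=2012))
```

Requiring the whole entity text to match keeps years such as "2010" from being read as an age of 20.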

For more info, read this interesting article at Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
