Extracting related dates and locations from a sentence


Problem description

I'm working with written text (paragraphs from articles and books) that contains both locations and dates. I want to extract pairs of locations and dates that are associated with one another. For example, given the following phrase:

The man left Amsterdam on January and reached Nepal on October 21st

I would like to get this output:

>>>[(Amsterdam, January), (Nepal, October 21st)]

I tried splitting the text on "connecting words" (such as "and") and then processing each part as follows: find words that indicate a location ("at", "in", "from", "to", etc.) and words that indicate a date or time ("on", "during", etc.), then join whatever was found. This proved problematic: too many words can indicate both a location and a date, and a basic "find" approach sometimes cannot tell the two apart.

Assume that I can already identify a date as such, and that, given a word that starts with a capital letter, I can determine whether it is a location. The main issue is connecting the two, and making sure that a given location and date really are related.

I figured that tools like NLTK and spaCy would help me here, but I couldn't find enough documentation to arrive at an exact solution for this kind of problem.

Any help would be greatly appreciated!

Recommended answer

This looks like a Named Entity Recognition (NER) problem. The steps for applying NER are listed below. For a detailed understanding, please refer to this article.

  1. Download Stanford NER.
  2. Unzip the downloaded folder and save it on a drive.
  3. Copy "stanford-ner.jar" from the folder and save it just outside the folder.
  4. Download the caseless models from https://stanfordnlp.github.io/CoreNLP/history.html by clicking on "caseless". The models that ship with Stanford NER also work; however, the caseless models help identify named entities even when they are not capitalized as formal grammar rules require.
  5. Run the following Python code. Please note that this code worked on a Windows 10, 64-bit machine with Python 2.7.

Note: make sure every path is updated to the corresponding path on your local machine.

#Import all the required libraries.
import os
from nltk.tag import StanfordNERTagger

#Set environmental variables programmatically.
#Set the classpath to the path where the jar file is located
os.environ['CLASSPATH'] = "<your path>/stanford-ner-2015-04-20/stanford-ner.jar"
#Set the Stanford models to the path where the models are stored
os.environ['STANFORD_MODELS'] = '<your path>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner'

#Set the java jdk path. This code worked with this particular java jdk
java_path = "C:/Program Files/Java/jdk1.8.0_191/bin/java.exe"
os.environ['JAVAHOME'] = java_path


#Set the path to the model that you would like to use
stanford_classifier  =  '<your path>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz'

#Build NER tagger object
st = StanfordNERTagger(stanford_classifier)

#A sample text for NER tagging
text = 'The man left Amsterdam on January and reached Nepal on October 21st'

#Tag the sentence and print output
tagged = st.tag(str(text).split())
print(tagged)
#[(u'The', u'O'), 
# (u'man', u'O'), 
# (u'left', u'O'), 
# (u'Amsterdam', u'LOCATION'), 
# (u'on', u'O'), 
# (u'January', u'DATE'), 
# (u'and', u'O'), 
# (u'reached', u'O'), 
# (u'Nepal', u'LOCATION'), 
# (u'on', u'O'), 
# (u'October', u'DATE'), 
# (u'21st', u'DATE')]
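
The tagger labels each token separately, so a multi-word date such as "October 21st" comes back as two DATE tokens. The following is a minimal sketch of one way to merge adjacent tokens that share a tag into a single entity span. The helper name `merge_entities` is hypothetical and not part of NLTK, and the `tagged` list is copied from the output above rather than produced by a live tagger call.

```python
from itertools import groupby

# Tagger output copied from the example above (written as Python 3 strings).
tagged = [
    ('The', 'O'), ('man', 'O'), ('left', 'O'), ('Amsterdam', 'LOCATION'),
    ('on', 'O'), ('January', 'DATE'), ('and', 'O'), ('reached', 'O'),
    ('Nepal', 'LOCATION'), ('on', 'O'), ('October', 'DATE'), ('21st', 'DATE'),
]


def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share a non-'O' tag into one span.

    Hypothetical helper for illustration; returns (text, tag) tuples
    in sentence order.
    """
    spans = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            spans.append((' '.join(token for token, _ in group), tag))
    return spans


entities = merge_entities(tagged)
print(entities)
# [('Amsterdam', 'LOCATION'), ('January', 'DATE'),
#  ('Nepal', 'LOCATION'), ('October 21st', 'DATE')]
```

One limitation of this sketch: two distinct entities of the same type that sit directly next to each other, with no other token between them, would be merged into a single span.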

This approach works for the majority of cases.
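
The tagging step identifies locations and dates but does not by itself say which date belongs to which location. As a rough sketch only, and not part of the original answer, one simple heuristic is to pair each location with the first date that follows it before another location appears. The function name `pair_locations_with_dates` is hypothetical, and the input is assumed to be a list of (text, tag) entity spans in sentence order, with multi-token dates already merged into one span.

```python
def pair_locations_with_dates(entities):
    """Pair each LOCATION with the first DATE that follows it.

    `entities` is a list of (text, tag) spans in sentence order.
    A location that is followed by another location before any date
    is dropped, and a date with no preceding location is ignored.
    """
    pairs = []
    pending_location = None
    for text, tag in entities:
        if tag == 'LOCATION':
            pending_location = text
        elif tag == 'DATE' and pending_location is not None:
            pairs.append((pending_location, text))
            pending_location = None
    return pairs


# Entity spans for the sample sentence (assumed input, see lead-in).
entities = [
    ('Amsterdam', 'LOCATION'), ('January', 'DATE'),
    ('Nepal', 'LOCATION'), ('October 21st', 'DATE'),
]
print(pair_locations_with_dates(entities))
# [('Amsterdam', 'January'), ('Nepal', 'October 21st')]
```

This ordering heuristic fits sentences shaped like the sample one. Sentences where the date precedes its location ("In January he left Amsterdam") or where several locations share one date would need a different rule, for example one based on a dependency parse.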
