From a list of strings, identify which are human names and which are not
Question
I have a vector like the one below and would like to determine which elements in the list are human names and which are not. I found the humaniformat package, which formats names but unfortunately does not determine if a string is in fact a name. I also found a few packages for entity extraction, but they seem to require actual text for part-of-speech tagging, rather than a single name.
Example
pkd.names.quotes <- c("Mr. Rick Deckard", # Name
"Do Androids Dream of Electric Sheep", # Not a name
"Roy Batty", # Name
"How much is an electric ostrich?", # Not a name
"My schedule for today lists a six-hour self-accusatory depression.", # Not a name
"Upon him the contempt of three planets descended.", # Not a name
"J.F. Sebastian", # Name
"Harry Bryant", # Name
"goat class", # Not a name
"Holden, Dave", # Name
"Leon Kowalski", # Name
"Dr. Eldon Tyrell") # Name
Recommended answer
Here is one approach. The US Census Bureau tabulates a list of surnames occurring more than 100 times in its database (with frequency): all 152,000 of them. If you use the full list, every one of your strings contains a name, because words such as "class", "him" and "the" are names in certain languages (not sure which languages, though). Similarly, there are many lists of first names (see this post).
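To see why the size of the dictionary matters, here is a minimal base R sketch. The ranked list `ranked_names` and the helper `contains_name` are hypothetical stand-ins made up for illustration; they are not the real Census data or part of any package.

```r
# Hypothetical ranked name list (most common first); NOT the real Census data.
ranked_names <- c("smith", "bryant", "batty", "class", "him", "the")

strings <- c("Roy Batty", "goat class", "Upon him the contempt")

# Returns TRUE for each string in which at least one lowercased word
# appears in the dictionary `dict`.
contains_name <- function(x, dict) {
  words <- strsplit(tolower(x), "[^a-z]+")
  vapply(words, function(w) any(w %in% dict), logical(1))
}

full    <- contains_name(strings, ranked_names)       # whole list
trimmed <- contains_name(strings, ranked_names[1:3])  # most common entries only
```

With the whole toy list every string is flagged, whereas keeping only the top entries flags just "Roy Batty". The same trade-off applies to the real lists: a longer dictionary catches more names but also more ordinary words.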
The code below grabs all the surnames from the 2000 Census and a list of first names from the post cited, subsets each list to its 10,000 most common entries, combines and cleans the lists, and uses the result as a dictionary in the tm package to identify which strings contain names. You can control the "sensitivity" by altering the freq variable (freq = 10000 seems to generate the result you want).
url <- "http://www2.census.gov/topics/genealogy/2000surnames/names.zip"
tf <- tempfile()
download.file(url,tf, mode="wb") # download archive of surname data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
surnames <- read.csv(files[grepl("\\.csv$",files)]) # 152,000 surnames occurring >100 times
url <- "http://deron.meranda.us/data/census-derived-all-first.txt"
firstnames <- read.table(url(url), header=FALSE) # first names are in column V1
freq <- 10000 # keep only the first `freq` entries of each list
dict <- unique(c(tolower(surnames$name[1:freq]), tolower(firstnames$V1[1:freq])))
library(tm)
corp <- Corpus(VectorSource(pkd.names.quotes)) # one document per string
tdm <- TermDocumentMatrix(corp, control=list(tolower=TRUE, dictionary=dict)) # count dictionary terms only
m <- as.matrix(tdm)
m <- m[rowSums(m)>0,] # drop dictionary terms that never occur
m
#            Docs
# Terms       1 2 3 4 5 6 7 8 9 10 11 12
#   bryant    0 0 0 0 0 0 0 1 0  0  0  0
#   dave      0 0 0 0 0 0 0 0 0  1  0  0
#   deckard   1 0 0 0 0 0 0 0 0  0  0  0
#   eldon     0 0 0 0 0 0 0 0 0  0  0  1
#   harry     0 0 0 0 0 0 0 1 0  0  0  0
#   kowalski  0 0 0 0 0 0 0 0 0  0  1  0
#   leon      0 0 0 0 0 0 0 0 0  0  1  0
#   rick      1 0 0 0 0 0 0 0 0  0  0  0
#   roy       0 0 1 0 0 0 0 0 0  0  0  0
#   sebastian 0 0 0 0 0 0 1 0 0  0  0  0
#   tyrell    0 0 0 0 0 0 0 0 0  0  0  1
which(colSums(m)>0)
# 1 3 7 8 10 11 12
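The indices above can be mapped back onto the original strings. The following is a small base R sketch of one way to do that; the names `hits`, `is_name`, `result`, `names_only` and `not_names` are illustrative, and `hits` is simply copied from the output shown above rather than recomputed.

```r
# The example vector from the question.
pkd.names.quotes <- c("Mr. Rick Deckard",
                      "Do Androids Dream of Electric Sheep",
                      "Roy Batty",
                      "How much is an electric ostrich?",
                      "My schedule for today lists a six-hour self-accusatory depression.",
                      "Upon him the contempt of three planets descended.",
                      "J.F. Sebastian",
                      "Harry Bryant",
                      "goat class",
                      "Holden, Dave",
                      "Leon Kowalski",
                      "Dr. Eldon Tyrell")

# Document indices reported by which(colSums(m) > 0) in the output above.
hits <- c(1, 3, 7, 8, 10, 11, 12)

# Flag each string according to whether its index is among the hits.
is_name <- seq_along(pkd.names.quotes) %in% hits
result <- data.frame(text = pkd.names.quotes, is_name = is_name,
                     stringsAsFactors = FALSE)

names_only <- result$text[result$is_name]   # strings flagged as names
not_names  <- result$text[!result$is_name]  # everything else
```

In a live session you would probably replace the hard-coded `hits` with `which(colSums(m) > 0)` so the labels follow whatever `freq` you chose.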