Extracting Country Name from Author Affiliations


Question

I am currently exploring the possibility of extracting the country name from author affiliations (PubMed articles). My sample data looks like this:

Mechanical and Production Engineering Department, National University of Singapore.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.

Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.

Initially I tried removing punctuation, splitting each string into words, and comparing the words with a list of country names from Wikipedia, but this did not work.

Can anyone suggest a better way of doing this? I would prefer a solution in R, as I have to do further analysis and generate graphics in R.

Recommended answer

Here is a simple solution that might get you some of the way. It uses the database of city and country data in the maps package. If you can get hold of a better database, it should be simple to modify the code.

library(maps)
library(plyr)

# Load data from package maps
data(world.cities)

# Create test data
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Remove punctuation from data
# (gsub takes the pattern first, then the replacement, then the input vector)
caa <- gsub("[[:punct:]]", "", aa)

# Split data at word boundaries
saa <- strsplit(caa, " ")

# Match on cities in world.cities
# Assumes that if multiple matches, the last takes precedence, i.e. max()
llply(saa, function(x)x[max(which(x %in% world.cities$name))])

# Match on country names in world.cities$country.etc
llply(saa, function(x)x[which(x %in% world.cities$country.etc)])

Here is the result for cities:

[[1]]
[1] "Singapore"

[[2]]
[1] "Cambridge"

[[3]]
[1] "Cambridge"

[[4]]
[1] "Indianapolis"

And the result for countries:

[[1]]
[1] "Singapore"

[[2]]
[1] "UK"

[[3]]
[1] "UK"

[[4]]
character(0)

With a bit of data cleanup you may be able to build on this. Note that the fourth affiliation returns no country, because it names only a city, a US state abbreviation and a ZIP code.
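
One possible shape for that cleanup, as a rough sketch in base R only. It is not part of the original answer: the `country_aliases` table and the `extract_country` helper are hypothetical names, the alias entries are illustrative rather than complete, and the "state abbreviation plus 5-digit ZIP means USA" rule is an assumption that will misfire on some addresses. Matching whole phrases instead of single words lets multi-word names such as "United Kingdom" survive, which the word-splitting approach cannot do.

```r
# Hypothetical alias table: spelling seen in an affiliation -> canonical label.
# Illustrative only; a real table would need many more entries.
country_aliases <- c(
  "Singapore"      = "Singapore",
  "UK"             = "UK",
  "United Kingdom" = "UK",
  "USA"            = "USA",
  "United States"  = "USA"
)

extract_country <- function(affiliation) {
  # Drop periods so that "U.K." and "UK." both become "UK"
  text <- gsub(".", "", affiliation, fixed = TRUE)

  # Whole-phrase match with word boundaries, so multi-word names still match
  for (alias in names(country_aliases)) {
    if (grepl(paste0("\\b", alias, "\\b"), text)) {
      return(unname(country_aliases[alias]))
    }
  }

  # Fallback (assumption): a US state abbreviation followed by a 5-digit ZIP.
  # state.abb is a built-in R dataset of the 50 state abbreviations.
  m <- regmatches(text, regexec("\\b([A-Z]{2}) ([0-9]{5})\\b", text))[[1]]
  if (length(m) > 0 && m[2] %in% state.abb) {
    return("USA")
  }

  NA_character_
}

aa <- c(
  "Mechanical and Production Engineering Department, National University of Singapore.",
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
  "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

countries <- vapply(aa, extract_country, character(1), USE.NAMES = FALSE)
print(countries)
```

Affiliations that match neither the alias table nor the ZIP rule come back as `NA`, which makes it easy to list the leftovers and decide which aliases to add next.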
