Extracting Country Name from Author Affiliations


Question

I am currently exploring the possibility of extracting country names from author affiliations (PubMed articles). My sample data looks like this:

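Mechanical and Production Engineering Department, National University of Singapore.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.

Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.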

Initially I tried removing punctuation, splitting each string into words, and comparing the words with a list of country names from Wikipedia, but this did not work.

Can anyone suggest a better way of doing this? I would prefer a solution in R, as I have to do further analysis and generate graphics in R.

Recommended answer

Here is a simple solution that might get you some of the way there. It uses the city and country database (world.cities) that ships with the maps package. If you can get hold of a better database, it should be simple to modify the code.

library(maps)
library(plyr)

# Load data from package maps
data(world.cities)

# Create test data
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Remove punctuation from data
# gsub() takes its arguments in the order (pattern, replacement, x)
caa <- gsub("[[:punct:]]", "", aa)

# Split data at word boundaries
saa <- strsplit(caa, " ")

# Match on cities in world.cities
# Assumes that if multiple matches, the last takes precedence, i.e. max()
llply(saa, function(x)x[max(which(x %in% world.cities$name))])

# Match on country names in world.cities$country.etc
llply(saa, function(x)x[which(x %in% world.cities$country.etc)])

Here are the results for cities:

[[1]]
[1] "Singapore"

[[2]]
[1] "Cambridge"

[[3]]
[1] "Cambridge"

[[4]]
[1] "Indianapolis"

And the results for countries:

[[1]]
[1] "Singapore"

[[2]]
[1] "UK"

[[3]]
[1] "UK"

[[4]]
character(0)
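
The fourth affiliation names no country, but its city was matched. One possible way to fill that gap is to look the matched city up and take its country. The sketch below uses a tiny hand-made table, `cities`, with the same `name`, `country.etc` and `pop` columns as `world.cities`, so that it runs without the maps package. The rows and population figures are illustrative only, and both the `city_to_country` helper and the rule of picking the most populous city of a given name are assumptions, not part of the answer above.

```r
# Hypothetical stand-in for world.cities (same column names, made-up rows)
cities <- data.frame(
  name        = c("Indianapolis", "Cambridge", "Cambridge"),
  country.etc = c("USA", "UK", "New Zealand"),
  pop         = c(800000, 120000, 15000),  # illustrative figures only
  stringsAsFactors = FALSE
)

# Return the country of the most populous city with this name, or NA
city_to_country <- function(city, db = cities) {
  hits <- db[db$name == city, , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  hits$country.etc[which.max(hits$pop)]
}

# Look up two known cities and one that is not in the table
city_countries <- vapply(c("Indianapolis", "Cambridge", "Atlantis"),
                         city_to_country, character(1), USE.NAMES = FALSE)
city_countries
```

Swapping `cities` for `world.cities` should work in the same way, although ambiguous names such as Cambridge may then resolve to a different country than intended.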

With a bit of data cleanup you may be able to do something with this.
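
As one illustration of such cleanup, the sketch below looks only at the last comma-separated field of each affiliation, strips punctuation and digits, and maps what is left to a canonical country name. The `aliases` table and the `extract_country` helper are hypothetical and deliberately tiny; a real alias list (country name variants, US state codes and so on) would have to be built separately.

```r
# Hypothetical alias table: cleaned token -> canonical country name
aliases <- c(
  "UK"        = "United Kingdom",
  "IN"        = "USA",        # US state code (Indiana), assumed to imply USA
  "Singapore" = "Singapore"
)

extract_country <- function(affil, lookup = aliases) {
  # Keep only the text after the last comma (the whole string if there is none)
  last_field <- sub(".*,", "", affil)
  # Drop punctuation and digits, then trim surrounding whitespace
  cleaned <- trimws(gsub("[[:punct:][:digit:]]", "", last_field))
  # Try the whole field first, then fall back to its last word
  last_word <- sub(".* ", "", cleaned)
  key <- ifelse(cleaned %in% names(lookup), cleaned, last_word)
  # Unknown keys yield NA
  unname(lookup[key])
}

aa <- c(
  "Mechanical and Production Engineering Department, National University of Singapore.",
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
  "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
  "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

countries <- extract_country(aa)
countries
```

Restricting the match to the last field avoids false hits on words earlier in the string, at the cost of missing affiliations that put the country elsewhere or append an e-mail address.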
