Extract a specific keyword from a string in R


Problem description


I have a column "place" in my table which contains data about a place that looks like:

{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }

From this, I want to extract the country name. I have tried the following code:

loc <- t1$place
loc = gsub('"', '', loc)
loc = gsub(',', '', loc)

to clean up the string and now it looks like this:

"{ id : 00ed6f0947c230f4 name : Caloocan City boundingBoxCoordinates : [ [ { longitude : 120.9607709 latitude : 14.6344661 } { longitude : 120.9607709 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.7873208 } { longitude : 121.1015117 latitude : 14.6344661 } ] ] countryCode : PH fullName : Caloocan City National Capital Region boundingBoxType : Polygon URL : https://api.twitter.com/1.1/geo/id/00ed6f0947c230f4.json accessLevel : 0 placeType : city country : Republika ng Pilipinas }"

Now to extract the country name, I want to use the word() function:

word(loc, n, sep=fixed(" : "))

where n is the position of the country name, which I have not counted yet. The function gives the correct output when n = 1 but gives an error for any other value of n:

Error in word[loc, "start"] : subscript out of bounds

Why is that happening? The loc variable certainly has more words with that separation. Or can someone suggest a better way of extracting the country name from that field?

EDIT: t1 is the data frame that contains my entire table. At present I am interested only in the place field of my table, which holds the information in the format shown above. Hence I am trying to load the place field into a separate variable called "loc" using a basic assignment:

loc <- t1$place

To read it as JSON, the place field would need to be delimited by single quotes, which it originally is not. I have 2 million rows in my table, so I really can't add the delimiters manually.

Solution

This looks like a JSON object, so it would be easier to use a JSON parser to extract the data.

So if this is your string value

x <- '{ "id" : "94965b2c45386f87", "name" : "New York", "boundingBoxCoordinates" : [ [ { "longitude" : -79.76259, "latitude" : 40.477383 }, { "longitude" : -79.76259, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 45.015851 }, { "longitude" : -71.777492, "latitude" : 40.477383 } ] ], "countryCode" : "US", "fullName" : "New York, USA", "boundingBoxType" : "Polygon", "URL" : "https://api.twitter.com/1.1/geo/id/94965b2c45386f87.json", "accessLevel" : 0, "placeType" : "admin", "country" : "United States" }'

then you can do

library(jsonlite)
# or library(RJSONIO)
# or library(rjson)

fromJSON(x)$country
# [1] "United States"
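
The single quotes around x above are only R's syntax for writing a string literal; values already stored in a data frame column are strings and need no added delimiters. To apply the same idea to a whole column, one possible sketch (not part of the original answer) is shown below. It assumes the place column is a character vector (wrap it in as.character() if it is a factor). The names place and get_country are hypothetical, and the two JSON strings are shortened stand-ins for real rows.

```r
library(jsonlite)

# Hypothetical stand-in for t1$place: a character vector of JSON strings.
place <- c(
  '{ "id" : "94965b2c45386f87", "name" : "New York", "countryCode" : "US", "country" : "United States" }',
  '{ "id" : "00ed6f0947c230f4", "name" : "Caloocan City", "countryCode" : "PH", "country" : "Republika ng Pilipinas" }'
)

# Parse one JSON string and return its country field.
# Returns NA when the value is missing, is not valid JSON,
# is not a JSON object, or has no country field.
get_country <- function(s) {
  if (is.na(s)) return(NA_character_)
  parsed <- tryCatch(fromJSON(s), error = function(e) NULL)
  if (!is.list(parsed) || is.null(parsed$country)) return(NA_character_)
  as.character(parsed$country)[1]
}

# vapply guarantees a character vector with one element per row.
country <- vapply(place, get_country, character(1), USE.NAMES = FALSE)
country
# [1] "United States"          "Republika ng Pilipinas"
```

With the real data this would be called as vapply(t1$place, get_country, character(1), USE.NAMES = FALSE). Parsing 2 million rows one by one may be slow, so it is worth timing on a sample first.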
