R Selecting column in a data frame by column in another data frame


Question

I am facing a problem when trying to subset my data; maybe you could help me. I need to subset the first data frame by a column, keeping the rows where that column's value equals the value of a column in the second data frame.

These are the data frames I'm using:

> head(places)
  Zona   Poble     lat       lon      alt
1    1  Zorita 40.7353 -0.165748  691.867
2    1 Morella 40.6287 -0.113284  955.719
3    1 Forcall 40.6621 -0.209759  753.882
4    2 Benasal 40.3943 -0.126111  848.171
5    2    Cati 40.4532  0.060409  667.610
6    2  Fredes 40.7079  0.167981 1194.730

> head(data)
      date   time stat_id     lat     lon    tempc
1 20121122 000000       1 40.7353 -0.1657  7.98737
2 20121122 000000       2 40.6287 -0.1133  6.49903
3 20121122 000000       3 40.6621 -0.2098  7.72955
4 20121122 000000       4 40.3943 -0.1261  7.98837
5 20121122 000000       5 40.4532  0.0604 10.35480
6 20121122 000000       6 40.7079  0.1680  6.00769

As you can see, the first three places in the data frame "places" belong to Zona == 1 and share lat/lon with the first three rows of the data frame "data". I would like to select the rows in data that share lat/lon with the places where Zona == i in places.dat.

The R script I am trying is

datos=read.table("data.dat",header=T)
places=read.table("places.dat",header=T)

data=as.data.frame(datos)
place=as.data.frame(places)

data$time[data$time == 0] = "000000"

subset(data,data$lat == place$lat[place$Zona == 1])

So subset should return three rows for each time in data.dat, but it only selects two of the three, as follows:

> subset(data,data$lat == place$lat[place$Zona == 1])
         date   time stat_id     lat     lon    tempc
1    20121122 000000       1 40.7353 -0.1657  7.98737
2    20121122 000000       2 40.6287 -0.1133  6.49903
385  20121122  30000       1 40.7353 -0.1657  7.00632
386  20121122  30000       2 40.6287 -0.1133  4.83684
769  20121122  60000       1 40.7353 -0.1657  6.55283
770  20121122  60000       2 40.6287 -0.1133  4.85467
1153 20121122  90000       1 40.7353 -0.1657  6.35216
1154 20121122  90000       2 40.6287 -0.1133  5.66342
1537 20121122 120000       1 40.7353 -0.1657 11.47750
1538 20121122 120000       2 40.6287 -0.1133 10.30310
1921 20121122 150000       1 40.7353 -0.1657 13.87090
1922 20121122 150000       2 40.6287 -0.1133 11.90640
2305 20121122 180000       1 40.7353 -0.1657 10.30840
2306 20121122 180000       2 40.6287 -0.1133  7.61322
2689 20121122 210000       1 40.7353 -0.1657  6.29745
2690 20121122 210000       2 40.6287 -0.1133  6.63173
3073 20121123 000000       1 40.7353 -0.1657  4.78633
3074 20121123 000000       2 40.6287 -0.1133  5.31070
3457 20121123  30000       1 40.7353 -0.1657  6.84001
3458 20121123  30000       2 40.6287 -0.1133  6.88369
3841 20121123  60000       1 40.7353 -0.1657  5.71790

I am surely missing something; could you help me? Any idea or hint would be appreciated.

Thanks

The data files are available here:

EDIT: Following the answer from @A.R, I tried this code to select the data, but I am not sure it is the right way.

for(i in 1:128) {
  for(j in 1:2) {
    a=sqrt((place$lat[i]-datos$lat[j])^2+(place$lon[i]-datos$lon[j])^2)
    n=which.min(a)
    while(n <= 9344) {
      b=cbind(i,n,datos$tempc[n],place$Zona[i])
      n=n+128
    }
  }
}

and I get:

> b
       i    n           
[1,] 128 9217 10.1198 30

It only gives the result for the last value of i; I would like to save all of them. I'm sure this is basic, but I can't figure it out. Please be patient, as I'm not an experienced R user. Thanks again.

Recommended answer

First you need to round the decimals of the lon column in places to 4 digits. This is probably the reason you are having problems:

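places=read.table("places.dat",header=T)
# round only the lon column; assigning to places itself would replace the data frame with a vector
places$lon=round(places$lon,digits=4)

# element-wise comparison: places is recycled over datos, so this relies on
# datos listing the stations in the same order as places
datos[which((datos$lat==places$lat & datos$lon==places$lon) & places$Zona==1),]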

The result for this condition is a total of 146 points.
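The element-wise comparison above depends on the rows of datos following the order of places. As a minimal sketch of an order-independent alternative, the rounded coordinates can be turned into a text key and matched with %in%. The toy rows are copied from the head() output in the question, and coord_key is a hypothetical helper name; the real files may hold more digits than head() prints, so treat this as an illustration under those assumptions.

```r
# Toy data copied from the head() output in the question (not the full files)
places <- data.frame(
  Zona  = c(1, 1, 1, 2, 2, 2),
  Poble = c("Zorita", "Morella", "Forcall", "Benasal", "Cati", "Fredes"),
  lat   = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon   = c(-0.165748, -0.113284, -0.209759, -0.126111, 0.060409, 0.167981)
)
datos <- data.frame(
  stat_id = 1:6,
  lat     = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon     = c(-0.1657, -0.1133, -0.2098, -0.1261, 0.0604, 0.1680),
  tempc   = c(7.98737, 6.49903, 7.72955, 7.98837, 10.35480, 6.00769)
)

# Hypothetical helper: text key from coordinates formatted to 4 decimals,
# so no raw floating-point values are compared with ==
coord_key <- function(lat, lon) paste(sprintf("%.4f", lat), sprintf("%.4f", lon))

# Keep the rows of datos whose key appears among the keys of the Zona == 1 places
zona1 <- places[places$Zona == 1, ]
sel <- datos[coord_key(datos$lat, datos$lon) %in% coord_key(zona1$lat, zona1$lon), ]
print(sel)
```

Unlike ==, %in% tests membership, so it does not matter how many rows each data frame has or in which order the stations appear.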

Edit 1 (following a comment by Sean)

I assumed in my answer that, in places, the lat was rounded and the lon was not.

But as Sean pointed out, comparing floats is not a good idea. It is better to calculate the distance between each point in places and each point in datos, and take the one with the smallest distance as the match, provided that distance is below a chosen threshold (e.g. half of the distance between the points in datos).
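A small sketch of why exact equality is fragile and how a tolerance helps. The two lon values are the Zorita coordinates as printed in the question; near and its tol default are hypothetical choices made for this illustration.

```r
# Exact equality on decimals can fail even when the values look equal
exact <- (0.1 + 0.2) == 0.3

# Hypothetical helper: compare with a tolerance instead of ==
near <- function(x, y, tol = 1e-4) abs(x - y) < tol

places_lon <- -0.165748  # lon of Zorita in places
datos_lon  <- -0.1657    # lon of the same station as printed for data

print(exact)
print(places_lon == datos_lon)
print(near(places_lon, datos_lon))
```

A tolerance of 1e-4 is enough here because rounding to 4 decimals shifts a value by at most 5e-5.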

Edit 2

Try something like this:

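# For each place, find the row of data closest in lat/lon
n.places=dim(places)[1]
a=numeric(dim(data)[1])   # distances from the current place to every row of data
data.p=integer(n.places)  # index of the nearest row of data for each place
n=numeric(n.places)       # distance to that nearest row
for(i in 1:n.places) {
  for(j in 1:dim(data)[1]) {
    a[j]=sqrt((places$lat[i]-data$lat[j])^2+(places$lon[i]-data$lon[j])^2)
  }
  data.p[i]=which.min(a)
  n[i]=min(a)
}
# One row per place: place index, matched data row, distance, temperature, zone
b=cbind(places=1:n.places,data=data.p,distance=n,tempc=data$tempc[data.p],Zona=places$Zona)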

Then run some queries:

b[which(b[,3]<1),]
b[which(b[,3]<0.00001),]
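The nearest-point search can also be written without explicit loops. The following is a sketch only, using outer() to build the full distance matrix on the toy rows from the question; stations, matched and the max_dist threshold are names and values chosen for this illustration, not part of the original data.

```r
# Toy data copied from the head() output in the question (not the full files)
places <- data.frame(
  Zona  = c(1, 1, 1, 2, 2, 2),
  Poble = c("Zorita", "Morella", "Forcall", "Benasal", "Cati", "Fredes"),
  lat   = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon   = c(-0.165748, -0.113284, -0.209759, -0.126111, 0.060409, 0.167981)
)
stations <- data.frame(
  stat_id = 1:6,
  lat     = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon     = c(-0.1657, -0.1133, -0.2098, -0.1261, 0.0604, 0.1680)
)

# Distance matrix: one row per place, one column per station
d <- sqrt(outer(places$lat, stations$lat, "-")^2 +
          outer(places$lon, stations$lon, "-")^2)

# Column index of the closest station for each place, and that distance
nearest  <- apply(d, 1, which.min)
distance <- d[cbind(seq_len(nrow(places)), nearest)]

# Hypothetical threshold: discard matches that are too far away
max_dist <- 1e-3
matched <- data.frame(Poble = places$Poble, Zona = places$Zona,
                      stat_id = stations$stat_id[nearest], distance = distance)
matched <- matched[matched$distance < max_dist, ]
print(matched)
```

With the stat_id of each place known, all time steps for one zone could then be selected from the full data with something like data[data$stat_id %in% matched$stat_id[matched$Zona == 1], ], which avoids comparing coordinates row by row.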
