Closest value in different files, with different numbers of lines and other conditions (bash, awk, other)


Question

I have to revive an old question (http://stackoverflow.com/questions/29827630/match-closest-value-from-two-different-files-and-print-specific-columns) with a modification for long files.

I have the ages of two stars in two files (File1 and File2). The age of the star is in column $1, and the remaining columns, up to $13, hold information that I need to print at the end.

I am trying to find an age at which the two stars have the same age or the closest age. Since the files are large (~25000 lines), I don't want to search the whole array, for speed reasons. Also, the files could differ a lot in number of lines (let's say by ~10000 in some cases).

I am not sure if this is the best way to solve the problem, but for lack of a better one, this is my idea. (If you have a faster and more efficient method, please share it.)

All the values have 12 decimals of precision. For now I am only concerned with the first column (where the age is).

And I need different loops.

Let's use this value from File1:

2.326062371284e+05

First, the routine should search File2 for all the matches that contain

2.3260e+05

(This loop will probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261, it will save some time.)

If it finds just one, then the output should be that value.

Usually it will find several lines, maybe even up to 1000. If this is the case, it should search again against

2.32606e+05

among the lines found before. (It is a nested loop, I think.) Then the number of matches will decrease to ~200.

At that point, the routine should search for

2.326062371284e+05

among all 200 lines.
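The staged prefix search described above may not be needed if the ages in File2 are sorted in increasing order, as they are in the sample data. Under that assumption, a binary search finds the closest value directly without scanning the whole array. The sketch below is only an illustration in base R; `nearest_index` is a hypothetical helper name and is not part of the original post.

```r
# Hypothetical helper (not from the original post): for each element of `x`,
# return the index of the closest value in the sorted vector `ref`.
nearest_index <- function(x, ref) {
  stopifnot(length(ref) >= 1, !is.unsorted(ref))  # assumption: ref is sorted
  lo <- findInterval(x, ref)         # binary search: largest i with ref[i] <= x (0 if none)
  lo <- pmax(lo, 1L)                 # clamp to the first element
  hi <- pmin(lo + 1L, length(ref))   # neighbour just above, clamped to the last element
  ifelse(abs(ref[hi] - x) < abs(x - ref[lo]), hi, lo)
}

# Ages from the question's File2 (first column only)
file2_ages <- c(2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
                2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
                3.054865681626e+05)

# Look up the File1 value used in the walkthrough above
idx <- nearest_index(2.326062371284e+05, file2_ages)
print(format(file2_ages[idx], digits = 13))
```

Only the two neighbours around the insertion point are compared, so the cost per lookup grows with the logarithm of the file length rather than with the number of lines.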

So, given these files:

File1

1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1

File2

2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2

Output File3 (with tolerance 3000)

2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

Important condition:

The output shouldn't contain repeated lines (at a fixed age, star 1 can't be paired with different ages of star 2, just the closest one).

How would you solve it?

Thanks a lot!

PS: I've completely changed the question, since it was shown to me that my reasoning had some errors. Thanks!

Recommended answer

Not an awk solution; there comes a time when other solutions are great too, so here is an answer using R.

New answer with different data, not reading from a file this time, to build an example:

# Sample data for the example; use fread to read from a file and setnames to name the columns accordingly
library(data.table) # install.packages('data.table') if it is not present
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# Get the result
result=data[ # To get the whole data from file 1 and file 2 at the end
         data2[
           data, # Search for each star of list 1
           .SD, # Return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find the nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # Filter duplicates in the first file and on the difference
# Write the results to a file (change the separator as desired):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")


Code:

# A nice package to have; install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"

# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')

# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]

# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))

# Final filter on difference
result[abs(age.1 - age) < 3e3]

So the interesting part is the first 'join' on the two lists of star ages, which searches, for each age in stars1, the nearest one in stars2.

This gives (after column renaming):

> result
        age    age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2

Now that we have the nearest for each, filter those that are close enough (absolute difference below 3000 here):

> result[abs(age.1 - age) < 3e3]
        age    age.1
1: 221072.9 221072.9
2: 232606.2 235489.6
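As a cross-check of the join and the filter above, the same pairs can be computed with a brute-force search in base R. This is only a sketch for verifying small inputs, not the answer's method: it scans every value of stars2 for each value of stars1, which is exactly the full-array search the question wants to avoid on long files. Here `stars1` and `stars2` are plain numeric vectors rather than data tables, and `check`, `kept`, and `tolerance` are names introduced for the illustration.

```r
# Ages from the question's File1 and File2 (first column only)
stars1 <- c(1.833800650355e+05, 1.959443501406e+05, 2.085086352458e+05,
            2.210729203510e+05, 2.326062371284e+05, 2.441395539059e+05,
            2.556728706833e+05)
stars2 <- c(2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
            2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
            3.054865681626e+05)

# For each age in stars1, scan all of stars2 and keep the closest value
nearest <- vapply(stars1,
                  function(a) stars2[which.min(abs(stars2 - a))],
                  numeric(1))

# Same column layout as the renamed join result: age (stars1), age.1 (stars2)
check <- data.frame(age = stars1, age.1 = nearest)

# Keep only the pairs whose absolute difference is below the tolerance
tolerance <- 3e3
kept <- check[abs(check$age.1 - check$age) < tolerance, ]
print(format(kept, digits = 13))
```

On the sample data this keeps the same two pairs as the filtered result shown above.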
