How do I match similar coordinates using Python?

Question

Background:

I have been given four catalogues of data, the first of which (let's call Cat1) gives the coordinates (in right ascension and declination, RA and Dec) for radio sources in fields 1 and 2, the second catalogue (Cat2) gives the RA and Dec for radio sources and infrared (IR) sources in field 1, the third catalogue (Cat3) gives the RA and Dec for radio and IR sources in field 2, and the fourth catalogue (Cat4) gives the RA and Dec for optical sources in fields 1 and 2.

Cat1 has approximately 2000 sources for field 2, keeping in mind that some of the sources are actually measured multiple times across their dimensions, e.g. source 1, source 2, source 3a, source 3b, source 3c, source 4... Cat1 has approximately 3000 sources for field 1, again with some sources being in parts. Cat1 is a .dat file, which I am opening in TextEdit and have converted to .txt for use with np.genfromtxt.

Cat2 has approximately 1700 sources for field 1. Cat3 has approximately 1700 sources for field 2. Cat2 and Cat3 are .csv files, which I am opening in Numbers.

Cat4 has approximately 1200 sources for field 1, and approximately 700 sources for field 2. Cat4 is a .dat file, which I am opening in TextEdit and have converted to .txt for use with np.genfromtxt.

I have also figured out how to convert Cat1 and Cat4 to .csv files.

Goal:

The goal is to combine these four catalogues into one single catalogue that gives the RA and Dec from Cat2, Cat1 and Cat4 (field 1), then the RA and Dec from Cat3, Cat1 and Cat4 (field 2), such that the RA and Dec from Cat1 and Cat4 are those closest to the RA and Dec from Cat2 or Cat3, so it can be said that they are highly likely to be the same source. The level of overlap will vary, but I have produced scatterplots of the data that show there is a corresponding Cat1 and Cat4 source for each Cat2 and Cat3 source, within the accuracy of the size of the plot marker, with of course many leftover sources in Cat1 and Cat4, which contain many more sources than do Cat2 and Cat3.

The trick is that because the coordinates do not exactly match, I need to be able to first look at RA and find the best matching value, then look at the corresponding Dec, and check that it is also the best corresponding value.

e.g., For a source in Cat2: RA = 53.13360595, Dec = -28.0530758

Cat1: RA = 53.133496, Dec = -27.553401 or RA = 53.133873, Dec = -28.054600

Here, 53.1336 sits roughly equally between 53.1334 and 53.1338, but clearly -28.053 is closer to -28.054 than to -27.553, so the second option in Cat1 is the winner.
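
This closest-pair test can be automated. As a minimal sketch (not from the original question), assuming astropy is available and the catalogues are plain RA/Dec arrays in degrees, SkyCoord.match_to_catalog_sky gives, for each source, the index of and on-sky separation to its nearest neighbour in another catalogue:

import astropy.units as u
from astropy.coordinates import SkyCoord

# The worked example above: one Cat2 source against two Cat1 candidates.
cat2_src = SkyCoord(ra=[53.13360595] * u.deg, dec=[-28.0530758] * u.deg)
cat1_srcs = SkyCoord(ra=[53.133496, 53.133873] * u.deg,
                     dec=[-27.553401, -28.054600] * u.deg)

# For each Cat2 source: index of the closest Cat1 source and the separation to it.
idx, sep2d, _ = cat2_src.match_to_catalog_sky(cat1_srcs)
print(idx[0], sep2d.arcsec[0])   # index 1 -> the second Cat1 entry is the closer match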

Progress:

So far I have matched the first 15 values in Cat2 to values in Cat1 purely by eye (command+F to as many decimal places as possible, then using best judgement), but clearly this is extremely inefficient for all 3400 sources across Cat2 and Cat3. I just wanted to see for myself what sort of accuracy to expect in the matching, and unfortunately some sources match only to the second or third decimal place, while others match to the fourth or fifth.

In producing my scatterplots, I used the code:

import numpy as np

cat1 = np.genfromtxt('filepath/cat1.txt', delimiter='   ')
RA_cat1 = cat1[:, 0]    # first column: right ascension
Dec_cat1 = cat1[:, 1]   # second column: declination

Then simply plotted RA_cat1 against Dec_cat1, and repeated for all my catalogues.
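
For reference, a minimal sketch of that plotting step (assuming matplotlib, and reusing the variable names from the snippet above):

import matplotlib.pyplot as plt

plt.scatter(RA_cat1, Dec_cat1, s=2)
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title('Cat1')
plt.show()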

My problem now is that in searching for answers on how to produce a code that will be able to match my coordinates, I have seen a lot of answers that involve converting the arrays to lists, but when attempting to do so using

# Note: tolist() returns a new list rather than converting in place.
cat1list = np.array([RA_cat1, Dec_cat1]).tolist()

I end up with a list of the form:

[RA1, RA2, RA3,...,RA3000], [Dec1, Dec2, Dec3,...,Dec3000]

instead of what I assume would be the more helpful:

[RA1, Dec1], [RA2, Dec2], ... , [RA3000, Dec3000].
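
If that pairwise shape is what is needed, the two columns can be paired directly; a minimal sketch, reusing the arrays loaded earlier:

# Pair each RA with its corresponding Dec before converting to a list.
cat1list = np.column_stack((RA_cat1, Dec_cat1)).tolist()
# cat1list[0] -> [RA1, Dec1]

# Or, without numpy:
# cat1list = list(zip(RA_cat1, Dec_cat1))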

Furthermore, for similar questions, the most useful answers once the list conversion has been successful appear to be to use dictionaries, but I am unclear on how to use a dictionary to produce the sorts of comparisons that I described above.
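
For what it is worth, one common dictionary-based pattern (a sketch only, and not necessarily what those answers had in mind) is to bucket sources by their coordinates rounded to a fixed number of decimal places, so that candidate matches can be looked up without scanning the whole catalogue:

from collections import defaultdict

def bucket_key(ra, dec, places=2):
    # ~0.01 deg cells; a real matcher would also check the neighbouring cells.
    return (round(ra, places), round(dec, places))

cat1_buckets = defaultdict(list)
for ra, dec in zip(RA_cat1, Dec_cat1):
    cat1_buckets[bucket_key(ra, dec)].append((ra, dec))

# Candidate Cat1 matches for a Cat2 source (ra2, dec2):
# matches = cat1_buckets.get(bucket_key(ra2, dec2), [])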

Additionally, I should mention that once I have been successful in this task, I have been asked to repeat the process for a considerably larger set of data; I'm not sure how much larger, but I am assuming perhaps tens of thousands of coordinate sets.

Answer

For the amount of data you have, you can calculate a distance metric between each pair of points. Something like:

import itertools

def close_enough(p1, p2):
    # You may need to scale the RA difference with dec.
    return ((p1.RA - p2.RA)**2 + (p1.Dec - p2.Dec)**2) < 0.01

candidates = [(p1, p2) for p1, p2 in itertools.combinations(points, 2)
              if close_enough(p1, p2)]
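
Here points is assumed to be a combined sequence of objects with .RA and .Dec attributes; a hedged sketch of how it might be built from the arrays loaded earlier (RA_cat2 and Dec_cat2 are assumed to have been read in the same way as RA_cat1 and Dec_cat1):

from collections import namedtuple

Point = namedtuple('Point', 'RA Dec')

# Combine the field-1 catalogues into one pool of points to compare pairwise.
points = ([Point(ra, dec) for ra, dec in zip(RA_cat1, Dec_cat1)]
          + [Point(ra, dec) for ra, dec in zip(RA_cat2, Dec_cat2)])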

For a large data set you may want to use a line sweep algorithm to only calculate the metric for points that are in the same neighborhood. Like this:

import operator as op
import sortedcontainers     # handy library on PyPI

from collections import namedtuple
from math import cos, radians
from random import sample, uniform

Observation = namedtuple("Observation", "dec ra other")

Generate some test data:

number_of_observations = 5000
field1 = [Observation(uniform(-25.0, -35.0),     # dec
                      uniform(45.0, 55.0),       # ra
                      uniform(0, 10))            # other data
          for _ in range(number_of_observations)]

# add in near duplicates
number_of_dups = 1000
dups = []
for obs in sample(field1, number_of_dups):
    dDec = uniform(-0.0001, 0.0001)
    dRA  = uniform(-0.0001, 0.0001)
    dups.append(Observation(obs.dec + dDec, obs.ra + dRA, obs.other))

data = field1 + dups

Here is the algorithm:

# Note: dec is first in Observation, so data is sorted by .dec then .ra.
data.sort()

# Parameter that determines the size of a sliding declination window
# and therefore how close two observations need to be to be considered
# observations of the same object.
dec_span = 0.0001

# Result. A list of observation pairs close enough to be considered 
# observations of the same object.
candidates = []

# Sliding declination window.  Within the window, observations are
# ordered by .ra.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('ra'))

# lag_obs is the 'southernmost' observation within the sliding declination window.
observation = iter(data)
lag_obs = next(observation)

# lead_obs is the 'northernmost' observation in the sliding declination window.
for lead_obs in data:

    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_obs)

    # Remove observations further than the trailing edge of window.
    while lead_obs.dec - lag_obs.dec > dec_span:
        window.discard(lag_obs)
        lag_obs = next(observation)

    # Calculate 'east-west' width of window_size at dec of lead_obs
    ra_span = dec_span / cos(radians(lead_obs.dec))
    east_ra = lead_obs.ra + ra_span
    west_ra = lead_obs.ra - ra_span

    # Check all observations in the sliding window within
    # ra_span of lead_obs.
    for other_obs in window.irange_key(west_ra, east_ra):

        if other_obs != lead_obs:
            # lead_obs is at the top center of a box 2 * ra_span wide by
            # dec_span tall.  other_obs is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.
            # For example:
            #    average_dec = (other_obs.dec + lead_obs.dec) / 2
            #    delta_dec = other_obs.dec - lead_obs.dec
            #    delta_ra  = (other_obs.ra - lead_obs.ra) * cos(radians(average_dec))
            # e.g. if delta_dec**2 + delta_ra**2 < threshold:
            candidates.append((lead_obs, other_obs))
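
A quick sanity check of what the sweep produced (a sketch; the count will vary with the random test data):

# Each entry in candidates is a (lead_obs, other_obs) pair of Observation
# namedtuples judged close enough to be the same object.
print(len(candidates), "candidate pairs found")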

On my laptop, it finds the close points in under a tenth of a second.
