从数据库中消除重复的城市 [英] Eliminate duplicate cities from database

查看：190 发布时间：2018/4/23 17:11:16 sql postgresql geolocation country city

本文介绍了从数据库中消除重复的城市的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

超过5300个重复行：

id 纬度经度国家地区城市 2143220，41.3513889，68.9444444，KZ，10，Abay 214321840.899166768.5433333KZ10Abay 1919381，33.8166667，49.6333333，IR，34， Ab Barik 1919377，35.6833333，50.1833333，IR，19，Ab Barik 1919432，29.55，55.5122222 ，29，Abbasabad， 1919430，27.4263889，57.5725，IR，29，Abbasabad 1919413，28.0011111 阿巴斯巴达 1919435，36.5641667，61.14，IR，30，Abbasabad 1919433 ，31.8988889，58.9211111，IR，30，Abbasabad 1919422，33.8666667，48.3，IR，23，Abbasabad b $ b1919420，33.4658333，49.6219444，IR，23，Abbasabad 1919438，33.5333333，49.9833333，IR ，阿巴斯巴德 1919423，33.7619444，49.0747222，IR，24，Abbasabad 1919419，34.2833333，49.2333333 IR 19，`Abbasabad 1919439，35.8833333，52.15，IR，35，Abbasabad 1919417，35.9333333，52.95，IR 17，Abbasabad 1919427，35.7341667，51.4377778，IR，26，Abbasabad 1919425，35.1386111，51.6283333 ，IR，26，Abbasabad 1919713，30.3705556，56.07，IR，29，Abdolabad 1919711 27.983333357.7244444IR29Abdolabad 1919716，35.6025，59.2322222，IR，30，Abdolabad 1919714，34.2197222，56.5447222，IR，30，Abdolabad
其他详细信息：

PostgreSQL 8.4数据库

Linux

问题

有些值显然是重复的（Abay，因为这些区域匹配并且Ab Barik 因为这两个地点都在这样的距离之内），其他人并不那么明显（甚至可能不是真正的重复）：

1919430，27.4263889，57.5725，IR，29，Abbasabad 1919435，36.5641 667，61.14，IR，30，Abbasabad
消除所有重复。

问题

给定一个表格，例如上面的CSV数据： / p>

您如何消除重复？

您会使用哪种以地理为中心的PostgreSQL功能？ li>

您会使用哪些其他标准来调低重复项？

更新

半工作示例代码，用于在同一个国家内选择接近（10公里内）的重复城市名称：

select c1.country，c1.name，c1.region_id，c2.region_id，c1.latitude_decimal，c1.longitude_decimal，c2.latitude_decimal，c2.longitude_decimal from climate.maxmind_city c1， climate.maxmind_city c2 where c1.country ='BE'and c1.id< c2.id和 c1.country = c2.country和 c1.name = c2.name和（c1.latitude_decimal .latitude_decimal或c1.longitude_decimal <> ; c2.longitude_decimal）和 earth_distance（ ll_to_earth（c1.latitude_decimal，c1.longitude_decimal）， ll_to_earth（c2.latitude_decimal，c2.longitude_decimal））<= 10 order by country，name

想法

两阶段方法：

删除min（id ）。

消除那些名字和国家相近的人之间的距离。这可能会删除一些合法的城市，但几乎没有任何后果。

谢谢！

<$ p $ =h2_lin>解决方案
这会删除邻近同一个国家内同一城市的第二个城市：（
选择
最大（c1.id）
从
climate.maxmind_city c1，$ b $从climate.maxmind_city mc中删除b climate.maxmind_city c2
其中
c1.id<> c2.id和
c1.country = c2.country和
c1.name = c2.name和
earth_distance（
ll_to_earth（c1.latitude_decimal，c1.longitude_decimal），
ll_to_earth（c2.latitude_decimal，c2.longitude_decimal））<= 35
group by
c1 .country，c1.name
order by
c1.country，c1.name
）

Background

Over 5300 duplicate rows:
"id","latitude","longitude","country","region","city" "2143220","41.3513889","68.9444444","KZ","10","Abay" "2143218","40.8991667","68.5433333","KZ","10","Abay" "1919381","33.8166667","49.6333333","IR","34","Ab Barik" "1919377","35.6833333","50.1833333","IR","19","Ab Barik" "1919432","29.55","55.5122222","IR","29","`Abbasabad" "1919430","27.4263889","57.5725","IR","29","`Abbasabad" "1919413","28.0011111","58.9005556","IR","12","`Abbasabad" "1919435","36.5641667","61.14","IR","30","`Abbasabad" "1919433","31.8988889","58.9211111","IR","30","`Abbasabad" "1919422","33.8666667","48.3","IR","23","`Abbasabad" "1919420","33.4658333","49.6219444","IR","23","`Abbasabad" "1919438","33.5333333","49.9833333","IR","34","`Abbasabad" "1919423","33.7619444","49.0747222","IR","24","`Abbasabad" "1919419","34.2833333","49.2333333","IR","19","`Abbasabad" "1919439","35.8833333","52.15","IR","35","`Abbasabad" "1919417","35.9333333","52.95","IR","17","`Abbasabad" "1919427","35.7341667","51.4377778","IR","26","`Abbasabad" "1919425","35.1386111","51.6283333","IR","26","`Abbasabad" "1919713","30.3705556","56.07","IR","29","`Abdolabad" "1919711","27.9833333","57.7244444","IR","29","`Abdolabad" "1919716","35.6025","59.2322222","IR","30","`Abdolabad" "1919714","34.2197222","56.5447222","IR","30","`Abdolabad"
Additional details:

PostgreSQL 8.4 Database

Linux

Problem

Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):
"1919430","27.4263889","57.5725","IR","29","`Abbasabad" "1919435","36.5641667","61.14","IR","30","`Abbasabad"
The goal is to eliminate all duplicates.

Questions

Given a table of values such as the above CSV data:

How would you eliminate duplicates?

What geo-centric PostgreSQL functions would you use?

What other criteria would you use to wheedle down the duplicates?

Update

Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):
select c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal from climate.maxmind_city c1, climate.maxmind_city c2 where c1.country = 'BE' and c1.id <> c2.id and c1.country = c2.country and c1.name = c2.name and (c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and earth_distance( ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ), ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10 order by country, name

Ideas

Two phase approach:

Eliminate the obvious duplicates (same country, region, and city name) by removing the min(id).

Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.

Thank you!
解决方案
This deletes the second city within close proximity to a city of the same name in the same country:
delete from climate.maxmind_city mc where id in ( select max(c1.id) from climate.maxmind_city c1, climate.maxmind_city c2 where c1.id <> c2.id and c1.country = c2.country and c1.name = c2.name and earth_distance( ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ), ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35 group by c1.country, c1.name order by c1.country, c1.name )

这篇关于从数据库中消除重复的城市的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

从数据库中消除重复的城市 [英] Eliminate duplicate cities from database

问题描述

问题

问题

更新

想法

Background

Problem

Questions

Update

Ideas

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

从数据库中消除重复的城市 [英] Eliminate duplicate cities from database

问题描述

问题

问题

更新

想法

Background

Problem

Questions

Update

Ideas

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭