从数据库中消除重复的城市 [英] Eliminate duplicate cities from database
问题描述
超过5300个重复行:
id 纬度经度国家地区城市
2143220,41.3513889,68.9444444,KZ,10,Abay
214321840.899166768.5433333KZ10Abay
1919381,33.8166667,49.6333333,IR,34, Ab Barik
1919377,35.6833333,50.1833333,IR,19,Ab Barik
1919432,29.55,55.5122222 ,29,Abbasabad,
1919430,27.4263889,57.5725,IR,29,Abbasabad
1919413,28.0011111 阿巴斯巴达
1919435,36.5641667,61.14,IR,30,Abbasabad
1919433 ,31.8988889,58.9211111,IR,30,Abbasabad
1919422,33.8666667,48.3,IR,23,Abbasabad b $ b1919420,33.4658333,49.6219444,IR,23,Abbasabad
1919438,33.5333333,49.9833333,IR ,阿巴斯巴德
1919423,33.7619444,49.0747222,IR,24,Abbasabad
1919419,34.2833333,49.2333333 IR 19,`Abbasabad
1919439,35.8833333,52.15,IR,35,Abbasabad
1919417,35.9333333,52.95,IR 17,Abbasabad
1919427,35.7341667,51.4377778,IR,26,Abbasabad
1919425,35.1386111,51.6283333 ,IR,26,Abbasabad
1919713,30.3705556,56.07,IR,29,Abdolabad
1919711 27.983333357.7244444IR29Abdolabad
1919716,35.6025,59.2322222,IR,30,Abdolabad
1919714,34.2197222,56.5447222,IR,30,Abdolabad
其他详细信息: 有些值显然是重复的(Abay,因为这些区域匹配并且Ab Barik 因为这两个地点都在这样的距离之内),其他人并不那么明显(甚至可能不是真正的重复):
问题
1919430,27.4263889,57.5725,IR,29,Abbasabad
1919435,36.5641 667,61.14,IR,30,Abbasabad
消除所有重复。
问题
给定一个表格,例如上面的CSV数据: / p>
- 您如何消除重复?
- 您会使用哪种以地理为中心的PostgreSQL功能? li>
- 您会使用哪些其他标准来调低重复项?
更新
半工作示例代码,用于在同一个国家内选择接近(10公里内)的重复城市名称:
select
c1.country,c1.name,c1.region_id,c2.region_id,c1.latitude_decimal,c1.longitude_decimal,c2.latitude_decimal,c2.longitude_decimal
from
climate.maxmind_city c1,
climate.maxmind_city c2
where
c1.country ='BE'and
c1.id< c2.id和
c1.country = c2.country和
c1.name = c2.name和
(c1.latitude_decimal .latitude_decimal或c1.longitude_decimal <> ; c2.longitude_decimal)和
earth_distance(
ll_to_earth(c1.latitude_decimal,c1.longitude_decimal),
ll_to_earth(c2.latitude_decimal,c2.longitude_decimal))<= 10
order by
country,name
想法
两阶段方法:
- 删除min(id )。
- 消除那些名字和国家相近的人之间的距离。这可能会删除一些合法的城市,但几乎没有任何后果。
谢谢!
<$ p $ =h2_lin>解决方案
这会删除邻近同一个国家内同一城市的第二个城市: (
选择
最大(c1.id)
从
climate.maxmind_city c1,$ b $从climate.maxmind_city mc中删除b climate.maxmind_city c2
其中
c1.id<> c2.id和
c1.country = c2.country和
c1.name = c2.name和
earth_distance(
ll_to_earth(c1.latitude_decimal,c1.longitude_decimal),
ll_to_earth(c2.latitude_decimal,c2.longitude_decimal))<= 35
group by
c1 .country,c1.name
order by
c1.country,c1.name
)
Background
Over 5300 duplicate rows:
"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"
Additional details:
- PostgreSQL 8.4 Database
- Linux
Problem
Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
The goal is to eliminate all duplicates.
Questions
Given a table of values such as the above CSV data:
- How would you eliminate duplicates?
- What geo-centric PostgreSQL functions would you use?
- What other criteria would you use to wheedle down the duplicates?
Update
Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):
select
c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
from
climate.maxmind_city c1,
climate.maxmind_city c2
where
c1.country = 'BE' and
c1.id <> c2.id and
c1.country = c2.country and
c1.name = c2.name and
(c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
earth_distance(
ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10
order by
country, name
Ideas
Two phase approach:
- Eliminate the obvious duplicates (same country, region, and city name) by removing the min(id).
- Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.
Thank you!
This deletes the second city within close proximity to a city of the same name in the same country:
delete from climate.maxmind_city mc where id in (
select
max(c1.id)
from
climate.maxmind_city c1,
climate.maxmind_city c2
where
c1.id <> c2.id and
c1.country = c2.country and
c1.name = c2.name and
earth_distance(
ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35
group by
c1.country, c1.name
order by
c1.country, c1.name
)
这篇关于从数据库中消除重复的城市的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!