从数据库中消除重复的城市 [英] Eliminate duplicate cities from database

查看:190
本文介绍了从数据库中消除重复的城市的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述



超过5300个重复行:

 id 纬度经度国家地区城市
2143220,41.3513889,68.9444444,KZ,10,Abay
214321840.899166768.5433333KZ10Abay
1919381,33.8166667,49.6333333,IR,34, Ab Barik
1919377,35.6833333,50.1833333,IR,19,Ab Barik
1919432,29.55,55.5122222 ,29,Abbasabad,
1919430,27.4263889,57.5725,IR,29,Abbasabad
1919413,28.0011111 阿巴斯巴达
1919435,36.5641667,61.14,IR,30,Abbasabad
1919433 ,31.8988889,58.9211111,IR,30,Abbasabad
1919422,33.8666667,48.3,IR,23,Abbasabad b $ b1919420,33.4658333,49.6219444,IR,23,Abbasabad
1919438,33.5333333,49.9833333,IR ,阿巴斯巴德
1919423,33.7619444,49.0747222,IR,24,Abbasabad
1919419,34.2833333,49.2333333 IR 19,`Abbasabad
1919439,35.8833333,52.15,IR,35,Abbasabad
1919417,35.9333333,52.95,IR 17,Abbasabad
1919427,35.7341667,51.4377778,IR,26,Abbasabad
1919425,35.1386111,51.6283333 ,IR,26,Abbasabad
1919713,30.3705556,56.07,IR,29,Abdolabad
1919711 27.983333357.7244444IR29Abdolabad
1919716,35.6025,59.2322222,IR,30,Abdolabad
1919714,34.2197222,56.5447222,IR,30,Abdolabad

其他详细信息:


  • PostgreSQL 8.4数据库

  • Linux


  • 问题



    有些值显然是重复的(Abay,因为这些区域匹配并且Ab Barik 因为这两个地点都在这样的距离之内),其他人并不那么明显(甚至可能不是真正的重复):

     1919430,27.4263889,57.5725,IR,29,Abbasabad
    1919435,36.5641 667,61.14,IR,30,Abbasabad

    消除所有重复。



    问题



    给定一个表格,例如上面的CSV数据: / p>


    • 您如何消除重复?

    • 您会使用哪种以地理为中心的PostgreSQL功能? li>
    • 您会使用哪些其他标准来调低重复项?



    更新



    半工作示例代码,用于在同一个国家内选择接近(10公里内)的重复城市名称:

      select 
    c1.country,c1.name,c1.region_id,c2.region_id,c1.latitude_decimal,c1.longitude_decimal,c2.latitude_decimal,c2.longitude_decimal
    from
    climate.maxmind_city c1,
    climate.maxmind_city c2
    where
    c1.country ='BE'and
    c1.id< c2.id和
    c1.country = c2.country和
    c1.name = c2.name和
    (c1.latitude_decimal .latitude_decimal或c1.longitude_decimal <> ; c2.longitude_decimal)和
    earth_distance(
    ll_to_earth(c1.latitude_decimal,c1.longitude_decimal),
    ll_to_earth(c2.latitude_decimal,c2.longitude_decimal))<= 10
    order by
    country,name



    想法



    两阶段方法:


    1. 删除min(id )。

    2. 消除那些名字和国家相近的人之间的距离。这可能会删除一些合法的城市,但几乎没有任何后果。

    谢谢!



    <$ p $ =h2_lin>解决方案

    这会删除邻近同一个国家内同一城市的第二个城市: (
    选择
    最大(c1.id)

    climate.maxmind_city c1,$ b $从climate.maxmind_city mc中删除b climate.maxmind_city c2
    其中
    c1.id<> c2.id和
    c1.country = c2.country和
    c1.name = c2.name和
    earth_distance(
    ll_to_earth(c1.latitude_decimal,c1.longitude_decimal),
    ll_to_earth(c2.latitude_decimal,c2.longitude_decimal))<= 35
    group by
    c1 .country,c1.name
    order by
    c1.country,c1.name


    Background

    Over 5300 duplicate rows:

    "id","latitude","longitude","country","region","city"
    "2143220","41.3513889","68.9444444","KZ","10","Abay"
    "2143218","40.8991667","68.5433333","KZ","10","Abay"
    "1919381","33.8166667","49.6333333","IR","34","Ab Barik"
    "1919377","35.6833333","50.1833333","IR","19","Ab Barik"
    "1919432","29.55","55.5122222","IR","29","`Abbasabad"
    "1919430","27.4263889","57.5725","IR","29","`Abbasabad"
    "1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
    "1919435","36.5641667","61.14","IR","30","`Abbasabad"
    "1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
    "1919422","33.8666667","48.3","IR","23","`Abbasabad"
    "1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
    "1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
    "1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
    "1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
    "1919439","35.8833333","52.15","IR","35","`Abbasabad"
    "1919417","35.9333333","52.95","IR","17","`Abbasabad"
    "1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
    "1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
    "1919713","30.3705556","56.07","IR","29","`Abdolabad"
    "1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
    "1919716","35.6025","59.2322222","IR","30","`Abdolabad"
    "1919714","34.2197222","56.5447222","IR","30","`Abdolabad"
    

    Additional details:

    • PostgreSQL 8.4 Database
    • Linux

    Problem

    Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):

    "1919430","27.4263889","57.5725","IR","29","`Abbasabad"
    "1919435","36.5641667","61.14","IR","30","`Abbasabad"
    

    The goal is to eliminate all duplicates.

    Questions

    Given a table of values such as the above CSV data:

    • How would you eliminate duplicates?
    • What geo-centric PostgreSQL functions would you use?
    • What other criteria would you use to wheedle down the duplicates?

    Update

    Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):

    select
      c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
    from
      climate.maxmind_city c1,
      climate.maxmind_city c2
    where
      c1.country = 'BE' and
      c1.id <> c2.id and
      c1.country = c2.country and
      c1.name = c2.name and
      (c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
      earth_distance(
        ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
        ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10
    order by
      country, name
    

    Ideas

    Two phase approach:

    1. Eliminate the obvious duplicates (same country, region, and city name) by removing the min(id).
    2. Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.

    Thank you!

    解决方案

    This deletes the second city within close proximity to a city of the same name in the same country:

    delete from climate.maxmind_city mc where id in (
    select
      max(c1.id)
    from
      climate.maxmind_city c1,
      climate.maxmind_city c2
    where
      c1.id <> c2.id and
      c1.country = c2.country and
      c1.name = c2.name and
      earth_distance(
        ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
        ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35
    group by
      c1.country, c1.name
    order by
      c1.country, c1.name
    )
    

    这篇关于从数据库中消除重复的城市的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆