我如何才能找到数据点集群的中心? [英] How can I find the center of a cluster of data points?

查看:654
本文介绍了我如何才能找到数据点集群的中心?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

比方说,我每天都绘制直升机的位置,在过去的一年,并提出了如下图:

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map:

任何人看,这将能够告诉我,这架直升机是出自芝加哥的。

Any human looking at this would be able to tell me that this helicopter is based out of Chicago.

我如何能找到相同的结果code?

How can I find the same result in code?

我在寻找这样的事情:

$geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]);
function findHome($geoCodeArray) {
    // magic
    return $geoCode;
}

最终产生这样的:

Ultimately generating something like this:

更新:样本数据集

下面是一个样本数据集的映射:<一href="http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179">http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179

Here's a map with a sample dataset: http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179

下面是150个地理codeS一个引擎收录: http://pastebin.com/grVsbgL9

Here's a pastebin of 150 geocodes: http://pastebin.com/grVsbgL9

以上含150 GEO codeS。第50个是在邻近芝加哥几簇。其余分散在全国各地,包括一些小群在纽约,洛杉矶和旧金山。

The above contains 150 geocodes. The first 50 are in a few clusters close to Chicago. The remaining are scattered throughout the country, including some small clusters in New York, Los Angeles, and San Francisco.

我有这样的一百万(严重)的数据集,我会需要遍历并找出最有可能的家。你的帮助是极大的AP preciated。

I have about a million (seriously) datasets like this that I'll need to iterate through and identify the most likely "home". Your help is greatly appreciated.

更新2:飞机切换到直升机

飞机概念正吸引太多的注意力转向物理机场。该坐标可以在世界任何地方,不只是机场。让我们假设这是一个超级直升机不受物理,燃料,或其他任何东西束缚。它可以降落的地方就是了。 ;)

The airplane concept was drawing too much attention toward physical airports. The coordinates can be anywhere in the world, not just airports. Let's assume it's a super helicopter not bound by physics, fuel, or anything else. It can land where it wants. ;)

推荐答案

即使点都分散在整个地球上,通过将经纬度直角坐标系下的解决方案的工作原理。它确实是一种KDE(核密度估计),但在第一次通过内核的总和仅在数据点进行评价。内核应选择适合的问题。在它下面的code是什么,我会开玩笑地/ presumptuously叫Trossian,即2D²/H²为d≤h和H²/D²为D> H(其中d为欧氏距离和h是带宽 $ global_kernel_radius ),但它也可能是一个高斯(E -d²/2h²),一个Epanechnikov内核(1- D²/H²为D&所述; H,否则为0),或另一个内核。可选的第二通,或者通过求和在本地网格的独立内核,或者通过计算的质心,在这两种情况下在由 $ local_grid_radius A包围本地定义细化搜索。

The following solution works even if the points are scattered all over the Earth, by converting latitude and longitude to Cartesian coordinates. It does a kind of KDE (kernel density estimation), but in a first pass the sum of kernels is evaluated only at the data points. The kernel should be chosen to fit the problem. In the code below it is what I could jokingly/presumptuously call a Trossian, i.e., 2-d²/h² for d≤h and h²/d² for d>h (where d is the Euclidean distance and h is the "bandwidth" $global_kernel_radius), but it could also be a Gaussian (e-d²/2h²), an Epanechnikov kernel (1-d²/h² for d<h, 0 otherwise), or another kernel. An optional second pass refines the search locally, either by summing an independent kernel on a local grid, or by calculating the centroid, in both cases in a surrounding defined by $local_grid_radius.

<强>实质上下,每个点总和的所有点已周围(包括其本身),更称量他们是否更接近(由钟形曲线),并且还通过可选的权重阵列对其称重 $ w_arr 。获胜者是与最高金额的地步。一旦获胜者已经发现,家,我们正在寻找可以通过在本地重复同样的过程中周围的获胜者(使用其他的钟形曲线)中找到,也可以将其估计为所有点的重心从赢家,其中该半径可以是零的给定半径内的

In essence, each point sums all the points it has around (including itself), weighing them more if they are closer (by the bell curve), and also weighing them by the optional weight array $w_arr. The winner is the point with the maximum sum. Once the winner has been found, the "home" we are looking for can be found by repeating the same process locally around the winner (using another bell curve), or it can be estimated to be the "center of mass" of all points within a given radius from the winner, where the radius can be zero.

该算法必须通过选择适当的内核,通过选择如何在本地细化搜索,并通过调整参数进行调整的问题。为示例的数据集,所述Trossian内核为第一遍和Epanechnikov内核用于第二遍,与所有3半径设定为30英里和1英里的栅格步骤可以是一个很好的起点,但前提是两个子集群芝加哥应该被看作是一个大的集群。否则较小半径必须选择

The algorithm must be adapted to the problem by choosing the appropriate kernels, by choosing how to refine the search locally, and by tuning the parameters. For the example dataset, the Trossian kernel for the first pass and the Epanechnikov kernel for the second pass, with all 3 radii set to 30 mi and a grid step of 1 mi could be a good starting point, but only if the two sub-clusters of Chicago should be seen as one big cluster. Otherwise smaller radii must be chosen.

function find_home($lat_arr, $lng_arr, $global_kernel_radius,
                                       $local_kernel_radius,
                                       $local_grid_radius, // 0 for no 2nd pass
                                       $local_grid_step,   // 0 for centroid
                                       $units='mi',
                                       $w_arr=null)
{
   // for lat,lng <-> x,y,z see http://en.wikipedia.org/wiki/Geodetic_datum
   // for K and h see http://en.wikipedia.org/wiki/Kernel_density_estimation

   switch (strtolower($units)) {
      /*  */case 'nm' :
      /*or*/case 'nmi': $m_divisor = 1852;
      break;case  'mi': $m_divisor = 1609.344;
      break;case  'km': $m_divisor = 1000;
      break;case   'm': $m_divisor = 1;
      break;default: return false;
   }
   $a  = 6378137 / $m_divisor; // Earth semi-major axis      (WGS84)
   $e2 = 6.69437999014E-3;     // First eccentricity squared (WGS84)

   $lat_lng_count = count($lat_arr);
   if ( !$w_arr) {
      $w_arr = array_fill(0, $lat_lng_count, 1.0);
   }
   $x_arr = array();
   $y_arr = array();
   $z_arr = array();
   $rad = M_PI / 180;
   $one_e2 = 1 - $e2;
   for ($i = 0; $i < $lat_lng_count; $i++) {
      $lat = $lat_arr[$i];
      $lng = $lng_arr[$i];
      $sin_lat = sin($lat * $rad);
      $sin_lng = sin($lng * $rad);
      $cos_lat = cos($lat * $rad);
      $cos_lng = cos($lng * $rad);
      // height = 0 (!)
      $N = $a / sqrt(1 - $e2 * $sin_lat * $sin_lat);
      $x_arr[$i] = $N * $cos_lat * $cos_lng;
      $y_arr[$i] = $N * $cos_lat * $sin_lng;
      $z_arr[$i] = $N * $one_e2  * $sin_lat;
   }
   $h = $global_kernel_radius;
   $h2 = $h * $h;
   $max_K_sum     = -1;
   $max_K_sum_idx = -1;
   for ($i = 0; $i < $lat_lng_count; $i++) {
      $xi = $x_arr[$i];
      $yi = $y_arr[$i];
      $zi = $z_arr[$i];
      $K_sum  = 0;
      for ($j = 0; $j < $lat_lng_count; $j++) {
         $dx = $xi - $x_arr[$j];
         $dy = $yi - $y_arr[$j];
         $dz = $zi - $z_arr[$j];
         $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
         $K_sum += $w_arr[$j] * ($d2 <= $h2 ? (2 - $d2 / $h2) : $h2 / $d2); // Trossian ;-)
         // $K_sum += $w_arr[$j] * exp(-0.5 * $d2 / $h2); // Gaussian
      }
      if ($max_K_sum < $K_sum) {
          $max_K_sum = $K_sum;
          $max_K_sum_i = $i;
      }
   }
   $winner_x   = $x_arr  [$max_K_sum_i];
   $winner_y   = $y_arr  [$max_K_sum_i];
   $winner_z   = $z_arr  [$max_K_sum_i];
   $winner_lat = $lat_arr[$max_K_sum_i];
   $winner_lng = $lng_arr[$max_K_sum_i];

   $sin_winner_lat = sin($winner_lat * $rad);
   $cos_winner_lat = cos($winner_lat * $rad);
   $sin_winner_lng = sin($winner_lng * $rad);
   $cos_winner_lng = cos($winner_lng * $rad);
   $east_x  = -$local_grid_step * $sin_winner_lng;
   $east_y  =  $local_grid_step * $cos_winner_lng;
   $east_z  =  0;
   $north_x = -$local_grid_step * $sin_winner_lat * $cos_winner_lng;
   $north_y = -$local_grid_step * $sin_winner_lat * $sin_winner_lng;
   $north_z =  $local_grid_step * $cos_winner_lat;

   if ($local_grid_radius > 0 && $local_grid_step > 0) {
      $r = intval($local_grid_radius / $local_grid_step);
      $r2 = $r * $r;
      $h = $local_kernel_radius;
      $h2 = $h * $h;
      $max_L_sum     = -1;
      $max_L_sum_idx = -1;
      for ($i = -$r; $i <= $r; $i++) {
         $winner_east_x = $winner_x + $i * $east_x;
         $winner_east_y = $winner_y + $i * $east_y;
         $winner_east_z = $winner_z + $i * $east_z;
         $j_max = intval(sqrt($r2 - $i * $i));
         for ($j = -$j_max; $j <= $j_max; $j++) {
            $x = $winner_east_x + $j * $north_x;
            $y = $winner_east_y + $j * $north_y;
            $z = $winner_east_z + $j * $north_z;
            $L_sum  = 0;
            for ($k = 0; $k < $lat_lng_count; $k++) {
               $dx = $x - $x_arr[$k];
               $dy = $y - $y_arr[$k];
               $dz = $z - $z_arr[$k];
               $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
               if ($d2 < $h2) {
                  $L_sum += $w_arr[$k] * ($h2 - $d2); // Epanechnikov
               }
            }
            if ($max_L_sum < $L_sum) {
                $max_L_sum = $L_sum;
                $max_L_sum_i = $i;
                $max_L_sum_j = $j;
            }
         }
      }
      $x = $winner_x + $max_L_sum_i * $east_x + $max_L_sum_j * $north_x;
      $y = $winner_y + $max_L_sum_i * $east_y + $max_L_sum_j * $north_y;
      $z = $winner_z + $max_L_sum_i * $east_z + $max_L_sum_j * $north_z;

   } else if ($local_grid_radius > 0) {
      $r = $local_grid_radius;
      $r2 = $r * $r;
      $wx_sum = 0;
      $wy_sum = 0;
      $wz_sum = 0;
      $w_sum  = 0;
      for ($k = 0; $k < $lat_lng_count; $k++) {
         $xk = $x_arr[$k];
         $yk = $y_arr[$k];
         $zk = $z_arr[$k];
         $dx = $winner_x - $xk;
         $dy = $winner_y - $yk;
         $dz = $winner_z - $zk;
         $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
         if ($d2 <= $r2) {
            $wk = $w_arr[$k];
            $wx_sum += $wk * $xk;
            $wy_sum += $wk * $yk;
            $wz_sum += $wk * $zk;
            $w_sum  += $wk;
         }
      }
      $x = $wx_sum / $w_sum;
      $y = $wy_sum / $w_sum;
      $z = $wz_sum / $w_sum;
      $max_L_sum_i = false;
      $max_L_sum_j = false;

   } else {
      return array($winner_lat, $winner_lng, $max_K_sum_i, false, false);
   }

   $deg = 180 / M_PI;
   $a2 = $a * $a;
   $e4 = $e2 * $e2;
   $p = sqrt($x * $x + $y * $y);
   $zeta = (1 - $e2) * $z * $z / $a2;
   $rho  = ($p * $p / $a2 + $zeta - $e4) / 6;
   $rho3 = $rho * $rho * $rho;
   $s = $e4 * $zeta * $p * $p / (4 * $a2);
   $t = pow($s + $rho3 + sqrt($s * ($s + 2 * $rho3)), 1 / 3);
   $u = $rho + $t + $rho * $rho / $t;
   $v = sqrt($u * $u + $e4 * $zeta);
   $w = $e2 * ($u + $v - $zeta) / (2 * $v);
   $k = 1 + $e2 * (sqrt($u + $v + $w * $w) + $w) / ($u + $v);
   $lat = atan($k * $z / $p) * $deg;
   $lng = atan2($y, $x) * $deg;

   return array($lat, $lng, $max_K_sum_i, $max_L_sum_i, $max_L_sum_j);
}

事实上,距离是欧几里德,而不是大圆应该有手头上的工作的影响可以忽略。计算大圆距离会更麻烦,而且会引起的很远点仅重量要显著低级 - 但这些点已经有一个非常低的权重。在原则上,同样的效果可以通过不同的内核来实现。有一个完整的截止超越一定的距离,像Epanechnikov内核内核,没有这个问题的了(实际上)。

The fact that distances are Euclidean and not great-circle should have negligible effects for the task at hand. Calculating great-circle distances would be much more cumbersome, and would cause only the weight of very far points to be significantly lower - but these points already have a very low weight. In principle, the same effect could be achieved by a different kernel. Kernels that have a complete cut-off beyond some distance, like the Epanechnikov kernel, don't have this problem at all (in practice).

对于WGS84基准面纬度,经度和X,Y,Z之间的转换完全给出(尽管没有数值的稳定性保证)更多的是比因为真正需要的参考。如果高度是考虑到,或者如果需要更快的反变换,请参阅维基百科文章

The conversion between lat,lng and x,y,z for the WGS84 datum is given exactly (although without guarantee of numerical stability) more as a reference than because of a true need. If the height is to be taken into account, or if a faster back-conversion is needed, please refer to the Wikipedia article.

的Epanechnikov内核,除了是更多的本地比高斯和Trossian仁,有被最快为第二回路,这是O(纳克),其中g是本地网格点的数量的优点,并且还可以使用在第一循环,这是O(N²)中,如果n是大

The Epanechnikov kernel, besides being "more local" than the Gaussian and Trossian kernels, has the advantage of being the fastest for the second loop, which is O(ng), where g is the number of points of the local grid, and can also be employed in the first loop, which is O(n²), if n is big.

这篇关于我如何才能找到数据点集群的中心?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆