如何使我的GTFS查询运行得更快? [英] How can I make my GTFS queries run faster?

查看:75
本文介绍了如何使我的GTFS查询运行得更快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用GTFS数据库,即RATP为巴黎及其郊区提供的数据库.

I'm trying to play with a GTFS database, namely the one provided by the RATP for Paris and its suburbs.

数据集为巨大. stop_times表具有1400万行.

The set of data is huge. The stop_times table has 14 million rows.

这是表模式: https://github.com/mauryquijada/gtfs-mysql/blob/master/gtfs-sql.sql

我正在尝试以最有效的方式在特定位置获取可用路线.据我了解的GTFS规范,以下是这些表格及其从我的数据(经/纬度)到路线的链接:

I'm trying to get the most efficient way to get the available routes at a specific location. As far as I understand the GTFS spec, here are the tables and their links from my data (lat/lon) to the routes:

stops      | stop_times     | trips      | routes
-----------+----------------+------------+--------------
lat        | stop_id        | trip_id    | route_id
lon        | trip_id        | route_id   |
stop_id    |                |            |

为了清晰起见,我分三步(实际上是我们在上述四个表之间的三个链接)编译了我想要的东西:

I have compiled what I want in three steps (actually the three links we have between the four tables above), published under this gist for clarity: https://gist.github.com/BenoitDuffez/4eba85e3598ebe6ece5f

这是我创建此脚本的方式.

Here's how I created this script.

我能够在不到一秒钟的时间内迅速找到步行距离(例如200m)内的所有站点.我使用:

I have been able to quickly find all the stops within a walking distance (say, 200m) in less than a second. I use:

$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT stop_id, (6371000*acos(cos(radians(48.824699))*cos(radians(s.stop_lat))*cos(radians(2.3243)-radians(s.stop_lon))+sin(radians(48.824699))*sin(radians(s.stop_lat)))) AS distance
FROM stops s
GROUP BY s.stop_id
HAVING distance < 200
ORDER BY distance ASC" | awk '{print $1}'
3705271
4472979
4036891
4036566
3908953
3908755
3900765
3900693
3900607
4473141
3705272
4472978
4036892
4036472
4035057
3908952
3705288
3908814
3900832
3900672
3900752
3781623
3781622

real    0m0.797s
user    0m0.000s
sys     0m0.000s

然后,今天晚些时候(使用stop_times.departure_time > '``date +%T``')获取所有stop_times会花费很多时间:

Then, getting all the stop_times later today (with stop_times.departure_time > '``date +%T``') takes a lot of time:

"SELECT trip_id
 FROM stop_times
 WHERE
     stop_id IN ($stops) AND departure_time >= '$now'

 GROUP BY trip_id"

其中$stops包含从第一步获得的停靠点列表.这是一个示例:

With $stops containing the list of stops obtained from the first step. Here's an example:

$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT stop_id, (6371000*acos(cos(radians(
FROM stops s
GROUP BY s.stop_id
HAVING distance < 200
ORDER BY distance ASC" | awk '{print $1}'
3705271
4472979
4036891
4036566
3908953
...
9916360850964321
9916360920964320
9916360920964321

real    1m21.399s
user    0m0.000s
sys     0m0.000s

此结果中有超过2000行.

There are more than 2000 lines in this result.

我的最后一步是选择与这些trip_id相匹配的所有路线.这相当容易,而且相当快捷:

My last step was to select all routes that match these trip_ids. It's quite easy, and rather fast:

$ . mysql.ini && time mysql -h $host -u $user -p${pass} $name -e "SELECT r.id, r.route_long_name FROM trips t, routes r WHERE t.trip_id IN (`cat trip_ids | tr '\n' '#' | sed -e 's/##$//' -e 's/#/,/g'`) AND r.route_id = t.route_id GROUP BY t.route_id"
+------+-------------------------------------------------------------------------+
| id   | route_long_name                                                         |
+------+-------------------------------------------------------------------------+
|  290 | (PLACE DE CLICHY <-> CHATILLON METRO) - Aller                           |
|  291 | (PLACE DE CLICHY <-> CHATILLON METRO) - Retour                          |
|  404 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Aller  |
|  405 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Retour |
|  453 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour                   |
|  457 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour                   |
|  479 | (PORTE D'ORLEANS-METRO <-> VELIZY 2) - Retour                           |
|  810 | (PLACE DE LA LIBERATION <-> GARE MONTPARNASSE) - Aller                  |
|  989 | (PORTE D'ORLEANS-METRO) - Retour                                        |
| 1034 | (PLACE DE LA LIBERATION <-> HOTEL DE VILLE DE PARIS_4E__AR) - Aller     |
+------+-------------------------------------------------------------------------+

real    0m1.070s
user    0m0.000s
sys     0m0.000s

文件trip_ids包含2k行程ID.

With here the file trip_ids containing the 2k trip IDs.

如何更快地获得此结果?有没有比我采用的stops>stop_times>trips>routes路径更好的爬取数据的方法了?

How can I get this result faster? Is there a better way to crawl through the data rather than the stops>stop_times>trips>routes path I have taken?

对于一个实际的查询",这里的总时间约为30s:从该位置到200m的可用路线是什么?".太多了...

The total time here is around 30s for actually ONE 'query': "What are the routes available 200m from this location?". That's too much...

推荐答案

我使用的表模式是完全错误的,我应该自己构建它,或者至少在使用它之前对其进行分析.

The table schema I used was plain wrong, I should've built it myself or at least analyze it before using it.

这是更新的架构:

CREATE TABLE `agency` (
    transit_system VARCHAR(50) NOT NULL,
    agency_id VARCHAR(100),
    agency_name VARCHAR(255) NOT NULL,
    agency_url VARCHAR(255) NOT NULL,
    agency_timezone VARCHAR(100) NOT NULL,
    agency_lang VARCHAR(100),
    agency_phone VARCHAR(100),
    agency_fare_url VARCHAR(100),
    PRIMARY KEY (agency_id)
);

CREATE TABLE `calendar_dates` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    service_id VARCHAR(255) NOT NULL,
    `date` VARCHAR(8) NOT NULL,
    exception_type TINYINT(2) NOT NULL,
    KEY `service_id` (service_id),
    KEY `exception_type` (exception_type)    
);

CREATE TABLE `calendar` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    service_id VARCHAR(255) NOT NULL,
    monday TINYINT(1) NOT NULL,
    tuesday TINYINT(1) NOT NULL,
    wednesday TINYINT(1) NOT NULL,
    thursday TINYINT(1) NOT NULL,
    friday TINYINT(1) NOT NULL,
    saturday TINYINT(1) NOT NULL,
    sunday TINYINT(1) NOT NULL,
    start_date VARCHAR(8) NOT NULL, 
    end_date VARCHAR(8) NOT NULL,
    KEY `service_id` (service_id)
);

CREATE TABLE `fare_attributes` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    fare_id VARCHAR(100),
    price VARCHAR(50) NOT NULL,
    currency_type VARCHAR(50) NOT NULL,
    payment_method TINYINT(1) NOT NULL,
    transfers TINYINT(1) NOT NULL,
    transfer_duration VARCHAR(10),
    exception_type TINYINT(2) NOT NULL,
    agency_id INT(100),
    KEY `fare_id` (fare_id)
);

CREATE TABLE `fare_rules` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    fare_id VARCHAR(100),
    route_id VARCHAR(100),
    origin_id VARCHAR(100),
    destination_id VARCHAR(100),
    contains_id VARCHAR(100),
    KEY `fare_id` (fare_id),
    KEY `route_id` (route_id)
);

CREATE TABLE `feed_info` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    feed_publisher_name VARCHAR(100),
    feed_publisher_url VARCHAR(255) NOT NULL,
    feed_lang VARCHAR(255) NOT NULL,
    feed_start_date VARCHAR(8),
    feed_end_date VARCHAR(8),
    feed_version VARCHAR(100)
);

CREATE TABLE `frequencies` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    trip_id VARCHAR(100) NOT NULL,
    start_time VARCHAR(8) NOT NULL,
    end_time VARCHAR(8) NOT NULL,
    headway_secs VARCHAR(100) NOT NULL,
    exact_times TINYINT(1),
    KEY `trip_id` (trip_id)
);

CREATE TABLE `routes` (
    transit_system VARCHAR(50) NOT NULL,
    route_id VARCHAR(100),
    agency_id VARCHAR(50),
    route_short_name VARCHAR(50) NOT NULL,
    route_long_name VARCHAR(255) NOT NULL,
    route_type VARCHAR(2) NOT NULL, 
    route_text_color VARCHAR(255),
    route_color VARCHAR(255),
    route_url VARCHAR(255),
    route_desc VARCHAR(255),
    PRIMARY KEY (route_id),
    KEY `agency_id` (agency_id),
    KEY `route_type` (route_type),
    CONSTRAINT `agency_id` FOREIGN KEY (`agency_id`) REFERENCES `agency` (`agency_id`)
);

CREATE TABLE `shapes` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    shape_id VARCHAR(100) NOT NULL,
    shape_pt_lat DECIMAL(8,6) NOT NULL,
    shape_pt_lon DECIMAL(8,6) NOT NULL,
    shape_pt_sequence TINYINT(3) NOT NULL,
    shape_dist_traveled VARCHAR(50),
    KEY `shape_id` (shape_id)
);

CREATE TABLE `stops` (
    transit_system VARCHAR(50) NOT NULL,
    stop_id VARCHAR(255),
    stop_code VARCHAR(50),
    stop_name VARCHAR(255) NOT NULL,
    stop_desc VARCHAR(255),
    stop_lat DECIMAL(10,6) NOT NULL,
    stop_lon DECIMAL(10,6) NOT NULL,
    zone_id VARCHAR(255),
    stop_url VARCHAR(255),
    location_type VARCHAR(2),
    parent_station VARCHAR(100),
    stop_timezone VARCHAR(50),
    wheelchair_boarding TINYINT(1),
    PRIMARY KEY (stop_id),
    KEY `zone_id` (zone_id),
    KEY `stop_lat` (stop_lat),
    KEY `stop_lon` (stop_lon),
    KEY `location_type` (location_type),
    KEY `parent_station` (parent_station)
);

CREATE TABLE `trips` (
    transit_system VARCHAR(50) NOT NULL,
    route_id VARCHAR(100) NOT NULL,
    service_id VARCHAR(100) NOT NULL,
    trip_id VARCHAR(255),
    trip_headsign VARCHAR(255),
    trip_short_name VARCHAR(255),
    direction_id TINYINT(1), #0 for one direction, 1 for another.
    block_id VARCHAR(11),
    shape_id VARCHAR(11),
    wheelchair_accessible TINYINT(1), #0 for no information, 1 for at least one rider accommodated on wheel chair, 2 for no riders accommodated.
    bikes_allowed TINYINT(1), #0 for no information, 1 for at least one bicycle accommodated, 2 for no bicycles accommodated
    PRIMARY KEY (trip_id),
    KEY `route_id` (route_id),
    KEY `service_id` (service_id),
    KEY `direction_id` (direction_id),
    KEY `block_id` (block_id),
    KEY `shape_id` (shape_id),
    CONSTRAINT `route_id` FOREIGN KEY (`route_id`) REFERENCES `routes` (`route_id`),
    CONSTRAINT `service_id` FOREIGN KEY (`service_id`) REFERENCES `calendar` (`service_id`)
);

CREATE TABLE `stop_times` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    trip_id VARCHAR(100) NOT NULL,
    arrival_time VARCHAR(8) NOT NULL,
    arrival_time_seconds INT(100),
    departure_time VARCHAR(8) NOT NULL,
    departure_time_seconds INT(100),
    stop_id VARCHAR(100) NOT NULL,
    stop_sequence VARCHAR(100) NOT NULL,
    stop_headsign VARCHAR(50),
    pickup_type VARCHAR(2),
    drop_off_type VARCHAR(2),
    shape_dist_traveled VARCHAR(50),
    KEY `trip_id` (trip_id),
    KEY `arrival_time_seconds` (arrival_time_seconds),
    KEY `departure_time_seconds` (departure_time_seconds),
    KEY `stop_id` (stop_id),
    KEY `stop_sequence` (stop_sequence),
    KEY `pickup_type` (pickup_type),
    KEY `drop_off_type` (drop_off_type),
    CONSTRAINT `trip_id` FOREIGN KEY (`trip_id`) REFERENCES `trips` (`trip_id`),
    CONSTRAINT `stop_id` FOREIGN KEY (`stop_id`) REFERENCES `stops` (`stop_id`)
);

CREATE TABLE `transfers` (
    id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
    transit_system VARCHAR(50) NOT NULL,
    from_stop_id INT(100) NOT NULL,
    to_stop_id VARCHAR(8) NOT NULL,
    transfer_type TINYINT(1) NOT NULL,
    min_transfer_time VARCHAR(100)
);

我已将xyz_id键作为PRIMARY KEY放置在自己的表中,而在其他表中将其作为FOREIGN KEY. 我仍然需要对该模式进行一些优化.

I've put the xyz_id keys as PRIMARY KEY in their own table and as FOREIGN KEY in the others.
I still have some optimizations to make to this schema.

现在此查询将在1-5秒内起作用:

Now this query works in less than 1-5 seconds:

SELECT
    s.stop_id,
    (6371000*acos(cos(radians(48.1128135))*cos(radians(s.stop_lat))*cos(radians(-1.6470705)-radians(s.stop_lon))+sin(radians(48.1128135))*sin(radians(s.stop_lat)))) AS distance,
    t.route_id,
    st.*,
    t.*,
    r.*,
    c.*

FROM stop_times st

LEFT JOIN stops s USING (stop_id)
LEFT JOIN trips t USING (trip_id)
LEFT JOIN routes r USING (route_id)

LEFT JOIN calendar c ON c.service_id = t.service_id 

where
    c.start_date <= 20140915
    and c.end_date >= 20140915
    and c.sunday = 1

    and st.departure_time > '15:00:00'

HAVING
    distance < 200

ORDER BY st.departure_time ASC

这篇关于如何使我的GTFS查询运行得更快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆