Parsing NYC Transit/MTA historical GTFS data (not realtime)


Problem description

I've been puzzling over this on and off for months and can't find a solution.

The MTA claims to provide historical data in the form of daily dumps in GTFS format here: http://web.mta.info/developers/MTA-Subway-Time-historical-data.html

See for yourself by downloading the example they provide, in this case for Sep 17th, 2014: https://datamine-history.s3.amazonaws.com/gtfs-2014-09-17-09-31

My problem? The file is gobbledygook. It does not follow the GTFS specification, has no extension, and when I open it in a text editor it looks like 7800 lines of this:

n ^C1.0^X �枪�^Eʞ>` ^C1.0^R^K ^A1^R^F^P����^E^R^K ^A2^R^F^P����^E^R^K ^A3^R^F^P����^E^R^K ^A4^R^F^P����^E^R^K ^A5^R^F^P����^E^R^K ^A6^R^F^P����^E^R^K ^AS^R^F^P����^E^R[ ^F000001^ZQ 6 ^N050400_1..S02R^Z^H20140917*^A1�>^V ^P01 0824 242/SFY^P^A^X^C^R^W^R^F^Pɚ��^E"^D140Sʚ>^F ^AA^R^AA^RR ^F000002"H 6

Per the MTA site (this appears to be untrue):

All data is formatted in GTFS-realtime

Any ideas on the steps necessary to transform this mystery file into usable GTFS data? Is there some encoding I am missing? I have looked for 10+ and been unable to come up with a solution.

Also, not to be a stickler, but I am NOT referring to the MTA's realtime data feed, which is correctly formatted and usable. I am specifically referring to the historical data dumps I reference above (I have received many "solutions" that refer only to the realtime data feed).

Recommended answer

The file you link to is in GTFS-realtime format, not GTFS, and the page you linked to does a very bad job of explaining which format its data is actually in (though it is mentioned in your quote).

GTFS is used to store schedule data, like routes and scheduled arrival times.

GTFS-realtime is generally used to transfer actual transit performance data in real time, such as vehicle locations and expected or actual arrival times. It is a protobuf (Protocol Buffers), a binary serialization format published by Google, which means you can't usefully read it in a text editor; you have to load it programmatically using the Google protobuf tools. It can also serve as a historical data format, as the MTA does here, by making daily dumps of the GTFS-rt feed publicly available. It's called GTFS-realtime because various fields in the realtime feed, like route_id, trip_id, and stop_id, are designed to link to the published GTFS schedules.
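To illustrate why the file looks like noise in a text editor, here is a minimal sketch of the protobuf wire format using only the Python standard library. It is not a replacement for the protobuf tools: the helper names (read_varint, encode_varint, split_fields) are hypothetical, only two wire types are handled, and the message is a hand-built stand-in for the start of a feed rather than bytes taken from the MTA file. It assumes the field numbers from gtfs-realtime.proto (header is field 1 of FeedMessage; gtfs_realtime_version is field 1 and timestamp is field 3 of FeedHeader).

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def split_fields(buf):
    """Split one message into (field_number, wire_type, value) tuples.

    Only wire types 0 (varint) and 2 (length-delimited) are handled,
    which is enough for this demonstration.
    """
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


# Hand-built FeedMessage containing only a FeedHeader (illustrative bytes).
# 0x0a = field 1, length-delimited; 0x18 = field 3, varint.
header = b"\x0a\x03" + b"1.0" + b"\x18" + encode_varint(1410960621)
feed = b"\x0a" + encode_varint(len(header)) + header

header_bytes = split_fields(feed)[0][2]
header_fields = split_fields(header_bytes)
print(header_fields)
```

The length byte 0x03 in front of "1.0" and the tag byte 0x18 in front of the timestamp are control characters, which an editor typically renders as ^C and ^X. That matches the "^C1.0^X" fragment at the start of the dump quoted in the question.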

I confirmed the validity of the data you linked to by decoding it using the gtfs-realtime.proto specification and the Google protobuf tools for Python. It begins:

header {
  gtfs_realtime_version: "1.0"
  timestamp: 1410960621
}
entity {
  id: "000001"
  trip_update {
    trip {
      trip_id: "050400_1..S02R"
      start_date: "20140917"
      route_id: "1"
    }
    stop_time_update {
      arrival {
        time: 1410960713
      }
      stop_id: "140S"
    }
  }
}
...

and continues in that vein for a total of 55833 lines (in the default string output format).
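The timestamp and time values in the output above are POSIX times (seconds since 1970-01-01 UTC), per the GTFS-realtime specification. As a small sketch, they can be converted to readable datetimes with the standard library; the helper name epoch_to_utc is hypothetical, and the result is in UTC rather than New York local time.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert POSIX seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Values copied from the decoded output above.
feed_time = epoch_to_utc(1410960621)     # header.timestamp
arrival_time = epoch_to_utc(1410960713)  # first stop_time_update arrival

print(feed_time.isoformat())
print(arrival_time.isoformat())
print((arrival_time - feed_time).total_seconds())
```

The feed timestamp works out to 13:30:21 UTC on 2014-09-17, which is consistent with the 09-31 suffix in the file name if that suffix is taken to be New York local time (an assumption; the source does not say).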

EDIT: The Python script used to convert the protobuf into its string representation is very simple:

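import gtfs_realtime_pb2 as gtfs_rt

# Read the raw protobuf bytes downloaded from the MTA.
with open('gtfs-rt.pb', 'rb') as f:
    raw_bytes = f.read()

# Decode the bytes into a FeedMessage using the compiled GTFS-realtime schema.
msg = gtfs_rt.FeedMessage()
msg.ParseFromString(raw_bytes)

# Print the default text representation.
print(msg)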

This requires gtfs-realtime.proto to have been compiled into gtfs_realtime_pb2.py using protoc (following the instructions in the Python protobuf documentation under "Compiling Your Protocol Buffers") and placed in the same directory as the Python script. Furthermore, the binary protobuf downloaded from the MTA needs to be named gtfs-rt.pb and located in the same directory as the Python script.
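Once the feed has been parsed, the text dump is usually less useful than tabular rows. The following is a rough sketch, not part of the original script, of flattening trip updates into tuples. The function name trip_update_rows and the choice of columns are assumptions; the field names come from the decoded output above and from gtfs-realtime.proto, and HasField is the standard protobuf method for testing whether an optional message field is set.

```python
def trip_update_rows(feed):
    """Yield (trip_id, route_id, start_date, stop_id, arrival_time) tuples.

    `feed` is expected to be a parsed FeedMessage. Entities without a
    trip_update (for example vehicle positions or alerts) are skipped.
    """
    for entity in feed.entity:
        if not entity.HasField("trip_update"):
            continue
        trip = entity.trip_update.trip
        for stu in entity.trip_update.stop_time_update:
            # A stop may carry only a departure, so arrival can be absent.
            arrival = stu.arrival.time if stu.HasField("arrival") else None
            yield (trip.trip_id, trip.route_id, trip.start_date,
                   stu.stop_id, arrival)
```

With the msg object from the script above, list(trip_update_rows(msg)) would give one row per predicted stop, which can then be written out with the csv module or loaded into a database.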
