Making GTFS Shape Data Manageable

One of the files that may be included by transit agencies in GTFS feeds is the shapes.txt file. This file describes the paths that buses or trains (or any route type, for that matter) take on their trips.

Every record in shapes.txt corresponds to a single point for a single shape. Some shapes may consist of thousands of points. While it’s awesome that some agencies provide this level of detail, some devices may struggle to represent this data efficiently.
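To make the one-record-per-point layout concrete, here is a minimal Python sketch that groups the rows of a shapes.txt file into ordered point lists. The column names (shape_id, shape_pt_lat, shape_pt_lon, shape_pt_sequence) come from the GTFS spec; the load_shapes function and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_shapes(csv_file):
    """Group shapes.txt rows into ordered (lat, lon) lists keyed by shape_id."""
    shapes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        shapes[row["shape_id"]].append(
            (
                int(row["shape_pt_sequence"]),
                float(row["shape_pt_lat"]),
                float(row["shape_pt_lon"]),
            )
        )
    # Rows are not guaranteed to arrive in order, so sort by sequence number.
    return {
        shape_id: [(lat, lon) for _, lat, lon in sorted(points)]
        for shape_id, points in shapes.items()
    }


# Invented sample data: two shapes, with shape A listed out of order.
SAMPLE = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
A,-31.9505,115.8605,2
A,-31.9500,115.8600,1
B,-32.0000,115.9000,1
"""

shapes = load_shapes(io.StringIO(SAMPLE))
```

With a real feed you would pass an open file handle instead of the in-memory sample.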

One of the challenges I faced with TransitTimes was the memory constraints of iOS devices. Rendering a polyline with 2000 points is just not feasible using iOS MKMapView.

The following image shows a portion of a shape from Transperth’s GTFS feed. This shape is made up of 400 points (the entire shape is actually 1700 points – I cut this down for simplicity).

As you might imagine, an iPhone 3G struggled immensely trying to draw this map, and once drawn it was almost impossible to actually scroll around the map. (In actual fact, the iPhone 4 also struggled.)

This level of detail has no benefit to the user – especially if they can’t interact with their device.

The following image shows the exact same shape, but this time it’s made up of 70 points. This is far more manageable for mobile devices.

To automate this reduction of points, I used the Douglas-Peucker algorithm. This is a recursive algorithm that, given a start point and an end point, finds the intermediate point that lies furthest (measured perpendicularly) from the straight line joining them.

The following image demonstrates this.

If this furthest point is within a given tolerance of the line, it is discarded along with every other point between the two end points. If it’s outside the tolerance, the point is kept and the algorithm continues. When it continues, the algorithm is called twice: once for the line connecting the original start point and the new point, and again for the line connecting the new point and the original end point.
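The steps above can be sketched in Python as follows. This is not the PHP implementation from the PhpRiot article, just a minimal illustration. It assumes points are (x, y) tuples treated as flat coordinates, which is only an approximation for latitude and longitude, and the function names and sample path are made up for the example.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from point to the straight line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length = math.hypot(dx, dy)
    if length == 0:
        # Start and end coincide, so fall back to point-to-point distance.
        return math.hypot(px - sx, py - sy)
    return abs(dy * (px - sx) - dx * (py - sy)) / length


def douglas_peucker(points, tolerance):
    """Return a reduced copy of points, always keeping the first and last."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    # Find the intermediate point furthest from the start-end line.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = perpendicular_distance(points[i], start, end)
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= tolerance:
        # Everything between start and end is close enough to drop.
        return [start, end]
    # Keep the furthest point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    # The kept point ends left and starts right, so drop one copy.
    return left[:-1] + right


# Invented sample path: (1, 0.01) barely leaves the line, (3, 2) is a real corner.
path = [(0, 0), (1, 0.01), (2, 0), (3, 2), (4, 0)]
reduced = douglas_peucker(path, 0.1)
```

Here the tolerance is in the same units as the coordinates, so for raw latitude and longitude it would be a small fraction of a degree. A larger tolerance removes more points at the cost of a rougher line.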

I’ve written an article on PhpRiot about implementing this algorithm in PHP: Reducing a Map Path Using Douglas-Peucker Algorithm.

Using this algorithm, I’ve managed to remove upwards of 95% of shape points from GTFS files I’ve processed, without any significant loss of quality.

The other point I haven’t touched on: removing more than 95% of shape data also results in significant savings of space. For instance, the shapes.txt file for Transperth is over 40 MB in size. Reducing this file by 95% would make the file about 2 MB.

What Is GTFS?

Since this blog will dedicate many posts to General Transit Feed Specification – GTFS – let me provide an introduction.

Firstly, from the official documentation:

The GTFS transit feed specification defines a common format for public transportation schedules and associated geographic information.

The “G” in GTFS is commonly thought of as standing for Google, since they created this specification so they can display transit info on Google Maps.

Transit agencies typically distribute their feed as a ZIP file (usually via a link on their web site) that contains a number of CSV (comma-separated values) files, each of which is described in the official documentation.
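As a rough illustration of that packaging, the following Python sketch reads one CSV file straight out of a feed's ZIP archive using only the standard library. The read_gtfs_table helper is a hypothetical name, and the tiny in-memory feed is invented so the example runs without downloading anything.

```python
import csv
import io
import zipfile


def read_gtfs_table(feed, filename):
    """Return the rows of one CSV file in a GTFS ZIP as dicts ([] if absent)."""
    with zipfile.ZipFile(feed) as archive:
        if filename not in archive.namelist():
            # Agencies leave out optional files such as shapes.txt.
            return []
        with archive.open(filename) as raw:
            # utf-8-sig strips a byte order mark if the file starts with one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


# Build a tiny invented feed in memory in place of a downloaded ZIP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "routes.txt", "route_id,route_short_name,route_type\nR1,950,3\n"
    )

routes = read_gtfs_table(buffer, "routes.txt")
shape_rows = read_gtfs_table(buffer, "shapes.txt")
```

With a real feed, the first argument would simply be the path to the downloaded ZIP file.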

Agencies distribute updates to their feeds at their own intervals. For instance, Auckland provides one month of data at a time, while others may provide 6 or 12 months’ worth of data.

The GTFS format isn’t overly strict, meaning:

  • There’s no fixed format for identifiers (such as route_id or trip_id)
  • Agencies tend to omit fields as they please
  • Agencies sometimes add fields that aren’t in the spec (this doesn’t really matter – you can choose to use the extra info provided, or simply have your CSV parser ignore it)
  • Many agencies tend to misuse the calendar.txt and calendar_dates.txt files
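One way to cope with omitted and extra fields is to read each file by column name and decide up front which columns you actually need. The sketch below does this for trips.txt in Python. The REQUIRED and OPTIONAL groupings are a simplified assumption based on the spec, parse_trips is a hypothetical helper, and the agency_note column is invented to stand in for a non-standard field.

```python
import csv
import io

# Simplified assumption: the columns this example insists on, and the ones
# it can live without.
REQUIRED = ("route_id", "service_id", "trip_id")
OPTIONAL = {"direction_id": None, "shape_id": None}


def parse_trips(csv_file):
    """Parse trips.txt rows, tolerating missing optional and unknown columns."""
    reader = csv.DictReader(csv_file)
    missing = [name for name in REQUIRED if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"trips.txt is missing required columns: {missing}")
    trips = []
    for row in reader:
        trip = {name: row[name] for name in REQUIRED}
        for name, default in OPTIONAL.items():
            # Treat an absent column and an empty value the same way.
            trip[name] = row.get(name) or default
        # Any column not named above (such as agency_note) is ignored.
        trips.append(trip)
    return trips


# Invented sample: no direction_id or shape_id, plus a non-standard column.
SAMPLE = "route_id,service_id,trip_id,agency_note\nR1,WK,T1,extra\n"
trips = parse_trips(io.StringIO(SAMPLE))
```

Defaulting absent optional values to None keeps the rest of the code from caring whether an agency left the column out or left the cell empty.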

There’s no versioning included in GTFS. Agencies have adopted GTFS at different stages of its lifetime, so they build their files differently. For instance, when Perth adopted GTFS, the direction_id field in trips.txt wasn’t part of the spec. As a result, the “inbound” and “outbound” directions of the same route are treated as separate entries in the routes.txt file.

In later posts I’m going to discuss some of the various differences and challenges associated with the feeds of various cities.

Hopefully this brief introduction gives you some idea of GTFS. It’s a somewhat simple specification, although some aspects are hard to comprehend until you’ve dealt with the data for a while.

Introduction

TransitTimes is an iOS application available for many different cities that provides offline timetable and trip planning functionality for public transit systems.

This presents many challenges, such as:

  • Every city’s public transit system is structured differently. Some have trains, others have subways, others have ferries, some have all of the above. All (so far) have buses
  • Every city that provides an open data feed (using a spec such as GTFS) structures their data differently.
  • Big cities have a lot of data
  • Mobile devices have limited storage space and computing power

This blog will discuss these challenges and how TransitTimes deals (or may deal) with them.

Essentially, it’s a blog about programming, but primarily geared towards public transit systems.