Copyright © Bill Seymour 2023.
Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) |
This paper suggests an open source design for a very simple SQL database that can store enough information to generate timetables for Amtrak trains and timetable-like documents showing the on-time performance of particular trains on particular days. We could also use the data to generate historical on-time performance statistics or the likelihood of making connections.
This assumes that we’re using PostgreSQL which seems to have all the ISO-standard datetime types. (All the others that I know of are substandard, I guess for vendor lock-in reasons.) Assume that PostgreSQL’s IntervalStyle is set to sql_standard.
The stations and splits tables are intended as static lookup tables that probably never change. The trains, schedules, trips and consists tables are the interesting ones.
This paper is intended as just a brief description of what I have in mind. The actual DDL for creating tables and indices will be much more verbose.
This design is distributed under the Boost Software License, but it’s not part of Boost. The author just likes their open-source license.
CREATE TABLE stations ( station_code CHAR(3) NOT NULL PRIMARY KEY, station_name VARCHAR(50) NOT NULL, more_info VARCHAR(40), time_zone VARCHAR(44) NOT NULL ); CREATE TABLE trains ( train_nbr SMALLINT NOT NULL, effective_date DATE NOT NULL, discontinue_date DATE NOT NULL DEFAULT DATE '2099-12-31' train_name VARCHAR(25) NOT NULL, more_info VARCHAR(25), frequency VARCHAR(4) NOT NULL, binfreq SMALLINT NOT NULL, PRIMARY KEY (train_nbr, effective_date) ); CREATE INDEX train_name_idx ON trains(train_name); CREATE OR REPLACE FUNCTION frequency_trigger RETURNS trigger AS $frtr$ BEGIN IF NEW.frequency IS NULL THEN NEW.frequency := '0000'; END IF; IF NEW.frequency = '0000' THEN NEW.binfreq := 127; RETURN NEW; END IF; IF UPPER(SUBSTRING(NEW.frequency,1,1)) != 'X' THEN NEW.binfreq := 0; FOR idx IN 1..LENGTH(NEW.frequency) LOOP NEW.binfreq := NEW.binfreq | (1 << (TO_NUMBER(SUBSTRING(NEW.frequency,idx,1)) - 1)); END LOOP; RETURN NEW; END IF; NEW.binfreq := 127; FOR idx IN 2..LENGTH(NEW.frequency) LOOP NEW.binfreq := NEW.binfreq & ~(1 << (TO_NUMBER(SUBSTRING(NEW.frequency,idx,1)) - 1)); END LOOP; RETURN NEW; END; $frtr$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER freq_trg BEFORE INSERT OR UPDATE OF frequency ON trains FOR EACH ROW EXECUTE FUNCTION frequency_trigger(); CREATE TABLE splits ( child_nbr SMALLINT NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES trains(train_nbr), parent_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr) ); CREATE TABLE schedules ( train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr), ordinal SMALLINT NOT NULL, station_code CHAR(3) NOT NULL FOREIGN KEY REFERENCES stations(station_code), effective_date DATE NOT NULL, discontinue_date DATE NOT NULL DEFAULT DATE '2099-12-31', arrival_time INTERVAL DAY TO MINUTE, departure_time INTERVAL DAY TO MINUTE, PRIMARY KEY (train_nbr, ordinal, effective_date), CHECK (departure_time IS NOT NULL OR arrival_time IS NOT NULL) ); CREATE VIEW timetables AS SELECT sk.effective_date, sk.discontinue_date, sk.train_nbr, sk.ordinal, sk.station_code, st.station_name, st.more_info, st.time_zone, EXTRACT(DAY FROM sk.arrival_time) + 1 AS scheduled_arrival_day, TO_CHAR(sk.arrival_time, 'HH24:MM') AS scheduled_arrival_time, EXTRACT(DAY FROM sk.departure_time) + 1 AS scheduled_departure_day, TO_CHAR(sk.departure_time, 'HH24:MM') AS scheduled_departure_time FROM schedules sk, stations st WHERE st.station_code = sk.station_code; CREATE TABLE trips ( departure_date DATE NOT NULL, train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr), station_code CHAR(3) NOT NULL FOREIGN KEY REFERENCES stations(station_code), arrival_time INTERVAL DAY TO MINUTE, departure_time INTERVAL DAY TO MINUTE, comments VARCHAR(255), PRIMARY KEY (departure_date, train_nbr, station_code) ); CREATE TABLE trip_comments ( departure_date DATE NOT NULL FOREIGN KEY REFERENCES trips(departure_date), train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trips(train_nbr), comments VARCHAR(255), PRIMARY KEY (departure_date, train_nbr) }; CREATE VIEW performance AS SELECT tr.departure_date, tr.train_nbr, sk.ordinal, tr.station_code, st.station_name, st.more_info, st.time_zone, EXTRACT(DAY FROM sk.arrival_time) + 1 AS scheduled_arrival_day, TO_CHAR(sk.arrival_time, 'HH24:MM') AS scheduled_arrival_time, EXTRACT(DAY FROM tr.arrival_time) + 1 AS actual_arrival_day, TO_CHAR(tr.arrival_time, 'HH24:MM') AS actual_arrival_time, EXTRACT(DAY FROM sk.departure_time) + 1 AS scheduled_departure_day, TO_CHAR(sk.departure_time, 'HH24:MM') AS scheduled_departure_time, EXTRACT(DAY FROM tr.departure_time) + 1 AS actual_departure_day, TO_CHAR(tr.departure_time, 'HH24:MM') AS actual_departure_time, tr.comments AS stop_comment, tc.comments AS trip_comment FROM trips tr, schedules sk, stations st WHERE tr.train_nbr = sk.train_nbr AND tr.station_code = sk.station_code AND tr.departure_date BETWEEN sk.effective_date AND sk.discontinue_date AND st.station_code = tr.station_code LEFT OUTER JOIN trip_comments tc ON tc.departure_date = tr.departure_date AND tc.train_nbr = tr.train_nbr; CREATE TABLE consists ( departure_date DATE NOT NULL FOREIGN KEY REFERENCES trips(departure_date), train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trips(train_nbr), position SMALLINT NOT NULL, equipment_nbr VARCHAR(8) NOT NULL, line_nbr VARCHAR(4), description VARCHAR(50), PRIMARY KEY (departure_date, train_nbr, position) );
CREATE TABLE stations ( station_code CHAR(3) NOT NULL PRIMARY KEY, station_name VARCHAR(50) NOT NULL, more_info VARCHAR(40), time_zone VARCHAR(44) NOT NULL );Examples:
{ 'STL', 'St. Louis, MO', NULL, 'America/Chicago' } { 'FLG', 'Flagstaff, AZ', NULL, 'America/Phoenix' } { 'LAK', 'Lakeland, FL', 'Departure 92 North Arrival 91 South', 'America/New_York' } { 'LKL', 'Lakeland, FL', 'Departure 91 North Arrival 92 South', 'America/New_York' } { 'BOS', 'Boston, MA', 'South Station', 'America/New_York' } { 'BBY', 'Boston, MA', 'Back Bay', 'America/New_York' } { 'RTE', 'Boston, MA', 'Route 128', 'America/New_York' }
We’ll use IANA time zone names because, when generating performance documentation for particular trains, since we’d know the day and time, we can display the complete time zone abbreviation showing whether a station is observing winter or summer time. This will also make it easy to calculate elapsed running times between stations in different time zones if we ever need to do that; and we’ll have historical information about the time zone (like around winter/summer transitions).
We might be tempted to use just the time zone strings that appear in Amtrak’s published timetables, ET, CT, MT, PT, and MST; but then we’d lose all the historical information in the IANA database. (And note that some states are beginning to make noises about “permanent daylight saving time” which could result in further complications. IIRC, there’s even such a bill wending its way through the U.S. Senate.)
Note that there can be two codes for the same station if the station is visited twice by a single train, so we’ll have a more_info column that could be NULL or could explain what the difference is. more_info could also be used to tell which station we’re talking about in a city that has more than one station, for example, the three stations in and around Boston.
We can get all of this data, including the IANA time zone names, from Amtrak’s GTFS feed; but we might need some manual intervention to split Amtrak’s station names into the station_name and more_info columns. Fortunately, we’ll need to do that only once for the whole shebang when the table is initially loaded, and then maybe on a per-station basis in the rare instance when a new station gets added. Indeed, the initial load and maintenance of this table could be a DBA activity with regular users having only SELECT access.
CREATE TABLE splits ( child_nbr SMALLINT NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES trains(train_nbr), parent_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr) );All that exist at present:
{ 27, 7 } { 28, 8 } { 421, 1 } { 422, 2 } { 448, 48 } { 449, 49 }This is intended mainly for loading the schedules and trips tables below when two trains split or recombine at some station and we don’t have the data for when they’re running as the same train. For example, the schedule for train 27 that we get from Dixieland Software includes only the Spokane through Portland stops. This table tells us that we can look at train 7 to get the data for SPK arrival and earlier stations when loading data for train 27. Similarly, we’d need to look at train 8 to get the data for SPK departure and later stations when loading data for train 28.
We need this only because of the incomplete data that we’ll get if we rely on Dixieland Software for schedules and trips.
As with the stations table, maintenance should probably be a DBA activity with regular users having only SELECT access.
CREATE TABLE trains ( train_nbr SMALLINT NOT NULL, effective_date DATE NOT NULL, discontinue_date DATE NOT NULL DEFAULT DATE '2099-12-31' train_name VARCHAR(25) NOT NULL, more_info VARCHAR(25), frequency VARCHAR(4) NOT NULL, binfreq SMALLINT NOT NULL, PRIMARY KEY (train_nbr, effective_date) ); CREATE INDEX train_name_idx ON trains(train_name);Examples:
{ 21, DATE '2015-04-06', DATE '2022-10-11, 'Texas Eagle', NULL, '0000', 127 /*1111111*/} { 21, DATE '2020-10-12', DATE '2022-05-17', 'Texas Eagle', NULL, '257', 82 /*1010010*/} { 21, DATE '2022-05-18', [default], 'Texas Eagle', NULL, '0000', 127 /*1111111*/} { 421, DATE '2015-04-06', [default], 'Texas Eagle', 'to Los Angeles', '257', 82 /*1010010*/} { 2150, DATE '2015-04-06', [default], 'Acela Express', NULL, 'X67', 31 /*0011111*/}
(2015-04-06 is the effective date of an old printed timetable that I have. I’m using it in the examples above to mean “long ago”.) |
This table just gives us a train name, maybe some additional information, and a frequency (the days of the week that a train is scheduled to depart its origin) given a train number and a date.
We might also want a non-unique index on train_name so that we can quickly get, for example, 21 and 421 when searching for “Texas Eagle”.
Trains can change their names (like Ann Rutledge, Lincoln Service, Missouri River Runner), and they can also change their frequencies (and did during COVID as in the train 21 example above), so we’ll need effective and discontinue dates.
Unfortunately, the discontinue date should probably be the last actual date that this row is in effect (rather than a one-past-the-end value) so that we can use SQL’s BETWEEN operator in WHERE clauses. They also can’t be NULL since all comparisons with NULL are false, so in this table and the schedules table that follows, we’ll use some obviously out-of-band value (like the end of 2099) to mean “until the end of time”.
Frequencies can be inferred from the schedule files that we get from Dixieland Software.
frequency is a string of at most four characters that’s widely used in the travel industry: '1' is for Monday, '7' is for Sunday. 'X' means “except”. Note that '1234' and 'X567' are two ways to write the same frequency. '0000' is a special value that means “daily”.
binfreq is just a binary version of the frequency: bit 0 (the value 1) is Monday, bit 6 (the value 64) is Sunday. The idea is to make it easy to do bitwise logical operations that are available in many programming languages.
I’m not aware of any DDL syntax to ensure consistency between frequency and binfreq, and that’ll probably be error-prone since most databases have no syntax for writing binary literals, so we’ll probably want a trigger that sets binfreq to the correct value before any INSERT or UPDATE of frequency. One way to write such a trigger is shown immediately below.
The following is written in PostgreSQL’s PL/pgSQL. Note that NEW.frequency is permitted to be NULL in which case it defaults to '0000' (daily).
CREATE OR REPLACE FUNCTION frequency_trigger RETURNS trigger AS $frtr$ BEGIN IF NEW.frequency IS NULL THEN NEW.frequency := '0000'; END IF; IF NEW.frequency = '0000' THEN NEW.binfreq := 127; RETURN NEW; -- NB: early return END IF; IF UPPER(SUBSTRING(NEW.frequency,1,1)) != 'X' THEN -- start with 0 and set a bit for each day NEW.binfreq := 0; FOR idx IN 1..LENGTH(NEW.frequency) LOOP NEW.binfreq := NEW.binfreq | (1 << (TO_NUMBER(SUBSTRING(NEW.frequency,idx,1)) - 1)); END LOOP; RETURN NEW; -- another early return END IF; -- freq begins with 'X' -- start with 1111111 and clear a bit for each day NEW.binfreq := 127; FOR idx IN 2..LENGTH(NEW.frequency) -- NB: start with position 2 LOOP NEW.binfreq := NEW.binfreq & ~(1 << (TO_NUMBER(SUBSTRING(NEW.frequency,idx,1)) - 1)); END LOOP; RETURN NEW; END; $frtr$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER freq_trg BEFORE INSERT OR UPDATE OF frequency ON trains FOR EACH ROW EXECUTE FUNCTION frequency_trigger();
CREATE TABLE schedules ( train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr), ordinal SMALLINT NOT NULL, station_code CHAR(3) NOT NULL FOREIGN KEY REFERENCES stations(station_code), effective_date DATE NOT NULL, discontinue_date DATE NOT NULL DEFAULT DATE '2099-12-31', arrival_time INTERVAL DAY TO MINUTE, departure_time INTERVAL DAY TO MINUTE, PRIMARY KEY (train_nbr, ordinal, effective_date), CHECK (departure_time IS NOT NULL OR arrival_time IS NOT NULL) );Example:
{ 2150, 0, 'NYP', DATE '2015-04-06', [default], NULL, INTERVAL '08:02' HOUR TO MINUTE } { 2150, 1, 'STM', DATE '2015-04-06', [default], NULL, INTERVAL '08:47' HOUR TO MINUTE } { 2150, 2, 'PVD', DATE '2015-04-06', [default], NULL, INTERVAL '10:57' HOUR TO MINUTE } { 2150, 3, 'RTE', DATE '2015-04-06', [default], NULL, INTERVAL '11:24' HOUR TO MINUTE } { 2150, 4, 'BBY', DATE '2015-04-06', [default], NULL, INTERVAL '11:33' HOUR TO MINUTE } { 2150, 5, 'BOS', DATE '2015-04-06', [default], INTERVAL '11:39' HOUR TO MINUTE, NULL }
This table holds the intended schedules for the trains. Schedules are subject to change, and there could be several for the same train number. This is why we have effective and discontinue dates.
The effective date is the earliest date that the train departs its origin, that is, the date when this schedule becomes the right one. Note that, on any given day, a train can have as many schedules as the number of days that it takes to get from its origin to its destination around any day that a schedule change goes into effect, for example, around winter/summer time changes. This affects any train that runs overnight.
The ordinal column is just a stop number that’s monotonically increasing from zero. We include this only for sorting the stops correctly.
The arrival and departure times are days, hours and minutes after midnight on the origin’s departure day, but eagerly adjusted to local wall clock time so that we don’t have to lazily figure time zone differences when displaying times of day to H. sapiens. (We’ll have to take time zones into account when computing elapsed time between stations, but it doesn’t seem like we’ll need to do that very often.)
CREATE VIEW timetables AS SELECT sk.effective_date, sk.discontinue_date, sk.train_nbr, sk.ordinal, sk.station_code, st.station_name, st.more_info, st.time_zone, EXTRACT(DAY FROM sk.arrival_time) + 1 AS scheduled_arrival_day, TO_CHAR(sk.arrival_time, 'HH24:MM') AS scheduled_arrival_time, EXTRACT(DAY FROM sk.departure_time) + 1 AS scheduled_departure_day, TO_CHAR(sk.departure_time, 'HH24:MM') AS scheduled_departure_time FROM schedules sk, stations st WHERE st.station_code = sk.station_code;
This can be used to generate the data needed to create one or more timetables that H. sapiens might understand. Presumably we’ll restrict on effective and discontinue dates and/or train numbers. Note that we add 1 to the days returned by the EXTRACT functions to make the days 1-based instead of 0-based.
The data won’t come up in the correct station order, so we'll need to do our own ORDER BY sk.ordinal.
Let’s say we want to create a timetable for the westbound Empire Builder to Seattle that’ll be correct a week from today:
SELECT * FROM timetables WHERE CURRENT_DATE + INTERVAL '7' DAYS BETWEEN effective_date AND discontinue_date AND train_nbr = 7 ORDER BY ordinal;
We could also get both the westbound and eastbound data for generating one of those read-down-read-up timetables by restricting on train_nbr IN (7,8) and sorting on train_nbr followed by ordinal. That’ll get us all of train 7 followed by all of train 8.
CREATE TABLE trips ( departure_date DATE NOT NULL, train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trains(train_nbr), station_code CHAR(3) NOT NULL FOREIGN KEY REFERENCES stations(station_code), arrival_time INTERVAL DAY TO MINUTE, departure_time INTERVAL DAY TO MINUTE, comments VARCHAR(255), PRIMARY KEY (departure_date, train_nbr, station_code) );Example:
{ DATE '2023-01-19', 2150, 'NYP', NULL, INTERVAL '08:04' HOUR TO MINUTE, NULL } { DATE '2023-01-19', 2150, 'STM', NULL, INTERVAL '08:53' HOUR TO MINUTE, NULL } { DATE '2023-01-19', 2150, 'PVD', NULL, INTERVAL '11:06' HOUR TO MINUTE, NULL } { DATE '2023-01-19', 2150, 'RTE', NULL, INTERVAL '11:27' HOUR TO MINUTE, NULL } { DATE '2023-01-19', 2150, 'BBY', NULL, INTERVAL '11:37' HOUR TO MINUTE, NULL } { DATE '2023-01-19', 2150, 'BOS', INTERVAL '11:43' HOUR TO MINUTE, NULL, NULL }This is a table of particular runs of trains that have arrived at their final destination, or at the the last stop before a service disruption, in the recent past.
This could also be used for trains that are currently running if it gets updated in real time. In this case, we might want to eagerly load the whole schedule for a particular train as soon as it departs its origin so that we’d need to do only UPDATEs, not INSERTs, when we get new data. If we do this, then most of the times will be NULL most of the time, so we can’t have the check constraint on departure and arrival times like we did in the schedules table. When displaying the data, we’ll join to the schedules table and ORDER BY trips.departure_date, trips.train, schedules.ordinal.
The departure_date column is the date that this particular trip departed its origin.
The arrival and departure times are the actual times. Get the scheduled times by joining to the schedules table and ORDER BY as stated above.
The comments column can be used for additional information like explaining service disruptions.
If we don’t want to keep historical data for very long, rows for departure_date/train_nbr pairs could be deleted whenever the trip becomes “old news”, for example, when the final stop’s departure_date + arrival_time is more than a year ago…or something.
CREATE TABLE trip_comments ( departure_date DATE NOT NULL FOREIGN KEY REFERENCES trips(departure_date), train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trips(train_nbr), comments VARCHAR(255), PRIMARY KEY (departure_date, train_nbr) };
This could be used if we have something special to say about a whole trip, as opposed to just one stop on a trip, for example, general statements about service disruptions.
This table will probably be mostly empty, but geeks who obsess over third normal form (this author raises his hand) will doubtless approve of its existence. 🌝
CREATE VIEW performance AS SELECT tr.departure_date, tr.train_nbr, sk.ordinal, tr.station_code, st.station_name, st.more_info, st.time_zone, EXTRACT(DAY FROM sk.arrival_time) + 1 AS scheduled_arrival_day, TO_CHAR(sk.arrival_time, 'HH24:MM') AS scheduled_arrival_time, EXTRACT(DAY FROM tr.arrival_time) + 1 AS actual_arrival_day, TO_CHAR(tr.arrival_time, 'HH24:MM') AS actual_arrival_time, EXTRACT(DAY FROM sk.departure_time) + 1 AS scheduled_departure_day, TO_CHAR(sk.departure_time, 'HH24:MM') AS scheduled_departure_time, EXTRACT(DAY FROM tr.departure_time) + 1 AS actual_departure_day, TO_CHAR(tr.departure_time, 'HH24:MM') AS actual_departure_time, tr.comments AS stop_comment, tc.comments AS trip_comment FROM trips tr, schedules sk, stations st WHERE tr.train_nbr = sk.train_nbr AND tr.station_code = sk.station_code AND tr.departure_date BETWEEN sk.effective_date AND sk.discontinue_date AND st.station_code = tr.station_code LEFT OUTER JOIN trip_comments tc ON tc.departure_date = tr.departure_date AND tc.train_nbr = tr.train_nbr;
This is basically the same thing as the timetables view above, but with the actual arrivals and departures, and the stop and trip comments, added. We could restrict on a particular station code for creating on-time performance statistics or guessing on the likelihood of making connections.
All the suggestions and caveats relating to the timetables view apply; and note the outer join on the trip_comments table since a matching row might not exist at all.
CREATE TABLE consists ( departure_date DATE NOT NULL FOREIGN KEY REFERENCES trips(departure_date), train_nbr SMALLINT NOT NULL FOREIGN KEY REFERENCES trips(train_nbr), position SMALLINT NOT NULL, equipment_nbr VARCHAR(8) NOT NULL, line_nbr VARCHAR(4), description VARCHAR(50), PRIMARY KEY (departure_date, train_nbr, position) );Example:
{ DATE '2023-01-21', 302, 0, '300', NULL, 'Siemens ALC-42' } { DATE '2023-01-21', 302, 1, '20160', NULL, 'Amfleet Cafe with Business Class Seating' } { DATE '2023-01-21', 302, 2, '54500', NULL, 'Horizon Coach' } { DATE '2023-01-21', 302, 3, '54501', NULL, 'Horizon Coach' } { DATE '2023-01-21', 302, 4, '54502', NULL, 'Horizon Coach' }
This is an optional table that might be of interest to some railfans. The examples above are totally made up. As I write this, I have no clue how to get this information.
position is just a number monotonically increasing from zero. Position zero is probably the lead engine unit, but it could be a cab car on a train running in push mode.
equipment_nbr uniquely identifies a particular piece of equipment in Amtrak’s roster of motive power or other rolling stock.
line_nbr, if we care, is probably the two LSDs of the train number followed by two digits identifying the particular car and its use in the consist for a particular trip. For example, the “ten car” can be the crew sleeper on some long distance trains. This is also the “car number” that passengers know. For example, train 49, the westbound Lake Shore Limited, might have two revenue sleepers, 4911 and 4912; and passengers in the sleepers are expected to know the number when they order their meals in the diner.
Note that line numbers aren’t unique because they lose the more significant train number digits and so don’t completely identify which train we’re talking about. For example, the Southwest Chief and a Lincoln Service train from Chicago to St. Louis could both have an '0310' car. (Do cars on corridor trains even have line numbers? This passenger has never had the need to be aware of them if they do.)