Type: Bachelor project or Bachelor thesis. You should be fond of train schedules and have a basic understanding of GIS and geometrical operations. Programming language of your choice, but performance should be good enough to handle very big datasets.
Background info: Many transportation companies publish their timetable data either directly as GTFS feeds or in formats that can be converted to GTFS. As soon as you have two GTFS feeds (two sets of timetable data) that cover either the same or adjacent areas, the problem of duplicate trips arises. Consider a schedule published by Deutsche Bahn (DB) containing only trains, and a schedule published by the VAG Freiburg containing buses, trams and the Breisgau-S-Bahn. The S-Bahn is contained in both datasets, but most probably with different IDs, different line names ("BSB" vs "Breisgau-S-Bahn" vs "Zug" vs ...), different station IDs, different station coordinates, and so on. Now consider an even more complicated example, where a train schedule published by DB contains a train from Amsterdam to Zürich. The train is contained in the DB feed from Amsterdam to Basel SBB (where it crosses the Swiss border), but the part in Switzerland is missing. Another dataset, published by the Swiss Federal Railways, contains the same train, but starting only at Basel Bad Bf (the last German station) and ending at Zürich. A third dataset, published by the Nederlandse Spoorwegen, contains the train from Amsterdam to the first station in Germany. If you want to use all three feeds together, several problems appear: the train is represented twice between Amsterdam and the first station in Germany and twice between Basel Bad Bf and Basel SBB, and the information that you can travel through Basel SBB into Switzerland without having to change trains is completely lost.
Goal: Your input will be two or more GTFS feeds, your output will be a single, merged GTFS feed that solves all the problems described above. Specifically, you should analyze the data, think of some equivalence measures for trips (for example, if a train called "ICE 501" arrives in Basel Bad Bf at 15:44 in feed A and a train "ICE501" departs from Basel Bad Bf at 15:49 in feed B, it is most likely the same train) and merge trips / routes that belong to the same vehicle. Another example: if two trains in A and B serve exactly the same stations at exactly the same arrival / departure times, this is also most likely the same train. You should think of some testing mechanism that makes sure that every connection that was possible in feed A and feed B is still possible in the merged feed, that is, that no information was lost. If overlapping data appears in different quality in the two feeds, your tool should also automatically decide which (partial) representation has the better quality (for example, if feed A contains no geographical information on the train route ('shape') but feed B does, use the shape information from feed B). Your tool should be able to handle huge datasets (for example, the entire schedule [trains, buses, trams, ferries etc.] of Germany).
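To make the two equivalence examples above more concrete, here is a minimal sketch in Python. It is only one possible starting point, not a prescribed design. The names `StopTime`, `Trip`, `normalize_name`, `is_same_trip`, `is_continuation` and the `max_dwell` threshold of 15 minutes are all made up for illustration. The sketch assumes that station names have already been matched across feeds (which is a problem of its own) and that times are given as "HH:MM"; real GTFS feeds use "HH:MM:SS" and may exceed 24:00:00. Apart from 15:44 and 15:49, the stop times in the example are invented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StopTime:
    station: str    # station name, assumed to be already matched across feeds
    arrival: int    # minutes after midnight
    departure: int  # minutes after midnight


@dataclass(frozen=True)
class Trip:
    name: str
    stop_times: tuple  # tuple of StopTime, in travel order


def minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes after midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def normalize_name(name: str) -> str:
    """Keep only letters and digits and upper-case them: "ICE 501" -> "ICE501"."""
    return re.sub(r"[^0-9A-Za-z]", "", name).upper()


def is_same_trip(a: Trip, b: Trip) -> bool:
    """Both trips serve exactly the same stations at exactly the same times."""
    return a.stop_times == b.stop_times


def is_continuation(a: Trip, b: Trip, max_dwell: int = 15) -> bool:
    """b plausibly continues a: same normalized name, b starts at the
    station where a ends, and b departs at most max_dwell minutes after
    a arrives there (max_dwell is an arbitrary, hypothetical threshold)."""
    last, first = a.stop_times[-1], b.stop_times[0]
    dwell = first.departure - last.arrival
    return (
        normalize_name(a.name) == normalize_name(b.name)
        and last.station == first.station
        and 0 <= dwell <= max_dwell
    )


# Feed A: "ICE 501" ends in Basel Bad Bf, arriving at 15:44.
trip_a = Trip("ICE 501", (
    StopTime("Freiburg Hbf", minutes("15:10"), minutes("15:12")),
    StopTime("Basel Bad Bf", minutes("15:44"), minutes("15:44")),
))

# Feed B: "ICE501" starts in Basel Bad Bf, departing at 15:49.
trip_b = Trip("ICE501", (
    StopTime("Basel Bad Bf", minutes("15:49"), minutes("15:49")),
    StopTime("Basel SBB", minutes("15:55"), minutes("15:57")),
))

print(is_continuation(trip_a, trip_b))  # True
print(is_same_trip(trip_a, trip_b))     # False
```

A real tool would probably need fuzzier measures than these exact comparisons, for example tolerating small time differences, comparing station coordinates instead of names, and handling trips that overlap on several stops rather than meeting at a single one.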