I have been continuously logging GPS data for 6 years now, and thinking a lot about how I want to archive the data long-term. Currently I have the data in a MySQL database, which is not a good long-term solution.
In addition to some of the issues documented on the database antipattern page, some of the problems I've encountered are:
- The total data volume gets large after 6 years (there are already a million new rows in the last 6 months), which makes it hard to move the whole dataset around and back it up
- Raw database files on disk are not always portable between versions, so to upgrade the database I need to dump and restore the data from very large SQL text files
- Backing up the data as a SQL dump takes too long to do regularly since it locks the MySQL tables
- Adding a new column or index to the data is a long process that locks the table for potentially an hour (which of course is bad because I'm constantly generating new data)
Instead, I am considering how best to store the data in plain text files on disk. Below are some thoughts on the various options I'm considering. I would love to hear any feedback or other suggestions!
Folder Structure
I record at most one GPS point per second, so each day has a max of 86,400 records. I'll split the data into one file per day, in folders by year and month. UTC will be used to determine the filename date.
...
2014/
2014/04/
2014/04/29.json
2014/04/30.json
2014/05/
2014/05/01.json
2014/05/02.json
...
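As a minimal sketch of the naming rule above, the following maps a timestamp to its year/month/day file path after converting it to UTC. The function name `archive_path` is hypothetical and only for illustration; the timestamp format is the `+0000` form used in the sample records.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def archive_path(ts: datetime) -> PurePosixPath:
    """Return the YYYY/MM/DD.json path for a timezone-aware timestamp.

    The date is always taken in UTC, so the same point lands in the same
    file regardless of the timezone it was recorded in.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    return PurePosixPath(f"{utc:%Y}/{utc:%m}/{utc:%d}.json")


# A sample timestamp in the same format as the records in this article.
sample = datetime.strptime("2011-09-19T00:02:07+0000", "%Y-%m-%dT%H:%M:%S%z")
print(archive_path(sample))  # 2011/09/19.json
```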
Since each file holds only the data from one day, each has a max size of 18-20 MB (see below for size estimates). This is somewhat large, but not unwieldy for processing since it easily fits in RAM while reading, and can even be opened by most good text editors if needed.
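The size range can be checked with simple arithmetic: multiply the per-record sizes quoted for each format below (206 bytes for the JSON options, 232 bytes for YAML) by the 86,400 records a full day can hold. This sketch assumes decimal megabytes (1 MB = 1,000,000 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 records at one point per second

# Per-record sizes as quoted in the format comparisons in this article.
BYTES_PER_RECORD = {"GeoJSON": 206, "GeoYAML": 232}


def max_day_size_mb(bytes_per_record: int) -> float:
    """Worst-case size of one day's file, in decimal megabytes."""
    return bytes_per_record * SECONDS_PER_DAY / 1_000_000


for name, size in BYTES_PER_RECORD.items():
    print(f"{name}: {max_day_size_mb(size):.1f} MB")
# GeoJSON: 17.8 MB
# GeoYAML: 20.0 MB
```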
Sharding the data into individual files means backing it up with a tool like rsync becomes very efficient, since rsync can skip unchanged files entirely.
Because the data is sharded, reading a day or week of data means only accessing a limited subset of the dataset. For example, accessing the data for May 1 Pacific time will mean opening both the 2014/05/01.json and 2014/05/02.json files, since Pacific time is behind UTC and the local day runs past midnight UTC.
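To make the lookup concrete, here is a rough sketch that lists the UTC-dated files overlapping one local calendar day. The helper name `files_for_local_day` is hypothetical, and it takes a fixed UTC offset as a simplifying assumption, so it ignores daylight-saving transitions within the day.

```python
from datetime import date, datetime, time, timedelta, timezone


def files_for_local_day(day: date, utc_offset_hours: float) -> list[str]:
    """List the YYYY/MM/DD.json files that cover one local calendar day.

    Assumes a fixed UTC offset for the whole day (an illustration-only
    simplification; a real implementation would use a timezone database).
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    # First and last second of the local day, expressed in UTC.
    start = datetime.combine(day, time.min, tzinfo=tz).astimezone(timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)

    paths = []
    current = start.date()
    while current <= end.date():
        paths.append(f"{current:%Y/%m/%d}.json")
        current += timedelta(days=1)
    return paths


# May 1, 2014 at UTC-7 (Pacific Daylight Time).
print(files_for_local_day(date(2014, 5, 1), -7))
```

A day viewed in UTC itself touches exactly one file; any other offset touches two.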
GeoJSON Feature Collection
Pros
- 206 bytes per record
- Can be loaded into any GeoJSON viewer directly to display
Cons
- Appending requires parsing - must parse the existing data to add a new record
- An index would have to reference each record by byte ranges
- Reordering data would need to be done programmatically since it is hard to visually inspect
{
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "properties":{
        "date":"2011-09-19T00:02:07+0000",
        "speed":1,
        "accuracy":8,
        "altitude":8,
        "heading":0,
        "battery":90
      },
      "geometry":{
        "type":"Point",
        "coordinates":[ -122.64768183231, 45.512098073959 ]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "date":"2011-09-19T00:02:10+0000",
        "speed":1,
        "accuracy":6,
        "altitude":11,
        "heading":0,
        "battery":90
      },
      "geometry":{
        "type":"Point",
        "coordinates":[ -122.6476174593, 45.512092709541 ]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "date":"2011-09-19T00:02:10+0000",
        "speed":0,
        "accuracy":1000,
        "altitude":0,
        "heading":0,
        "battery":90
      },
      "geometry":{
        "type":"Point",
        "coordinates":[ -122.66402777778, 45.517569444444 ]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "date":"2011-09-19T00:02:12+0000",
        "speed":1,
        "accuracy":6,
        "altitude":9,
        "heading":0,
        "battery":90
      },
      "geometry":{
        "type":"Point",
        "coordinates":[ -122.64757454395, 45.512087345123 ]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "date":"2011-09-19T00:02:14+0000",
        "speed":1,
        "accuracy":4,
        "altitude":8,
        "heading":0,
        "battery":90
      },
      "geometry":{
        "type":"Point",
        "coordinates":[ -122.64753699303, 45.512076616287 ]
      }
    }
  ]
}
(Newlines and spacing are for illustration purposes only; they would not be included in the actual data.)
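The "appending requires parsing" drawback can be illustrated with a short sketch: to add one point, the whole day's file has to be read, decoded, modified, and written back. The function name `append_to_collection` is hypothetical, and the sketch assumes the file already contains a valid FeatureCollection.

```python
import json


def append_to_collection(path: str, feature: dict) -> None:
    """Add one Feature to a FeatureCollection file.

    The entire file is parsed and rewritten, so the cost of adding a
    single record grows with the number of records already stored.
    """
    with open(path, "r", encoding="utf-8") as f:
        collection = json.load(f)

    collection["features"].append(feature)

    # Write compactly, without the illustrative newlines and spacing.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(collection, f, separators=(",", ":"))
```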
Individual rows of GeoJSON Features
- Hand-modifying data is not as easy as the YAML option, but easier than the FeatureCollection option
Pros
- 206 bytes per record
- Append without parsing - can add data to the end of the file without parsing the rest
- An index could reference each record by line number
- Possible to reorder by hand since each record is on its own line
Cons
- Must parse each line and add to a GeoJSON Feature Collection in order to display
{"type":"Feature","properties":{"date":"2011-09-19T00:02:07+0000","speed":1,"accuracy":8,"altitude":8,"heading":0,"battery":90},"geometry":{"type":"Point","coordinates":[-122.64768183231,45.512098073959]}}
{"type":"Feature","properties":{"date":"2011-09-19T00:02:10+0000","speed":1,"accuracy":6,"altitude":11,"heading":0,"battery":90},"geometry":{"type":"Point","coordinates":[-122.6476174593,45.512092709541]}}
{"type":"Feature","properties":{"date":"2011-09-19T00:02:10+0000","speed":0,"accuracy":1000,"altitude":0,"heading":0,"battery":90},"geometry":{"type":"Point","coordinates":[-122.66402777778,45.517569444444]}}
{"type":"Feature","properties":{"date":"2011-09-19T00:02:12+0000","speed":1,"accuracy":6,"altitude":9,"heading":0,"battery":90},"geometry":{"type":"Point","coordinates":[-122.64757454395,45.512087345123]}}
{"type":"Feature","properties":{"date":"2011-09-19T00:02:14+0000","speed":1,"accuracy":4,"altitude":8,"heading":0,"battery":90},"geometry":{"type":"Point","coordinates":[-122.64753699303,45.512076616287]}}
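Both sides of this trade-off fit in a few lines. The sketch below appends a record by writing a single line to the end of the file, and separately wraps all lines in a FeatureCollection when the data needs to be displayed. The function names `append_feature` and `load_as_collection` are hypothetical, chosen only for illustration.

```python
import json


def append_feature(path: str, feature: dict) -> None:
    """Append one Feature as a single line; existing data is never read."""
    line = json.dumps(feature, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def load_as_collection(path: str) -> dict:
    """Parse every line and wrap the Features in a FeatureCollection."""
    with open(path, encoding="utf-8") as f:
        # Blank lines are skipped so a trailing newline does no harm.
        features = [json.loads(line) for line in f if line.strip()]
    return {"type": "FeatureCollection", "features": features}
```

Because each record occupies exactly one line, the position of a Feature in the returned list is its line number minus one, which is what makes a line-number index workable.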
GeoYAML
Pros
- Append without parsing - can add data to the end of the file without parsing the rest
- An index could reference each record by start/end line numbers
- Easy for a human to visually inspect the data and add/modify properties
- Possible to reorder by hand
Cons
- 232 bytes per record (slightly more than the JSON version because newlines are required)
- Requires parsing before displaying - must parse the YAML and convert back to GeoJSON in order to display
---
type: FeatureCollection
features:
- type: Feature
  properties:
    date: '2011-09-19T00:02:07+0000'
    speed: 1
    accuracy: 8
    altitude: 8
    heading: 0
    battery: 90
  geometry:
    type: Point
    coordinates:
    - -122.64768183231
    - 45.512098073959
- type: Feature
  properties:
    date: '2011-09-19T00:02:10+0000'
    speed: 1
    accuracy: 6
    altitude: 11
    heading: 0
    battery: 90
  geometry:
    type: Point
    coordinates:
    - -122.6476174593
    - 45.512092709541
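An index by start/end line numbers could be built without a YAML parser, as in this rough sketch. It assumes each record begins with a `- type: Feature` line at column zero and runs until the next such line; that layout is an assumption about how the file is written, and the function name `index_records` is hypothetical.

```python
def index_records(lines: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) line numbers for each record, 1-based inclusive.

    Assumes every record starts with "- type: Feature" at column zero
    (an assumed layout, not something YAML itself guarantees).
    """
    starts = [
        number
        for number, line in enumerate(lines, start=1)
        if line.startswith("- type: Feature")
    ]
    # Each record ends on the line before the next one starts;
    # the last record runs to the end of the file.
    ends = [start - 1 for start in starts[1:]] + [len(lines)]
    return list(zip(starts, ends))


# Shortened records, for illustration only.
sample = """\
---
type: FeatureCollection
features:
- type: Feature
  properties:
    speed: 1
- type: Feature
  properties:
    speed: 0
"""
print(index_records(sample.splitlines()))  # [(4, 6), (7, 9)]
```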
* http://aaronparecki.com/articles/2014/06/01/1/long-term-archiving-of-gps-logs
* http://indiewebcamp.com/p3k#Publishing_Other_Content ...