This week-end project started by browsing the open-data repository of Paris’ public transport network, which contains various APIs to query real-time departures, current disruptions, etc.The data reuse section caught my eye, as it features external projects that use this open data.In particular, the RATP status website provides a really nice interface to visualize historical disruptions on metro, RER/train and tramway lines.
It all depends on how it’s represented on disk though and how the query is executed. Sqlite only supports numbers and strings, and if you keep using a VARCHAR, a read of those rows are going to have materialize a string into memory inside the sqlite library. DuckDB has more types, but if you’re using varchars everywhere, something has to read that string into memory unless you can push down logic into a query that doesn’t actually have to read the actual value, such as one that can use indices.
The best way is to change the representation on disk, such as converting low-cardinality columns like the station into a numeric id. A standard int being four bytes is a lot more efficient than an n-byte string + a header and it can be compared by value.
This is where file formats, like Parquet, shine. They’re oriented more towards parsing by systems. JSON is geared towards human parsing.
It all depends on how it’s represented on disk though and how the query is executed. Sqlite only supports numbers and strings, and if you keep using a
VARCHAR
, a read of those rows are going to have materialize a string into memory inside the sqlite library. DuckDB has more types, but if you’re using varchars everywhere, something has to read that string into memory unless you can push down logic into a query that doesn’t actually have to read the actual value, such as one that can use indices.The best way is to change the representation on disk, such as converting low-cardinality columns like the
station
into a numeric id. A standardint
being four bytes is a lot more efficient than an n-byte string + a header and it can be compared by value.This is where file formats, like Parquet, shine. They’re oriented more towards parsing by systems. JSON is geared towards human parsing.