@guim31: At that time he was testing with 3 years of data because over 1 year with 300 data points you couldn’t really see the difference.
That’s why he wanted to send 10 million data points to get a better view.
@GBoulvin: No, the difference is very significant. At the end of the live, the DB was 1.3 GB for the current implementation vs 33 MB with DuckDB.
DuckDB is extremely efficient at computing aggregated data “live” directly on the raw data. We can run the aggregation process only every hour.
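To give an idea of what "aggregating live on the raw data" can look like, here's a minimal sketch using the official duckdb Node package and a hypothetical t_device_feature_state(device_feature_id, value, created_at) table; the schema and names are assumptions for illustration, not Gladys' actual tables.

```typescript
import duckdb from 'duckdb';

// Hypothetical schema: t_device_feature_state(device_feature_id, value DOUBLE, created_at TIMESTAMP)
const db = new duckdb.Database('./gladys.duckdb');

// Compute hourly min/max/avg directly on the raw states, no pre-aggregated table needed.
db.all(
  `SELECT device_feature_id,
          date_trunc('hour', created_at) AS hour,
          min(value) AS min_value,
          max(value) AS max_value,
          avg(value) AS avg_value,
          count(*)   AS nb_points
   FROM t_device_feature_state
   WHERE created_at >= now() - INTERVAL 30 DAYS
   GROUP BY 1, 2
   ORDER BY 1, 2`,
  (err, rows) => {
    if (err) throw err;
    console.log(rows.slice(0, 5));
  }
);
```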
One limitation in DuckDB, however: at most one client per DB, with no possibility of reading a DB that is already open. Maybe not convenient for development, but not an issue for Gladys.
There are still things to test, however:
Does DuckDB support the same write volume as SQLite? DuckDB isn't the best at writing: it's built to read and analyze large volumes, not necessarily for heavy write workloads. We may need to buffer writes and flush them in periodic batches (e.g., at most one write per second); see the sketch after this list.
Is DuckDB as stable/robust as SQLite? SQLite is one of the most robust programs on earth (their testing methodology is quite impressive, for those interested: How SQLite Is Tested). DuckDB, for its part, only released 1.0 last week. I think we need to run our own real-world tests and give DuckDB a bit of time to fix all the small bugs that will inevitably be discovered after this 1.0 release; there will probably be a 1.0.1. For now, Gladys' stability seems well established, but a premature switch to DuckDB could jeopardize it, so to my mind we need to be very rigorous at this stage. We're not in a hurry: for the end user there will be little difference anyway.
Investigate the connection system: is it better to have one connection per DuckDB client, a dedicated write connection plus a read connection, or a pool of N connections? Test in real conditions with concurrent writes and reads (the sketch below assumes the split write/read option).
Investigate the backup/restore process: can it run without downtime, or only with an interruption? (A possible starting point is sketched below as well.)
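To make the write-batching and connection questions above a bit more concrete, here's a minimal sketch of the idea, assuming the official duckdb Node package and the same hypothetical t_device_feature_state table as above (none of this is Gladys' actual schema or the final design): sensor states are buffered in memory and flushed at most once per second through a single dedicated write connection, while reads go through their own connection.

```typescript
import duckdb from 'duckdb';

// Hypothetical table: t_device_feature_state(device_feature_id, value, created_at)
const db = new duckdb.Database('./gladys.duckdb');
const writeCon = db.connect(); // single connection dedicated to writes
const readCon = db.connect();  // dashboards/aggregations read through their own connection

type State = { deviceFeatureId: string; value: number; createdAt: Date };
let buffer: State[] = [];

// Called on every new sensor value: just buffer it, no disk I/O here.
export function saveState(state: State) {
  buffer.push(state);
}

// Flush the buffer at most once per second with a prepared statement.
// (Real code would also need transactions, error handling and a flush on shutdown.)
setInterval(() => {
  if (buffer.length === 0) return;
  const batch = buffer;
  buffer = [];
  const stmt = writeCon.prepare('INSERT INTO t_device_feature_state VALUES (?, ?, ?)');
  for (const s of batch) {
    stmt.run(s.deviceFeatureId, s.value, s.createdAt.toISOString());
  }
  stmt.finalize();
}, 1000);

// Example read on the other connection:
readCon.all('SELECT count(*) AS nb FROM t_device_feature_state', (err, rows) => {
  if (err) throw err;
  console.log(rows[0]);
});
```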
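For the backup question, one option worth testing is DuckDB's EXPORT DATABASE / IMPORT DATABASE statements, which dump all tables to a directory (for example as Parquet). A hedged sketch with made-up paths; whether it can run while Gladys keeps writing is exactly what remains to be measured:

```typescript
import duckdb from 'duckdb';

const db = new duckdb.Database('./gladys.duckdb');

// Dump every table to a directory as Parquet files (paths are just examples).
db.run("EXPORT DATABASE './backups/duckdb-export' (FORMAT PARQUET)", (err) => {
  if (err) throw err;
  console.log('Backup written');
});

// Restoring would be the reverse, on a fresh database file:
//   IMPORT DATABASE './backups/duckdb-export';
```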
Next, on the development side:
Properly code the entire write/read system (this morning’s live was only a PoC to test, nothing serious)
Implement backup and restoration of the database via Gladys Plus. How to remain backward-compatible with existing backups?
Implement migration of all existing data to DuckDB, then deletion of that data in SQLite. How do we force a VACUUM without cutting Gladys off for too long for existing users? Goal: minimal service interruption (a rough migration sketch follows this list).
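On that migration point, here is a rough sketch of the idea, assuming better-sqlite3 to read the current SQLite file and the same hypothetical table/column names as in the sketches above; migrating in batches keeps memory flat and yields back to the event loop between batches so Gladys can keep responding. It's only an illustration, not the actual migration code.

```typescript
import Database from 'better-sqlite3';
import duckdb from 'duckdb';

// Assumed paths and schema, for illustration only.
const sqlite = new Database('./gladys-production.db', { readonly: true });
const duck = new duckdb.Database('./gladys.duckdb');
const con = duck.connect();

const BATCH_SIZE = 10_000;
let offset = 0;

function migrateNextBatch() {
  // Read one slice of the existing history from SQLite.
  const rows = sqlite
    .prepare(
      'SELECT device_feature_id, value, created_at FROM t_device_feature_state ORDER BY created_at LIMIT ? OFFSET ?'
    )
    .all(BATCH_SIZE, offset) as { device_feature_id: string; value: number; created_at: string }[];

  if (rows.length === 0) {
    console.log(`Migration finished, ${offset} rows copied`);
    return;
  }

  // Copy the slice into DuckDB with a prepared statement.
  const stmt = con.prepare('INSERT INTO t_device_feature_state VALUES (?, ?, ?)');
  for (const r of rows) {
    stmt.run(r.device_feature_id, r.value, r.created_at);
  }
  stmt.finalize();

  offset += rows.length;
  // Real code would wait for the inserts to complete, handle errors, and only
  // delete/VACUUM on the SQLite side once everything has been verified.
  setImmediate(migrateNextBatch);
}

migrateNextBatch();
```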
In short, this is very promising, but there's real work to do. I'm very motivated to do it, but it will take me time.
However, I’ll need you fairly soon to help me test with your data (outside of your production environments, of course).
@Hizo In the CSV there is a lot of redundancy: out of the 50 million rows, there are only 5 distinct sensor UUIDs, each repeated 10 million times.
Any compression algorithm spots this kind of pattern and eliminates most of the weight by keeping each UUID only once, going from 50 million values down to 5. The rest are DOUBLEs (8 bytes) and DATETIMEs, and there may also be redundancy in the generated values.
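As a quick illustration of how much that repetition matters (a toy measurement, unrelated to the actual benchmark): gzip-compressing a column of 1 million UUIDs drawn from only 5 distinct values versus 1 million unique ones gives wildly different sizes.

```typescript
import { gzipSync } from 'node:zlib';
import { randomUUID } from 'node:crypto';

const N = 1_000_000;

// A column with only 5 distinct UUIDs, like the benchmark CSV.
const fiveUuids = Array.from({ length: 5 }, () => randomUUID());
const repetitive = Array.from({ length: N }, (_, i) => fiveUuids[i % 5]).join('\n');

// A column where every UUID is unique.
const unique = Array.from({ length: N }, () => randomUUID()).join('\n');

console.log('raw size (both)    :', repetitive.length, 'bytes');
console.log('gzip, 5 distinct   :', gzipSync(repetitive).length, 'bytes');
console.log('gzip, all distinct :', gzipSync(unique).length, 'bytes');
```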
Really impressive!
On an RPi, what time factor should we expect for the conversion? That's the time during which, I imagine, Gladys would be inaccessible… Should there be a button to run the migration manually if/when the user decides?
I might wait before resizing my partitions then