Live Coding: Discovering DuckDB on Thursday, June 20 at 10 AM!

Hi everyone :slight_smile:

I’ve already mentioned it on the forum — DuckDB is a revolutionary OLAP database that has just reached 1.0.

For Gladys, this database is very promising because it matches our needs exactly for the “graphing” part:

  • Extremely high-performing on analytical data
  • File-based database, like SQLite
  • Significant compression to reduce disk space usage

I propose a live workshop on YouTube where I’ll test DuckDB with you, to see together if it can address our challenges.

This will take place tomorrow at 10:00 here:

https://www.youtube.com/live/EtEfyS6uHoE?si=pwiTbXfGvI6395KA

See you tomorrow!

6 Likes

I won’t be available but I think the initiative is great!!! :grin:

1 Like

See you in 20 minutes on YouTube!

https://www.youtube.com/live/EtEfyS6uHoE?si=pwiTbXfGvI6395KA

Thank you all for coming, the replay is available at the same address :slight_smile:

In summary, does DuckDB live up to its promises?
Database size, access speed, etc.?
Verdict and impressions?

The graph’s display speed over a full year is VERY impressive!!! :expressionless:

@GBoulvin I can’t be precise because I only had a quick look, but to display about 1,000 data points over a 1-year period (a sketch of the kind of query involved is below):

  • with SQLite we’re at 6 sec of latency
  • with DuckDB we’re at 0.12 sec

Anyway… it’s fast!!!
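
To give an idea, the read query behind a graph is roughly of this shape (a minimal sketch; the table and column names are illustrative, not necessarily the exact Gladys schema):

-- Fetch one year of states for a single device feature (DuckDB syntax)
SELECT created_at, value
FROM device_feature_state
WHERE device_feature_id = 'some-device-feature-uuid'
  AND created_at >= now() - INTERVAL 1 YEAR
ORDER BY created_at;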

2 Likes

I only saw part of it too, hence my question :wink:

Regarding storage, I thought I heard a ratio of about 1/3… Is that right?
So, if it’s faster and…

1 Like

@guim31: At that time he was testing with 3 years of data because over 1 year with 300 data points you couldn’t really see the difference.
That’s why he wanted to send 10 million data points to get a better view.

@GBoulvin: No, the difference is very significant, at the start of…

3 Likes

My bad!
The difference is such that I thought I had misheard (or that @pierre-gilles had misread :innocent:) and that…

You made me doubt it, so:
[screenshot]

1 Like

Small recap post :slight_smile:

The live recap is very positive:

  • At the end of the live, the DB was 1.3 GB with the current implementation vs 33 MB with DuckDB
  • DuckDB is extremely efficient at computing aggregated data “live”, directly on the raw data; we can run the aggregation process only every hour (see the query sketch after this list)
  • One limitation, however, in DuckDB: at most 1 client per DB. There is no way to read a DB that is already open. Maybe not convenient for development, but not an issue for Gladys.
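
To illustrate the “aggregated data computed live” point, the idea is just a GROUP BY over the raw states, something like this minimal sketch (illustrative table and column names):

-- Hourly min/max/avg computed directly on the raw states
SELECT
    device_feature_id,
    date_trunc('hour', created_at) AS hour,
    min(value) AS min_value,
    max(value) AS max_value,
    avg(value) AS avg_value
FROM device_feature_state
GROUP BY 1, 2;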

There are still things to test, however:

  • Does DuckDB support the same write volume as SQLite? DuckDB isn’t the best at writing; it’s built to read and analyze large volumes, not necessarily for heavy write workloads. We may need to batch writes periodically, e.g. at most 1 write per second (see the first sketch after this list)
  • Is DuckDB as stable/robust as SQLite? SQLite is one of the most robust programs on earth (their testing methodology is quite impressive, for those interested: How SQLite Is Tested). DuckDB, for its part, only released 1.0 last week. I think we need to run our own real-world tests and give DuckDB a bit of time to fix all the small bugs that will be discovered after this 1.0 release; there will probably be a 1.0.1. For now, Gladys’ stability seems “established”, but a premature switch to DuckDB could jeopardize that stability, so to me we need to be very rigorous at this stage. We’re not in a hurry; for the end user there will be little difference anyway :slight_smile:
  • Investigate the connection system: is it better to have one connection per DuckDB client, one write connection plus one read connection, or a pool of N connections? Test in real conditions with concurrent writes/reads.
  • Investigate the backup/restore process: can it run without downtime, or does it require an interruption? (see the EXPORT DATABASE sketch after this list)
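
On the batched-writes point, here is a minimal sketch of what one periodic flush could look like (a single transaction carrying a multi-row insert; table, column names and values are made up):

-- One flush of the in-memory buffer, e.g. once per second at most
BEGIN TRANSACTION;
INSERT INTO device_feature_state (device_feature_id, value, created_at) VALUES
    ('uuid-sensor-1', 21.5, '2024-06-21 10:00:00'),
    ('uuid-sensor-2', 48.0, '2024-06-21 10:00:00'),
    ('uuid-sensor-1', 21.6, '2024-06-21 10:00:01');
COMMIT;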
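
And on the backup/restore point, one option to evaluate is DuckDB’s EXPORT DATABASE / IMPORT DATABASE statements (the directory name is illustrative); whether they can run without a noticeable interruption is exactly what needs testing:

-- Dump the whole database to a directory of Parquet files
EXPORT DATABASE 'backup_dir' (FORMAT PARQUET);
-- Later, restore it into a fresh database
IMPORT DATABASE 'backup_dir';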

Next, on the development side:

  • Properly code the entire write/read system (this morning’s live was only a PoC to test, nothing serious)
  • Implement backup and restoration of the database via Gladys Plus. How to remain backward-compatible with existing backups?
  • Implement the migration of all existing data to DuckDB, then delete that data from SQLite. How do we force a VACUUM without cutting Gladys off for too long for existing users? Goal: minimal service interruption. (A migration sketch follows this list.)
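
For the migration itself, one approach to test is DuckDB’s sqlite extension, which can attach the existing SQLite file and copy the states over in a single statement; the file and table names below are illustrative, not necessarily the real Gladys schema:

-- In DuckDB: attach the existing SQLite database and copy the states
INSTALL sqlite;
LOAD sqlite;
ATTACH 'gladys-production.db' AS old_db (TYPE SQLITE);

CREATE TABLE device_feature_state AS
    SELECT * FROM old_db.t_device_feature_state;

-- Then, on the SQLite side, delete the migrated rows and reclaim the space
-- (this is the VACUUM step that must not block Gladys for too long):
--   DELETE FROM t_device_feature_state;
--   VACUUM;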

In short, this is very promising but there’s real work to do :flexed_biceps: I’m very motivated to do it, but it will take me time.

However, I’ll need you fairly soon to help me test with your data (outside of your production environments, of course).

3 Likes

Another fairly impressive test this morning: I created a CSV of 50 million rows across 5 sensors over the last 3 years with random values.

The file is exactly 3.0 GB:

pierregilles  staff 3.0G Jun 21 10:33 output.csv

I made a small script that imports the CSV using DuckDB’s CSV import API:

CREATE TABLE device_feature_state AS
    SELECT * FROM read_csv('output.csv');

The CSV was ingested in 3.34s:

[ { Count: 50000000n } ]
start: 3.348s

The DuckDB file is 75 MB!

pierregilles  staff 75M Jun 21 10:34 imported_from_csv.duckdb

And it indeed contains the 50 million rows:
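
A count query like this is a minimal way to verify it (sketch, on the table created above):

-- Sanity check: count the ingested rows (expected: 50,000,000)
SELECT COUNT(*) AS count FROM device_feature_state;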

3 Likes

This difference is crazy :smiling_face_with_sunglasses:

1 Like

But how is it possible to go from 3GB to 75MB…?!
This compression is crazy…

@Hizo In the CSV there is a lot of redundancy: out of the 50 million rows there are only 5 distinct sensor UUIDs, each repeated 10 million times.

Any compression algorithm spots this kind of pattern and eliminates most of the size by storing each UUID only once: from 50 million values down to 5 :smiley: The rest are DOUBLEs (8 bytes) and DATETIMEs; there may also be redundancy in the generated values.
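
For the curious, DuckDB can show which compression each column segment actually ended up with via its storage_info pragma; here run on the test table from the CSV import above:

-- Lists every column segment with the compression used (dictionary, RLE, etc.)
PRAGMA storage_info('device_feature_state');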

Yeah, that’s not wrong.
So basically it builds a table of UUIDs and links the two tables?
And could it do the same thing for the values?

Everything is explained on this page:

In our case, this is the part that should have the most impact on this particular file:

1 Like

@GBoulvin @Terdious (or someone else) If you ever want to send me your DBs, I’m starting to run migration tests :slight_smile:

Could you send them to me in a DM? (via whatever link you want)

2 Likes

Thanks @GBoulvin for your database.

For now, it’s 13 GB with a little over 13 million sensor states.

I tested my migration code to DuckDB; it completed in 12 minutes on my machine (I think there’s still room for optimization).

By moving to DuckDB, your SQLite DB is now only 7.7 MB and your DuckDB DB is 68.2 MB, for a total of 75.9 MB!

2 Likes

Really impressive!
On an RPi, what time factor should we expect for the conversion? That’s the time during which I imagine Gladys would be unavailable… Should we provide a button to migrate manually, if/when the user decides?
I might wait before resizing my partitions then :innocent: