Live Coding: Discovering DuckDB on Thursday, June 20 at 10 AM!

Hi everyone :slight_smile:

I’ve already mentioned it on the forum — DuckDB is a revolutionary OLAP database that has just reached 1.0.

For Gladys, this database is very promising because it matches our needs exactly for the “graphing” part:

  • Extremely high-performing on analytical data
  • File-based database, like SQLite
  • Significant compression to reduce disk space usage

I propose a live workshop on YouTube where I’ll test DuckDB with you, to see together if it can address our challenges.

This will take place tomorrow at 10:00 here:

https://www.youtube.com/live/EtEfyS6uHoE?si=pwiTbXfGvI6395KA

See you tomorrow!

6 Likes

I won’t be available but I think the initiative is great!!! :grin:

1 Like

See you in 20 minutes on YouTube!

https://www.youtube.com/live/EtEfyS6uHoE?si=pwiTbXfGvI6395KA

Thank you all for coming, the replay is available at the same address :slight_smile:

In summary, does DuckDB live up to its promises?
Database size, access speed, etc.?
Verdict and impressions?

The graph’s display speed over a full year is VERY impressive!!! :expressionless:

@GBoulvin I can’t be precise because I only had a quick look, but to display about 1,000 data points over a 1-year period (a sketch of the kind of query involved is below):

  • with SQLite we’re at 6 sec of latency
  • with DuckDB we’re at 0.12 sec

Anyway… it’s fast!!!
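
To give an idea, the read query behind a graph is roughly of this shape (a minimal sketch; the table and column names are illustrative, not necessarily the exact Gladys schema):

-- Fetch one year of states for a single device feature (DuckDB syntax)
SELECT created_at, value
FROM device_feature_state
WHERE device_feature_id = 'some-device-feature-uuid'
  AND created_at >= now() - INTERVAL 1 YEAR
ORDER BY created_at;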

2 Likes

I only saw part of it too, hence my question :wink:

Regarding storage, I thought I heard a ratio of about 1/3… Is that right?
So, if it’s faster and…

1 Like

@guim31: At that time he was testing with 3 years of data because over 1 year with 300 data points you couldn’t really see the difference.
That’s why he wanted to send 10 million data points to get a better view.

@GBoulvin: No, the difference is very significant, at the start of…

3 Likes

My bad!
The difference is such that I thought I had misheard (or that @pierre-gilles had misread :innocent:) and that…

You made me doubt it, so:
[screenshot]

1 Like

Small recap post :slight_smile:

The live recap is very positive:

  • At the end of the live, the DB was 1.3 GB with the current implementation vs 33 MB with DuckDB
  • DuckDB is extremely efficient at computing aggregated data “live”, directly on the raw data; we can run the aggregation process only every hour (see the query sketch after this list)
  • One limitation, however, in DuckDB: at most 1 client per DB. There is no way to read a DB that is already open. Maybe not convenient for development, but not an issue for Gladys.
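
To illustrate the “aggregated data computed live” point, the idea is just a GROUP BY over the raw states, something like this minimal sketch (illustrative table and column names):

-- Hourly min/max/avg computed directly on the raw states
SELECT
    device_feature_id,
    date_trunc('hour', created_at) AS hour,
    min(value) AS min_value,
    max(value) AS max_value,
    avg(value) AS avg_value
FROM device_feature_state
GROUP BY 1, 2;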

There are still things to test, however:

  • Does DuckDB support the same write volume as SQLite? DuckDB isn’t the best at writing; it’s built to read and analyze large volumes, not necessarily for heavy write workloads. We may need to batch writes periodically, e.g. at most 1 write per second (see the first sketch after this list)
  • Is DuckDB as stable/robust as SQLite? SQLite is one of the most robust programs on earth (their testing methodology is quite impressive, for those interested: How SQLite Is Tested). DuckDB, for its part, only released 1.0 last week. I think we need to run our own real-world tests and give DuckDB a bit of time to fix all the small bugs that will be discovered after this 1.0 release; there will probably be a 1.0.1. For now, Gladys’ stability seems “established”, but a premature switch to DuckDB could jeopardize that stability, so to me we need to be very rigorous at this stage. We’re not in a hurry; for the end user there will be little difference anyway :slight_smile:
  • Investigate the connection system: is it better to have one connection per DuckDB client, one write connection plus one read connection, or a pool of N connections? Test in real conditions with concurrent writes/reads.
  • Investigate the backup/restore process: can it run without downtime, or does it require an interruption? (see the EXPORT DATABASE sketch after this list)
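
On the batched-writes point, here is a minimal sketch of what one periodic flush could look like (a single transaction carrying a multi-row insert; table, column names and values are made up):

-- One flush of the in-memory buffer, e.g. once per second at most
BEGIN TRANSACTION;
INSERT INTO device_feature_state (device_feature_id, value, created_at) VALUES
    ('uuid-sensor-1', 21.5, '2024-06-21 10:00:00'),
    ('uuid-sensor-2', 48.0, '2024-06-21 10:00:00'),
    ('uuid-sensor-1', 21.6, '2024-06-21 10:00:01');
COMMIT;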
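
And on the backup/restore point, one option to evaluate is DuckDB’s EXPORT DATABASE / IMPORT DATABASE statements (the directory name is illustrative); whether they can run without a noticeable interruption is exactly what needs testing:

-- Dump the whole database to a directory of Parquet files
EXPORT DATABASE 'backup_dir' (FORMAT PARQUET);
-- Later, restore it into a fresh database
IMPORT DATABASE 'backup_dir';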

Next, on the development side:

  • Properly code the entire write/read system (this morning’s live was only a PoC to test, nothing serious)
  • Implement backup and restoration of the database via Gladys Plus. How to remain backward-compatible with existing backups?
  • Implement the migration of all existing data to DuckDB, then delete that data from SQLite. How do we force a VACUUM without cutting Gladys off for too long for existing users? Goal: minimal service interruption. (A migration sketch follows this list.)
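
For the migration itself, one approach to test is DuckDB’s sqlite extension, which can attach the existing SQLite file and copy the states over in a single statement; the file and table names below are illustrative, not necessarily the real Gladys schema:

-- In DuckDB: attach the existing SQLite database and copy the states
INSTALL sqlite;
LOAD sqlite;
ATTACH 'gladys-production.db' AS old_db (TYPE SQLITE);

CREATE TABLE device_feature_state AS
    SELECT * FROM old_db.t_device_feature_state;

-- Then, on the SQLite side, delete the migrated rows and reclaim the space
-- (this is the VACUUM step that must not block Gladys for too long):
--   DELETE FROM t_device_feature_state;
--   VACUUM;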

In short, this is very promising but there’s real work to do :flexed_biceps: I’m very motivated to do it, but it will take me time.

However, I’ll need you fairly soon to help me test with your data (outside of your production environments, of course).

3 Likes

Another fairly impressive test this morning: I created a CSV of 50 million rows across 5 sensors over the last 3 years with random values.

The file is exactly 3.0 GB:

pierregilles  staff 3.0G Jun 21 10:33 output.csv

I made a small script that imports the CSV using DuckDB’s CSV import API:

CREATE TABLE device_feature_state AS
    SELECT * FROM read_csv('output.csv');

The CSV was ingested in 3.34s:

[ { Count: 50000000n } ]
start: 3.348s

The DuckDB file is 75 MB!

pierregilles  staff 75M Jun 21 10:34 imported_from_csv.duckdb

And it indeed contains the 50 million rows:
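
A count query like this is a minimal way to verify it (sketch, on the table created above):

-- Sanity check: count the ingested rows (expected: 50,000,000)
SELECT COUNT(*) AS count FROM device_feature_state;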

3 Likes

This difference is crazy :smiling_face_with_sunglasses:

1 Like

But how is it possible to go from 3GB to 75MB…?!
This compression is crazy…

@Hizo In the CSV there is a lot of redundancy: out of the 50 million rows there are only 5 distinct sensor UUIDs, each repeated 10 million times.

Any compression algorithm spots this kind of pattern and eliminates most of the size by storing each UUID only once: from 50 million values down to 5 :smiley: The rest are DOUBLEs (8 bytes) and DATETIMEs; there may also be redundancy in the generated values.
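
For the curious, DuckDB can show which compression each column segment actually ended up with via its storage_info pragma; here run on the test table from the CSV import above:

-- Lists every column segment with the compression used (dictionary, RLE, etc.)
PRAGMA storage_info('device_feature_state');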

Yeah, that’s not wrong.
So basically it builds a table of UUIDs and links the two tables?
And could it do the same thing for the values?

Everything is explained on this page:

In our case, this is the part that should have the most impact on this particular file:

1 Like

@GBoulvin @Terdious (or someone else) If you ever want to send me your DBs, I’m starting to run migration tests :slight_smile:

Could you send them to me in a DM? (via whatever link you want)

2 Likes

Thanks @GBoulvin for your database.

For now, it’s 13 GB with a little over 13 million sensor states.

I tested my migration code to DuckDB; it completed in 12 minutes on my machine (I think there’s still room for optimization).

By moving to DuckDB, your SQLite DB is now only 7.7 MB and your DuckDB DB is 68.2 MB, for a total of 75.9 MB!

2 Likes

Really impressive!
On an RPi, what time factor should we expect for the conversion? That’s the time during which I imagine Gladys would be unavailable… Should we provide a button to migrate manually, if/when the user decides?
I might wait before resizing my partitions then :innocent: