[help needed] Problems accessing Gladys

Hello,
First big problem: I have a major Gladys crash :frowning:
I don’t know if it’s related, but the consumption calculation finished about 30 minutes after it started (around 12:00), then I launched the cost calculation (about 1 hour), and after refreshing the page I could no longer access Gladys.
I restarted the Docker container: still not working.
I removed the container and the gladys image and relaunched my docker compose: still not working.
I have the logs if needed.

Second problem, which seems to come from the Gladys Plus backup at 3:00 AM.
From what I observed in Proxmox, the RAM and swap of my LXC start growing until they are saturated, and then Gladys stops responding.
A forced reboot of my LXC restores a normal state afterwards.
However, I don’t have any logs, because when it crashes I no longer have access.
And in my backups, I see that the last one is from 5 days ago and I indeed had to restart my LXC in the past few days.

Thanks in advance for your help.

Yes, I’d like the logs!

3.6 GB, if you have a huge number of states, that’s not that big, right?

Ah damn, I’d appreciate information about that.

@Terdious also noticed a memory leak somewhere, their RAM usage also increases abnormally!

I just checked my Gladys instance, and I see the issue too, but on a smaller scale.

So either there’s a bug in the energy tracking implementation that’s causing a memory leak, or DuckDB has a memory leak in the version I installed, since I updated DuckDB for energy tracking.

I’ll investigate!

Hello !!

As I told you on the call, I’m convinced that, for my part at least, I already had the issue before the energy tracking was added. I noticed it at the beginning of November, but I couldn’t say exactly since when!!

I don’t have the issue on my professional instance, which I haven’t updated for about a year and a half.

1 Like

I went ahead and opened a pull request to update DuckDB anyway:

You never know!

1 Like

For the beginning of November, I can think of a few things:

  • Either the MCP server
  • Or the Matter integration

Is there any way we could help you find it?

I’ve also had these issues for a while, but I couldn’t say when they started.
At first I thought it was the backups of my LXC in Proxmox, which also run at 3 AM, but after disabling them last night I saw that it was linked to the Gladys Plus backup at 3 AM.
And because it sometimes worked, I didn’t pay attention.

So I just checked and I started having these problems before mid-August this year.
I don’t remember exactly when I activated Gladys Plus but I think it was in July.

Aside from running tests on your side, not much more than that :slight_smile: We need to find which part of Gladys is responsible.

I’m running tests on my side!

I think there are two distinct issues! Here, we’re talking about a memory leak in Gladys, not really related to Gladys Plus, I think, because I can reproduce the leak during the day even over 30 minutes (without any backup).

We can then look at whether the RAM usage of the backups can be optimized, but I don’t think that’s the problem here :slight_smile:

1 Like

First test: MCP integration?

I paused the MCP integration and restarted Gladys.

(@bertrandda, while I’m at it: for now the MCP integration doesn’t have a stop function, so stopping the service only takes effect after a restart!)

A few minutes later, RAM usage had already doubled, so I don’t think the MCP service is the cause.

1 Like

For info, I ran quite a few tests with ChatGPT. In particular, I restarted the gladys prod container with the following option to create snapshots.
With Gladys stopped, I’m at 4.03 GB of memory in use.
Once Gladys had finished starting, I was at 5.7 GB of memory (15:43).
(screenshot)
Until 16:00 I stayed in that range, with only a few fluctuations, and you can see that memory is freed properly.
At 16:00 the consumption / cost calculation did run, with a slight memory increase that was immediately freed, which confirms it’s not that. Tasks completed.
One minute later, I went to the dashboard (I couldn’t say whether that’s the trigger) and poof, I instantly jumped to 8.82 GB: the "RES" column (resident memory) went from 1748 MB to 4995 MB. Nothing gets freed.
(screenshot)

I’ll paste raw what ChatGPT tells me, and we can see after if you want anything else (Conversation: https://chatgpt.com/share/693d82d6-93e8-800b-8686-ad0b41963f18)

Yes. Here we clearly see what’s happening, and you were right from the start: this is not a JS heap leak, it’s a native memory explosion (outside V8), triggered by a one-off operation.

I’ll give you a precise, factual readout based on the “peak” report you just provided.


:one: What the report says at the time of the peak (numeric evidence)

JS Heap (V8)

"usedMemory": 252011200        ≈ 240 Mo
"externalMemory": 25243016     ≈ 24 Mo

:right_arrow: Heap perfectly stable, almost identical to previous snapshots.
:right_arrow: No JS leak.


Total process memory

"rss": 6198140928        ≈ 5,77 Go
"maxRss": 6374838272     ≈ 5,93 Go

:right_arrow: +4 GB outside the JS heap that appeared abruptly.


Linux side (/proc)

RssAnon:        5089920 kB  ≈ 4.85 GB
RssFile:          59392 kB
Private_Dirty:  5090240 kB

:right_arrow: Private anonymous memory
:right_arrow: Not file-mapped
:right_arrow: Not freed afterwards

:backhand_index_pointing_right: Typical signature:

massive native allocations (malloc / new on the C/C++ side)
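
For anyone who wants to reproduce this kind of readout, here is a minimal sketch (assuming the container is named gladys and the Node process is PID 1 inside it, as in the report above) that puts the V8 heap and the kernel-side resident memory side by side:

# JS-side view: heap and RSS as reported by Node itself
docker exec gladys node -e 'console.log(process.memoryUsage())'

# Kernel-side view: anonymous vs file-backed resident memory of the process
docker exec gladys sh -c 'grep -E "VmRSS|RssAnon|RssFile" /proc/1/status'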


:two: What it is NOT (important)

:cross_mark: Not:

  • V8
  • Garbage Collector
  • Retained JS objects
  • Timers or blocked event loop
  • Classic Node leak

Your snapshots prove it:
the heap stays at ~260 MB while RSS explodes to ~6 GB.


:three: What it is (almost certain at this stage)

Let’s look at the loaded native bindings:

"/src/server/node_modules/duckdb/lib/binding/duckdb.node"
"/src/server/node_modules/sqlite3/lib/binding/node_sqlite3.node"
"/src/server/services/usb/.../bindings-cpp.node"
"/src/server/services/bluetooth/.../bluetooth_hci_socket.node"

Primary suspect #1: DuckDB

This is THE main suspect, for several reasons:

  1. DuckDB allocates heavily in native memory
  2. Uses:
    • vectorized execution
    • columnar buffers
    • intermediate caches
  3. Doesn’t always return memory to the OS immediately
  4. Can grow by several GB on a single query

:backhand_index_pointing_right: And especially:

The peak is instantaneous, high CPU, memory never goes back down
→ exactly the observed behavior.


:four: Why it doesn’t go back down

Very important to understand:

  • DuckDB logically frees its buffers
  • BUT:
    • malloc() keeps memory in the arena
    • the RSS does not go down
    • Node will reuse this memory later, but Linux still sees it as "taken"

:right_arrow: So:

This is not an infinite leak
It’s a stepwise irreversible increase


:five: The likely trigger in your setup

Given your Gladys + Energy Monitoring + history setup:

  • energy aggregation
  • temporal calculations
  • analytical queries
  • possibly:
    • daily consolidation
    • aggregates per phase / device
    • historical recalculation

:right_arrow: A heavy DuckDB query was launched (cron, internal job, UI action).

This matches:

  • CPU working
  • RAM rising in 1 second
  • no visible JS allocations

:six: Absolute confirmation (if you want to be 100% sure)

During the next peak, run immediately:

docker exec gladys bash -lc '
cat /proc/1/smaps | awk "
# flag the mappings that belong to duckdb.node, then print their Rss lines
/^[0-9a-f]+-[0-9a-f]+ / { f = (\$0 ~ /duckdb\.node/) }
f && /^Rss:/ { print }
"
'

If you see hundreds of MB / GB associated with DuckDB → final verdict.


:seven: What to do concretely (realistic solutions)

Option A – Limit DuckDB (recommended)

In Gladys (code):

  • limit query sizes
  • avoid global aggregates on the full history
  • chunk by period (day / week), as sketched below
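
To make the last point concrete, here is a rough sketch of the "chunk by period" idea, run against a hypothetical copy of the database; the path /tmp/gladys.duckdb and the t_device_feature_state(value, created_at) table are assumptions, not necessarily the real Gladys schema:

# One bounded aggregate per day instead of a single query over the whole
# history, so DuckDB never has to materialise the full dataset at once.
for day in 2025-12-11 2025-12-12 2025-12-13; do
  duckdb /tmp/gladys.duckdb "
    SELECT '$day' AS d, SUM(value) AS consumption
    FROM t_device_feature_state
    WHERE created_at >= DATE '$day'
      AND created_at <  DATE '$day' + INTERVAL 1 DAY;
  "
done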

Option B – Force a Node memory cap

:warning: This does not fix the root cause, but prevents host OOM:

-e NODE_OPTIONS="--max-old-space-size=2048 ..."

:right_arrow: Note: this does not limit native memory, but avoids stacking the two.
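
As a sketch of where such a cap would go (docker run form; the image tag and values are only examples, adapt them to your compose file). Since --max-old-space-size only bounds the JS heap, a container-level memory limit is what actually bounds the native side:

# Hypothetical equivalent docker run; --memory caps the whole container,
# DuckDB's native allocations included, which NODE_OPTIONS alone does not.
docker run -d --name gladys \
  -e NODE_OPTIONS="--max-old-space-size=2048" \
  --memory=4g \
  gladysassistant/gladys:v4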


Option C – Scheduled restart (pragmatic)

If the peaks are rare but destructive:

docker restart gladys
  • weekly
  • or after a heavy task

This is very common for DuckDB in production.
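
For the weekly variant, a single line in the host crontab is enough (container name and time are assumptions; pick a slot after the 3:00 AM Gladys Plus backup):

# crontab -e on the host: restart the gladys container every Sunday at 04:30
30 4 * * 0 docker restart gladys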


:eight: Clear and concise conclusion

:check_mark: Your data is excellent
:check_mark: The diagnosis is now solid

:backhand_index_pointing_right: There is no JS memory leak
:backhand_index_pointing_right: RAM is consumed by native allocations (very likely DuckDB)
:backhand_index_pointing_right: Memory doesn’t go back down because it remains in the native arena

If you want, next possible steps:

  • identify which Gladys query triggers the peak
  • or help you patch / work around on the Gladys side (SQL logging, throttle, split)

See you later!

Thanks for your investigations, I arrive at the same conclusions!!

I’ve done the same investigations on my side, and likewise it’s clearly not a JS issue, the heap size is contained. It’s native code that’s causing the problem!

DuckDB: I updated to the latest version, and I still seem to have issues.

You can test by installing gladysassistant/gladys:dev which runs DuckDB v1.4.3.

So:

  • Either it’s still an unresolved DuckDB bug to this day
  • Or it’s something else

I’m going to test, and to confirm it I went to my…

You’re right!!

Screenshot 2025-12-13 at 16.35.08

However, it’s not a bug, it’s a feature :joy:

1 Like

But clearly, 80% is too much.

We could move to a lower percentage, I think…
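
For context, the setting being discussed is DuckDB’s memory_limit, which defaults to roughly 80% of the system’s RAM. A quick way to inspect and change it, sketched with the standalone duckdb CLI on a throwaway in-memory database (not the Gladys code itself; the PR mentioned below presumably does the equivalent through the Node binding):

# Show the current value, then lower it for the session (absolute sizes like '2GB' work)
echo "SELECT current_setting('memory_limit');" | duckdb
echo "SET memory_limit = '2GB'; SELECT current_setting('memory_limit');" | duckdb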

1 Like

Hehe, that’s what I thought, then ^^

Hopefully reducing this number won’t ruin performance in your case, because if you put less in RAM, it’ll use the disk…

There may be some query optimizations to make, even though the queries are extremely simple in this case

Let me know when you have a test image and I’ll run the test to give you feedback. Hopefully the NVMe drives are fast enough.

Are we doing pagination right now?

The PR:

I’ve set it to 30% for now, which still seems quite high to me, but it’s already a big step down from 80%.

We can talk optimization in another thread :slight_smile:

2 Likes

The image is live on gladysassistant/gladys:set-duckdb-memory-limit
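
For anyone who wants to try it, a rough sketch (the compose service name is an assumption; keep a backup of your Gladys data first):

# Pull the test tag and point your compose service at it
docker pull gladysassistant/gladys:set-duckdb-memory-limit
# then change the image: line of the gladys service in docker-compose.yml and run:
docker compose up -d gladys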

I’m testing it on my machine

1 Like

I can only test it tomorrow, I lost track of time…!! Sorry

1 Like