Finer-grained management of device state history


Hello everyone,

I would like to propose an improvement aimed at optimizing how Gladys Assistant manages and stores the history of data received from connected devices. Currently, every state sent by a device for a given feature is recorded in the database, regardless of whether it has changed compared to the previous state. This method, while reliable for data retention, can lead to a rapid accumulation of redundant data, especially for devices that communicate at short intervals.

Current Issue: Today, a device sending its data every 30 seconds has every received data point recorded, even when nothing has changed (86,400 / 30 = 2,880 records per day for a single feature). This can quickly saturate the database with low-value information, especially for data that needs less fine granularity (for example, humidity, or a battery level that changes little).

Proposal: I propose allowing more flexible configuration of the recording granularity for each feature of a device. This could be done in two ways:

  1. Adding a global setting in the system settings that defines a period during which, if the data has not varied, it is not recorded in the database. This period could be adjusted to the user’s needs, or left empty to keep the current behavior (systematic recording). Quite simple to implement.
  2. Adding a specific column to the t_device_feature table to define the recording granularity per feature. This would offer maximum flexibility, allowing the recording frequency to be adjusted to the nature and usefulness of each data point (a sketch follows this list).
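
To make option 2 concrete, here is a minimal sketch of what the write-time check could look like. It assumes a hypothetical keep_history_interval column (in minutes) on t_device_feature; the names and the surrounding API are illustrative, not Gladys’s actual code:

// Illustrative only: decide whether to persist an incoming state.
// `keep_history_interval` is a hypothetical column (minutes); when it is
// null/undefined, we keep today's behavior and record everything.
function shouldSaveState(feature, newValue, now = new Date()) {
  if (!feature.keep_history_interval) {
    return true; // no granularity configured: record every state
  }
  if (newValue !== feature.last_value) {
    return true; // the value changed: always record it
  }
  const minutesSinceLastSave = (now - new Date(feature.last_value_changed)) / 60000;
  return minutesSinceLastSave >= feature.keep_history_interval;
}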

Advantages:

  • Significant reduction in the volume of stored data, by keeping only essential information.
  • Optimization of database performance, by limiting the number of writes.
  • Greater customization for the user, who can adapt the recording behavior according to the specifics of their home automation setup.

Example:
I installed a thermostat only 1 week ago, and wanted to remove it:


It already has 280,000 values even though it only polls every 2 minutes, and most values are identical. The proposal would reduce the amount of data at least 4-fold (down to about 70,000).

@Terdious

Personally I think the best is both solutions: the global setting and the per-feature column!
It would also be nice to be able to set the data-sending frequency of sensors from Gladys, without going through zigbee2mqtt (my system is down at the moment, but I seem to remember that data sending is configurable for some sensors).

Darn, I wanted to vote but couldn’t find where to cast one; I’m raring to go if someone can help me (I’m stuck in useless mode at the moment, it’s driving me crazy! :partying_face:)


Thanks @cce66 for your feedback on the question!

@pierre-gilles, do you have an opinion on the matter? Specifically about which of the proposed solutions to choose?
One? The other? Both? Other(s) :sweat_smile:

Hi @Terdious :slight_smile:

This isn’t the first time I’ve heard about this topic :sweat_smile:

My opinion hasn’t changed much, I find that both solutions complicate the product and are not good solutions:

  • Saving the same value multiple times is still information; it’s not useless.
  • If there’s really a storage issue, then this is just a band-aid masking other problems. It’s better to fix those problems.

Ok, that’s clearly an issue with the Netatmo integration then. The behavior needs to be reviewed. Can you create a specific thread so we can study this particular case? I want all the data on this topic to understand how a device can create 280k values in 1 week.

Read the proposal carefully: it doesn’t require any change compared to today. If a user isn’t interested in this feature, they don’t have to touch anything.
It’s simply about being able to use it when needed or when it’s of interest.

You know I’m a heavy user. And I’m extremely interested in data and history (with @VonOx we had, during and after one of your live streams, a conversation about how to visualize the complete data with, for example, an external tool like ElasticSearch).

  • I’m convinced that for 90% of features this is useless information in the database when the data doesn’t change. Webhooks aside, the point of « aggressive » polling (30 seconds to 5 minutes) is to capture changes, and most devices work by polling, not by webhooks.
  • I don’t think it’s a band-aid to be able to choose which data to save with a bit more granularity. In many cases, data that doesn’t change is only needed at a given moment, as feedback that a sensor is still reporting.

That’s not necessary, there is no problem; as I explained, the poll is every 2 minutes, but with devices having 8 features, plus a full retrieval on restart, many values can end up recorded. That doesn’t change the issue being discussed, though.

For someone who already has 100 devices with at least 3 features each (it’s often more) and an average polling interval of 2 minutes (often less), we’re already at no less than 100 × 3 × 720 = 216,000 states per day. If, like me, you want to keep the data forever… you’re clearly storing data for nothing. ^^

But indeed it would be interesting to get more feedback from other people on the subject!!

If the data doesn’t change, what’s the point of recording these non-events, as long as there’s a report at a user-definable interval showing the link isn’t down?
Otherwise, recording all the data can be useful in case something unexpected comes up… unless a later aggregation reduces the size of the DB by purging those useless records!

Many sensors are battery-powered (CR2032 for most), and…

Hello,
This is off-topic, but given the number of values generated, I think it’s a shame not to be able to display them as hourly or daily charts in the history. I’m therefore taking the opportunity to follow up on the feature request "Retrieving charts and their 24-hour values from previous periods".

@Terdious For me it’s a problem in the Netatmo integration.

Indeed, if the poll fetches a state and records it every 2 minutes even though there isn’t actually a new state, there’s no point in sending it to Gladys. But it’s up to the Netatmo integration to handle that: Gladys can’t tell the difference between a « real state » and a « duplicate state », whereas the Netatmo integration can.

What does the Netatmo API return?

To me, you need to look in the returned data to see if there is a way to identify a state uniquely, and then in your Netatmo code it should be:

if (lastValueNetatmo > lastValueGladys) {
  // Only emit a new state when Netatmo actually has something newer
  // than what Gladys last recorded.
  this.gladys.event.emit(EVENTS.DEVICE.NEW_STATE, {
    device_feature_external_id: feature.external_id,
    state: readValues[feature.category][feature.type](room.open_window),
  });
}

Edit: Okay, I went to check the Netatmo API, and it’s in there!

In each API call you have the « last_message », which I think is what we’re looking for :slight_smile:

Just compare the last_message coming from Netatmo and the last_value_changed on the Gladys side, and presto you’re only recording new values.

This can be done in a few lines of code, it’s transparent to the user, and it addresses the issue exactly; importantly, it doesn’t remove states with the same value at different times: 15°C at 9:00 != 15°C at 10:00.
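
As a hedged sketch of that comparison (assuming last_message is the Unix timestamp, in seconds, returned by the Netatmo API, and last_value_changed is the date Gladys stores for the feature; the helper and variable names are illustrative):

// Illustrative: only emit a state when Netatmo actually produced a new one.
function hasNewState(lastMessage, lastValueChanged) {
  if (!lastValueChanged) {
    return true; // nothing recorded yet in Gladys, keep this state
  }
  // Netatmo timestamps are in seconds; convert before comparing.
  return lastMessage * 1000 > new Date(lastValueChanged).getTime();
}

// In the integration's poll loop (netatmoModule, feature and newValue
// come from the surrounding integration code):
if (hasNewState(netatmoModule.last_message, feature.last_value_changed)) {
  this.gladys.event.emit(EVENTS.DEVICE.NEW_STATE, {
    device_feature_external_id: feature.external_id,
    state: newValue,
  });
}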

What do you think?


@pierre-gilles, I’m sorry but I repeat that this is off-topic. My question applies to all devices, of all technologies; I have devices much worse than Netatmo, and yet in some cases this is necessary. Of course I would prefer recording only on value change, plus a single data point every 10 minutes, for example, to signal that the device is working. But let’s agree: no device does that, except via Gladys polling, and that would only apply to an entire device, not feature by feature. And in that case you still have to choose a duration; that doesn’t change anything about the identical data, « which the user may consider useless ».

Aside regarding Netatmo:
Yes, the Netatmo API transmits all the info needed to handle every use case very well.

This therefore solves no problem, since:

  • The valves provide a real measurement value every 2 minutes (hence the 2-minute polling that was set up)
  • The weather stations every 5 minutes.

And the point is:

  • Yes, it’s interesting to know that it’s still 15°C, but maybe for my use (I mean any user’s) it would be « sufficient », or more appropriate, to keep identical values only every 10 minutes, or even every 30 minutes. It remains important for my scenes and my history, however, that a change be visible every 2 minutes.
  • For the battery, on the other hand (data that arrives at the same time), it can be interesting to know every 2 minutes whether it has changed. But if it has been at 23% for 3 days, I don’t care at all to know that it stayed at that level every 2 minutes for 3 days. What is important for me is to know:
    • that during those 3 days it did not drop dramatically (value change)
    • that during those 3 days the data kept arriving (last_value). At most 1 point per day at 23% would be « nice » (see the illustration just after this list)
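
Purely as an illustration of the per-feature tuning described above (no such setting exists today; the external IDs and intervals are made up):

// Hypothetical per-feature granularity map: how often an UNCHANGED value
// is still worth recording. null would mean "record everything"
// (the current behavior).
const keepUnchangedEverySeconds = {
  'netatmo:valve:temperature': 10 * 60,   // 1 identical point / 10 min
  'netatmo:valve:battery': 24 * 60 * 60,  // 1 identical point / day
  'netatmo:valve:rf_signal': null,        // keep every state
};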

And as you say about your example, today you wouldn’t allow removing the value without the user’s consent (and I agree); some users may be interested in that value.

But yes, indeed for the weather station, I can delay updates to every 5 minutes.

But the question remains the same for certain data whether it’s every 2 minutes, every 5 minutes or every hour.

Back to the original topic concerning all features:


For example here (a valve on Zigbee), I have 2 values whose precise history interests me, but for the battery, for example, the data that doesn’t move pollutes me.
And the proposal wouldn’t prevent me from knowing that this device is still available. I’d be happy to know that it dropped to 96% in 5 days; but over those 5 days I’d have 1,440 useless data points in my DB. It’s roughly the same for signal strength, even if it can still be interesting to know that it varies a lot.

Another very concrete example:



A Sonoff relay/sensor running Tasmota. I need consumption monitoring, especially nowadays, to study consumption over a period, compare it with my solar production, etc. Among these 5 data points, some change very little, some are needed for scenes, some for history; for most of them, though, I have no interest in keeping every identical value. Especially since we’re at a 1-minute poll (very useful for solar-related scenes), we can say that 75% of the data is useless, out of 5 features × 60 minutes × 24 hours = 7,200 data points per day!! And I have this kind of device everywhere in my home automation; given what I’ve bought to reduce or optimize my consumption, I have a real need for them.

What I mean is that use cases are diverse and varied. Yes, with a motion sensor we’re not too bothered. Yes, with a light, or equipment that only sends data on state changes, we’re not bothered.

But I am convinced that with this kind of optimization, Gladys would have a real advantage compared to other products. We can even highlight the environmental side (local storage / backups / huge reduction of database queries).

But for now that’s just my opinion. Meanwhile I keep running recurring DB cleanups to remove this data at regular intervals: cumbersome.


To add my two cents to the discussion: I agree with @Terdious’s analysis that it can be useful both to receive a sensor’s data in Gladys very regularly (to make sure everything is working and that a value truly hasn’t changed, rather than the sensor having stopped sending), and at the same time not to keep every value in the database if you want to control its size over the long term. And to be able to configure this per sensor, since the right frequency varies with the sensor type and what you want to do with it…

That said, this is clearly an advanced setting whose options could overload the interface for the « standard » user, so it could be a Gladys setting that an experienced user enables deliberately, complicating the sensor configuration interface only for those who choose to turn it on.


Absolutely, and I really want to stress that I’m aware of this: the goal is by no means to bloat things. A single option in the global settings at first, unlocking this kind of feature-by-feature configuration afterwards.

And this setting wouldn’t even be seen by more than 50% of users, and would have no impact for them, bearing in mind that a home automation system in an apartment doesn’t have the same needs as one in a house with a yard.

However, this option can be fully explained in the documentation, and above all can be pointed to on the forum whenever a question on the subject comes up.
I also think it’s the kind of option that can be highlighted for « marketing »: yes, Gladys is user-friendly and mindful of personal data security, but it’s also careful to keep data storage light while retaining maximum visibility into continuous consumption.

I’d be curious to have your sources for the data refresh frequency at Netatmo!

From what I read on other forums it’s every 10 minutes:


(Source: Netatmo weather: configure update frequency or get max values - Feature Requests - Home Assistant Community)

They also mention « terms of service », so we’d need to check that the « every 2 minutes » in the current Gladys integration doesn’t violate Netatmo’s ToS.

Back to the original subject, for me this remains an issue clearly linked to integrations that do polling.

I understand the issue with the Netatmo integration, but I don’t entirely agree with the solution.

@Terdious maybe a short call would allow me to better explain what bothers me about your solution? Available in the late morning from 10:45 to 11:00 if you want. Otherwise asynchronously here, but I feel the debate is getting bogged down and taking too much time.

Hi @pierre-gilles,

Thanks for your reply!
Well, I’m lucky to have most of the equipment and, above all, to be able to get my hands dirty ^^
So my sources are simply my own data (I found the 2-minute figure 2 years ago but can’t find it anymore; Netatmo no longer seems to communicate about it, which is what led me to run tests on this timing).

Look at the dates of your articles (2020 / 2022); I had indeed read them!! They’re mainly about the weather station (which I mentioned above as 5 minutes), and back then they were already discussing a possible change from 10 to 5 minutes.

Concretely:

  • Energy - Thermostat RF signal quality


    We can clearly see a different value every 2 minutes

  • Indoor weather station - CO2 since it can vary a lot (10 minutes in your sources - normally 5 minutes):


A change from 866 to 842, then 4 minutes 10 seconds pass and we go to 823 (so we’re right at 5 minutes, since the change could have occurred within the 44 s that elapsed between 866 and 842). We have 3 different data points in 4 minutes 51 s…

  • Weather station - Anemometer RF signal quality (distance makes it fluctuate):


We can see that in 7 minutes the value fluctuated across 5 distinct values (8 values recorded here); during this period I restarted several times. In this case we’re well below 5 minutes (I can’t explain it, but the data shows it).

I’m sorry, but I insist there is no problem with the Netatmo integration in particular; the « issue » (note the quotes; to my mind it’s not a bug ^^) is real in many cases (see below: Zigbee / Sonoff).

If you’re available, I’d be happy to!

A quick message to recap our call @Terdious :slight_smile: It was nice to talk and clarify all of this a bit!

  • I think it’s up to integration developers to put the right mechanisms in place so that their integrations aren’t « too aggressive » in the amount of data they send to Gladys.

In the case of Netatmo, I’m in favor of handling it case by case in the code.

Example: the « battery » feature: at most 1 point per 30 minutes if the value hasn’t changed (a sketch follows below).

The goal is for us to find the right limits, as precise as possible, without compromising the size of the DB or Gladys’s performance.

For the user it’s transparent and works automatically, in keeping with the project’s philosophy.
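
As an illustration of that case-by-case approach (the category names and intervals here are made up, not committed code):

// Illustrative per-category caps inside an integration: when a value is
// unchanged, emit at most one point per interval.
const MAX_UNCHANGED_INTERVAL_MS = {
  battery: 30 * 60 * 1000, // battery: max 1 point / 30 min
  signal: 15 * 60 * 1000,  // signal quality: max 1 point / 15 min
};

function shouldEmit(category, valueChanged, msSinceLastEmit) {
  const cap = MAX_UNCHANGED_INTERVAL_MS[category];
  if (valueChanged || cap === undefined) {
    return true; // changed values, and uncapped categories, always go through
  }
  return msSinceLastEmit >= cap;
}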

  • In the medium term, I’m watching DuckDB very closely: a file-based database (like SQLite) but designed for time-series data. The idea wouldn’t be to replace Gladys’s SQLite (which is perfectly suited to the rest of Gladys), but only to move our time-series data to a dedicated file. From my tests, the reduction in DB size, with performance above what we currently achieve, is quite phenomenal (cf. Gestion plus fine de l'historique des états - #18 par pierre-gilles).

For now I’m on stand-by on this development because they’re still in beta (0.10.0); we’ll see whether they consider themselves « production-ready » or not.


I can’t wait, my DB is 22 GB, I’m going to do some cleanup :sweat_smile:


I changed the retention from « unlimited » to « 6 months » and started a cleanup.
With my database being 22 GB, I noticed the .wal file grew to 15 GB and caused the cleanup task to fail because the disk was full.

Could the cleanup task run in batches, with a VACUUM in between, to avoid this situation?
Should I open a feature request or not @pierre-gilles?
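
For reference, here is a hedged sketch of what that could look like: a batched purge that truncates the WAL between batches instead of running a full VACUUM. The better-sqlite3 driver and the t_device_feature_state table/column names are assumptions, not Gladys’s actual cleanup code:

// Illustrative batched purge: delete old states in chunks and checkpoint
// the WAL after each chunk so it cannot grow unbounded.
const Database = require('better-sqlite3');

function purgeOldStates(dbPath, cutoffIso, batchSize = 10000) {
  const db = new Database(dbPath);
  const deleteBatch = db.prepare(`
    DELETE FROM t_device_feature_state
    WHERE rowid IN (
      SELECT rowid FROM t_device_feature_state
      WHERE created_at < ? LIMIT ?
    )`);
  let deleted;
  do {
    deleted = deleteBatch.run(cutoffIso, batchSize).changes;
    // Flush committed pages into the main DB file and truncate the WAL.
    db.pragma('wal_checkpoint(TRUNCATE)');
  } while (deleted === batchSize);
  db.close();
}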

The cleanup task already runs in batches.

VACUUM is a very, very heavy and blocking task that can take up to 10 hours on slow disks, so it’s not something we can run between every batch.

A message has been split into a new topic: Disk full / help with database cleanup

For your information, I contacted the DuckDB community on their Discord:


I didn’t get any replies from DuckDB; however, I found the answers to my questions in a video of a talk given by DuckDB last month :slight_smile:

v1 would be released in the first half of 2024, so theoretically by July (unless delayed).

According to the talk:

It’ll be focused on stability and robustness
We just want to make sure that this is a DuckDB release that you can deploy on a satellite and fly to Venus without having to worry about patching your DuckDB instance

That’s very reassuring because it’s exactly what we’re looking for for Gladys: software so stable that it can run for 10 years without corruption and without human intervention.

Backward compatibility is necessary for Gladys (to be able to do updates), so it’s very positive that it’s one of their goals for v1.

The talk:

In short, all of this is super positive. For me DuckDB is the future of Gladys if everything they promise is verified :wink: I’ve already run tests, and it’s really really cool.

However, this objective must not block us on the other points we identified earlier in this discussion.
