Release Notes
Changes in 1.2.1
2024-06-04
- Avoid metric index corruption by using
pread(2)
in jlog instead ofmmap(2)
. - Deprecate
max_ingest_age
from the graphite module. Require thevalidation
fields instead. - Change Prometheus module to convert
nan
andinf
records tonull
. - Add logging when when the
snowth_lmdb_tool
copy operation successfully completes. - Fix bug where a node could crash if we closed a raw shard for delete, then tried to roll up another shard before the delete ran.
- Fix bug where setting raw shard granularity values above
3w
could cause data to get written with incorrect timestamps during rollups. - Improve various listener error messages.
- Add checks for timeouts in the data journal path where they were missing.
- Improve graphite PUT error messages.
- Fix NNTBS rollup fetch bug where we could return no value when there was valid data to return.
- Fix bug where histogram rollup shards were sometimes not being deleted even though they were past the retention window.
Changes in 1.2.0
2024-03-27
NOTE: This release bumps the metric index version from 4 to 5. Upon restart, new indexes will be built and the old ones will be deleted. This process will use a significant amount of memory while the indexes are being rebuilt. It will also cause the first post-update boot to take longer than usual.
- Update index version from 4 to 5.
- Automatically clean up old index versions on startup to make sure outdated indexes don't clog the disk.
- Fix Ubuntu 20.04 specific bug where nodes could crash when trying to clean up status files when rolling up raw shards.
- Fix issue with level indexes where data was being lost when deleting metrics on levels where the metric has multiple tags.
- Fix issue where level indexes were incorrectly reporting that levels existed when all underlying metrics had been removed.
- Add new API endpoints,
/compact_indexes
and/invalidate_index_cache
, that allow forcing compaction and cache invalidation for specific accounts, respectively. - Fix rollup bug where raw shards could be prematurely deleted if a rollup was aborted due to corruption.
- Fix various potential memory corruption issues.
- Fix issue where jlog journal data could get corrupted.
- libmtev 2.7.1
Changes in 1.1.0
2024-01-02
- Add preliminary support for operating IRONdb clusters with SSL/TLS. This allows securing ingestion, querying, and intra-cluster replication. See TLS Configuration for details. This feature should be considered alpha.
- Fix bug where rollups were being flagged "not in progress" and "not dirty" when attempting to schedule a rollup and the rollup is already running.
- Use activity ranges as part of query cache key. Previously, cached results from queries with a time range could be used to answer queries that had no time range, leading to incorrect results.
- Fix logic bug where rollups were sometimes flagged as still being in progress after they were completed.
- Account index WAL can keep growing without bounds due to a local variable value being squashed early.
- Fix bug where the
reconst_in_progress
file was not being cleaned up after reconstitute operations, which could block rollups and deletes from running. - The
raw/rollup
andhistogram_raw/rollup
API endpoints will no longer block if there is a rollup already running. They will also return sensible error messages. - Raw shard rollups will not be allowed to run unless all previous rollups have run at least once.
- Fix bug where deferred rollups could cause the rollup process to lock up until the node is restarted.
- Add REST endpoint POST/PUT
/histogram/<period>/<check_uuid>/<metric_name>?num_records=<x>
which can be used with a json payload to directly insert histogram shard metric data for repair purposes. - Added configuration file field,
//reconstitute/@max_simultaneous_nodes
, that will cause the reconstituting node to only hit a predetermined number of peer nodes at once. The default if not specified is "all peers". This setting can be used if a reconstitute places too much load on the rest of the cluster, causing degradation of service. - Disallow starting single-shard reconstitutes with merge enabled if the shard exists and is flagged corrupt.
- Improve NNTBS error messages if unable to open a shard.
- PromQL - improve error messages on invalid or unsupported range queries.
- PromQL - fix range nested inside one or more instant functions.
- Include maintenance mode when pulling lists of raw, histogram, or histogram rollup shards.
- Use read-copy-update (RCU) for Graphite level indexes and the surrogate database. It allows more queries to run concurrently without affecting ingestion, and vice versa.
- Defer rollups of raw shards if there is a rollup shard in maintenance that the raw shard would write to.
- Reject live shard reconstitute requests on NNTBS or histogram rollup shards if there is a raw shard rollup in progress that would feed into them.
- Fix bug where the system would report that a live reconstitute was not in progress, even when one was running.
- Allow running single-shard or single-metric reconstitute on non-raw shards, even if the shard extends beyond the current time.
- The reconstitute GUI no longer apppears when doing online reconstitutes.
- Fix iteration bug when reconstituting NNTBS shards.
- Added the
merge_all_nodes
flag to live reconstitute which causes all available and non-blacklisted write copies to send metric data instead of only the "primary" available node. - Added the ability to repair a local database by reconstituting a single metric stream.
- Fix bug where
/fetch
would not proxy if the data for a time period was all in the raw database, but the relevant raw shards were offline.
Changes in 1.0.1
2023-09-06
NOTE: This version updates RocksDB (raw database, histogram shards) from version 6 to 7. It is not possible to revert a node to a previous version once this version has been installed.
- Add a new configuration parameter,
//rest/@default_connect_timeout
, that allows configuring the connect timeout for inter-node communication. This was previously hardcoded to 3 seconds. - Graphite
series
andseries_multi
fetches now return 500 when there are no results and at least one node had an issue returning data. - Graphite
series
andseries_multi
fetches now return 200 with an empty results set on no data rather than a 404. - Fix bug on
/find
proxy calls where activity ranges were being set incorrectly. - Add ability to filter using multiple account IDs in the
snowthsurrogatecontrol
tool by providing the-a
flag multiple times. - Reduce usage of rollup debug log to avoid log spam.
- Upgrade RocksDB from version 6.20.3 to version 7.9.2.
- libmtev 2.5.3
Changes in 1.0.0
2023-07-28
IMPORTANT NOTE: Any node running 0.23.7 or earlier MUST do a surrogate3 migration PRIOR to upgrading to 1.0.0. This is due to removal of support for the previous surrogate database format.
- Prevent ingestion stalls by setting a better eventer mask on socket read errors.
- Fix bug where surrogate deletes running at the same time as level index compactions can cause crashes.
- Improve scalability of lock protecting
all_surrogates
in indexes. - Fix logging of old-data ingestion.
- Don't stop a rollup in progress if new data is received; finish and reroll later.
- Add ability to filter by account ID in the
snowthsurrogatecontrol
utility by using the-a
flag. - Fix full-delete crash on malformed tag query.
- Rewrite Graphite level-index query cache to remove race on lookup.
- Remove surrogate2.
- Some data was hidden post arts compaction, make sure it stays visible.
- Fix bug where if a fetch to the
/raw
endpoint exceeded the raw streaming limit (10,000 by default), the fetch would hang. - Reduce memory usage during extremely large
/raw
fetches. - Fix bug where extremely large double values generated invalid JSON on
/raw
data fetches. - Handle surrogate duplicates on migration from
s2
tos3
. - Require all nodes for
active_count
queries. - Add back-pressure to raw puts, allows the database to shed load by returning HTTP 503.
- libmtev 2.5.2
Older Releases
For older release notes, see Archived Release Notes.