Rebuilding IRONdb Nodes
If an IRONdb node or its data is damaged or lost, its data may be rebuilt from replicas elsewhere in the cluster. This process is known as “reconstituting” a node.
Reconstitution requires that at least one replica of every metric stream stored
on the reconstituting node be available. A reconstitute operation cannot
complete if more than W-1 nodes are unavailable, including the node being
reconstituted, where W is the number of write_copies configured for the current
topology. For example, given a cluster of 10 nodes (N=10) with 3 write copies
(W=3), a node may be reconstituted if at least N-(W-1), or 8, other nodes are
available and healthy.
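The arithmetic in the example above can be checked with a short shell sketch. The function name `min_healthy_peers` is hypothetical and is not part of IRONdb; it only evaluates N-(W-1) from values you supply and does not query the cluster.

```shell
#!/bin/sh
# Sketch only: minimum number of other healthy nodes needed before a
# node can be reconstituted, per the N-(W-1) rule described above.
# Nothing here talks to IRONdb; N and W come from the operator.
min_healthy_peers() {
    n="$1"   # total number of nodes in the cluster (N)
    w="$2"   # configured write_copies (W)
    echo $(( n - (w - 1) ))
}

# The example from the text: 10 nodes, 3 write copies.
min_healthy_peers 10 3
```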
As this can be a long-running procedure, a terminal multiplexer such as
screen is recommended to avoid interruption.
- Log into the IRONdb node you wish to reconstitute as root or a privileged user. Make sure the IRONdb package is up to date.
- Note: If the entire old node was replaced (e.g., due to a hardware failure), or
the ZFS pool has been recreated (due to hardware failure or administrative
action), then you should repeat initial installation and then disable the
service. The installer will not interfere with an existing irondb.conf file
but will ensure that all necessary ZFS datasets and node-id subdirectories have been created.
- Note: If reconstituting within the full, on-premise, Circonus Inside product, package updating has been handled automatically by the installer. No manual package installation is required. Please refer to the Circonus Inside Operations Manual for details on how this process differs for Circonus Inside.
- Make note of this node’s topology UUID, found in the imported topology. You
may need to reference this configuration on another node if the node to be
reconstituted is a fresh install. The node UUID will be referred to below as
<node_id>.
- If the IRONdb service is running, stop it.
- Make sure there is no lock file located at
/irondb/logs/snowth.lock. If there is, remove it with the following command:
rm -f /irondb/logs/snowth.lock
- If you repeated initial installation on this node, you may skip to the next step. Otherwise, follow this procedure to clean out any incomplete or damaged data.
- Run the following command to find the base ZFS dataset. This sets a shell
variable, BASE_DATASET, that is used in subsequent commands.
BASE_DATASET=$(zfs list -H -o name /irondb)
- Destroy the existing data using the following commands:
zfs destroy -r $BASE_DATASET/data
zfs destroy -r $BASE_DATASET/text
zfs destroy -r $BASE_DATASET/hist
zfs destroy -r $BASE_DATASET/raw_db
zfs destroy -r $BASE_DATASET/surrogate_db
zfs destroy -r $BASE_DATASET/metric_name_db
zfs destroy -r $BASE_DATASET/nntbs
- Wait for the data to be completely destroyed. To do this, periodically run the following command and wait until the value for all pools reads “0”.
zpool get freeing
- Recreate the dataset structure by running the following commands:
zfs create $BASE_DATASET/data
zfs create $BASE_DATASET/hist
zfs create $BASE_DATASET/text
zfs create -o logbias=throughput $BASE_DATASET/raw_db
zfs create -o logbias=throughput $BASE_DATASET/surrogate_db
zfs create $BASE_DATASET/metric_name_db
zfs create $BASE_DATASET/nntbs
- Run the following commands to make the node-id subdirectories:
mkdir /irondb/hist/<node_id>
mkdir /irondb/text/<node_id>
mkdir /irondb/data/<node_id>
mkdir /irondb/raw_db/<node_id>
mkdir /irondb/surrogate_db/<node_id>
mkdir /irondb/metric_name_db/<node_id>
mkdir /irondb/nntbs/<node_id>
- Make sure that all the directories are owned by the nobody user by running the following:
chown -R nobody:nobody /irondb/
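Two of the cleanup steps above, waiting for the destroyed data to finish freeing and creating the node-id subdirectories, lend themselves to small helpers. The following is a minimal sketch, not an official IRONdb tool. The function names (`all_freed`, `wait_for_freeing`, `make_node_dirs`), the 10-second polling interval, and the optional root-directory argument are all assumptions made for illustration. It also assumes your `zpool get` supports the `-H` and `-o value` scripted-output options, as OpenZFS does.

```shell
#!/bin/sh
# Hypothetical helpers for the cleanup procedure; not part of IRONdb.

# Succeeds only if every line read from stdin is exactly "0".
all_freed() {
    while read -r value; do
        [ "$value" = "0" ] || return 1
    done
    return 0
}

# Poll `zpool get freeing` until every pool reports 0.
# Assumption: -H (no headers) and -o value are available, as in OpenZFS.
wait_for_freeing() {
    while :; do
        values=$(zpool get -H -o value freeing) || return 1
        [ -n "$values" ] || return 1      # no pools reported: treat as an error
        printf '%s\n' "$values" | all_freed && return 0
        sleep 10                          # arbitrary polling interval
    done
}

# Create the node-id subdirectory inside each dataset mount point.
# $1 = node UUID; $2 = root directory (defaults to /irondb).
# Plain mkdir is used, as in the steps above, so a missing parent
# (for example, an unmounted dataset) is reported as an error.
make_node_dirs() {
    node_id="$1"
    root="${2:-/irondb}"
    for ds in hist text data raw_db surrogate_db metric_name_db nntbs; do
        mkdir "$root/$ds/$node_id" || return 1
    done
}
```

In this sketch, `wait_for_freeing` would be called after the `zfs destroy` commands, and `make_node_dirs <node_id>` after the datasets have been recreated. The optional second argument to `make_node_dirs` exists only so the loop can be tried against a scratch directory first.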
- Run IRONdb in reconstitute mode using the following command:
- Wait until the reconstitute operation has fetched 100% of its data from cluster peers. To check the current percentage, request:
<node ip address>:<node port>/stats.json
and look at the “reconstitute” stats.
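One way to pull out the relevant entries from the command line is sketched below. The exact layout of stats.json is not described here, so this sketch makes no assumption about key names: it splits the response on commas and prints any fragment that mentions "reconstitute". The function names are hypothetical, plain `http://` is assumed, and `curl` must be installed.

```shell
#!/bin/sh
# Hypothetical helpers; not part of IRONdb.

# Split JSON text on commas and keep fragments mentioning "reconstitute".
# This is a plain text filter, not a JSON parser.
filter_reconstitute() {
    tr ',' '\n' | grep -i 'reconstitute'
}

# $1 = node IP address, $2 = node port (both supplied by the operator).
# Assumption: the stats endpoint is served over plain HTTP.
reconstitute_stats() {
    curl -sS "http://$1:$2/stats.json" | filter_reconstitute
}
```

Calling `reconstitute_stats <node ip address> <node port>` periodically gives a rough view of progress without opening the full document in a browser.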
Note: There may not be messages appearing on the console while this runs. This is normal. Do not stop the reconstitute.
Progress is saved as the operation runs. If the process stops for any reason, it should resume approximately where it left off.
If the reconstitute is interrupted for any reason, you may resume it with the same command:
Once the reconstituting node has retrieved all of its data, you will see the following on the console: