This section contains general instructions for troubleshooting various issues.
Each section in this manual under “Roles & Services” includes notes on troubleshooting procedures specific to that role, and on how to find log files that can assist with troubleshooting.
- Broker Statuses
- Fault Detection Troubleshooting
- JLOG Error Troubleshooting
- Troubleshooting Alerts
- Broker-Stratcon Connectivity Troubleshooting
PKI Connectivity Troubleshooting
The following roles make use of SSL to communicate:
In each role’s section of the Operations Manual, you can find details on where the keys and certificates are located. Once you have those locations, troubleshooting an SSL connection can proceed.
- If for any reason you are not receiving certificates, either when installing
Circonus or when adding new services or brokers, check the status of the
circonus-ca_processor service on the host in the
ca service role. This service watches the Postgres database for PKI-related tasks, and automatically renews certificates and the CRL in advance of their expiration. Certificates are stored in Postgres as well as on the local filesystem. Restarting the service will cause it to sign any pending certificate signing requests (CSRs) and then begin listening again for new entries.
- Verify that all the necessary keys and certificates exist. These will be
<application>.key. If any are missing, refer to the install manual and run
run-hooper again on this node, optionally with
-m to prevent upgrading any packages.
- Verify that the ca.crt matches what is provided by your CA. To do this, log into the CA host and look at
- Verify that the certificate was signed by the CA by using the following command:
openssl verify -CAfile /path/to/ca.crt /path/to/application.crt
- Verify that the key matches the certificate. If the following two commands
don’t output the same value, there is a mismatch:
openssl x509 -noout -modulus -in /path/to/application.crt | openssl md5
openssl rsa -noout -modulus -in /path/to/application.key | openssl md5
- Verify connectivity with the following command:
openssl s_client -connect host:port -CAfile /path/to/ca.crt -cert /path/to/application.crt -key /path/to/application.key
- If all brokers appear to be disconnected from stratcon, yet everything has
valid certificates, the issue could be the Certificate Revocation List
(CRL). Stratcon uses this to ensure that certificates for decommissioned
brokers can no longer be used to communicate with the core system. The
ca_processor service automatically generates the CRL with a validity period
of 30 days. A weekly cron job on stratcon hosts refreshes the CRL. The
validity of the CRL can be checked with the following command:
openssl crl -in /opt/noit/web/stratcon/pki/ca.crl -noout -nextupdate
- If the nextupdate date is in the past, Stratcon will refuse to connect with any broker. The refresh command is in root’s crontab, and you can use sudo crontab -l to see it. The refresh command may be run manually at any time.
If any of the above commands fail for non-obvious reasons, contact Circonus Support (email@example.com) about how to resolve the issue.
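The key/certificate comparison and the CRL expiry check described above can be scripted so that they return an exit status instead of requiring a visual comparison. The following is a minimal sketch, not a Circonus-provided tool: the function names are hypothetical, the key is assumed to be RSA (as in the commands above), and the date parsing assumes GNU date (the -d option).

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of Circonus.

# Return 0 if the RSA certificate ($1) and key ($2) share the same modulus,
# 1 if they differ, 2 if either file could not be read by openssl.
cert_key_match() {
  cert_mod=$(openssl x509 -noout -modulus -in "$1" 2>/dev/null) || return 2
  key_mod=$(openssl rsa -noout -modulus -in "$2" 2>/dev/null) || return 2
  [ "$cert_mod" = "$key_mod" ]
}

# Return 0 if a "nextUpdate=<date>" line (as printed by
# "openssl crl -noout -nextupdate") names a time in the past.
# $2 optionally overrides "now" as epoch seconds. Assumes GNU date.
crl_line_expired() {
  when=${1#nextUpdate=}
  next_epoch=$(date -u -d "$when" +%s) || return 2
  now_epoch=${2:-$(date -u +%s)}
  [ "$next_epoch" -lt "$now_epoch" ]
}

# Return 0 if the CRL file ($1) is past its nextUpdate time,
# 1 if it is still valid, 2 if it could not be read.
crl_expired() {
  line=$(openssl crl -in "$1" -noout -nextupdate 2>/dev/null) || return 2
  crl_line_expired "$line"
}
```

For example, `crl_expired /opt/noit/web/stratcon/pki/ca.crl && echo "CRL expired, run the refresh command"` reports an expired CRL, and `cert_key_match /path/to/application.crt /path/to/application.key || echo "mismatch or unreadable"` flags a bad key pair.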
In the event that a check is not returning data when you believe it should, the following steps should be taken:
- Verify the running status of the check on the broker by following these steps:
- Navigate to the “Check Details” page on the UI and click the “Extended Details” link in the upper left section of the page. Record the UUID shown there.
- Log onto the broker machine and telnet to port 32322 using this command:
telnet localhost 32322
- Show the status of the check by typing this command, using the UUID from Step 1:
show check <UUID>
- If the check is getting an error, such as a refused connection or a timeout, verify the connectivity of the broker to the machine in question using system tools like telnet, curl, etc.
- If all these steps are showing the check should be working, collect the network traffic to and from the broker for inspection. If possible, you can use a tool like tcpdump or snoop to collect this network traffic.
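The connectivity check from the broker to the monitored machine can be sketched as a small helper for cases where telnet or curl is not installed. This is an illustrative sketch, assuming bash and the coreutils timeout command are available on the broker; the function name, the example host, and the default timeout are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: can_connect is a hypothetical helper, not a Circonus tool.

# Return 0 if a TCP connection to host ($1) port ($2) opens within
# the timeout in seconds ($3, default 5); non-zero otherwise.
# Uses bash's /dev/tcp pseudo-device, so it needs bash but not telnet or nc.
can_connect() {
  host=$1
  port=$2
  secs=${3:-5}
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print a one-line result for the given host and port.
report_connect() {
  if can_connect "$1" "$2" "${3:-5}"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable (refused, filtered, or timed out)"
  fi
}
```

Running `report_connect target.example.com 443` on the broker shows whether the broker can open a TCP connection at all. A successful connection here only rules out network-level problems; it does not confirm that the check itself is configured correctly.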
Repairing Corrupt LevelDB Data Stores
On occasion, a LevelDB database may become corrupted.
You should be able to determine which log is corrupted by looking at the errorlog (usually in /snowth/logs/errorlog). It will tell you what has been corrupted. To fix it, follow the instructions below.
1. Disable snowthd.
Before you start, you will need to disable snowthd with the following command:
sudo systemctl stop circonus-snowth
2a. Correct corrupted text data.
There are two DBs that can become corrupted in the text db - the metrics store (a list of metrics) and the changelog (all of the different text values for a metric).
To correct the metrics store, run the following:
sudo /opt/circonus/sbin/snowthd -u nobody -g nobody \
  -r text/metrics \
  -i <id of snowth node in topology> \
  -c /opt/circonus/etc/snowth.conf
To correct the changelog, run the following:
sudo /opt/circonus/sbin/snowthd -u nobody -g nobody \
  -r text/changelog \
  -i <id of snowth node in topology> \
  -c /opt/circonus/etc/snowth.conf
2b. Correct corrupted histogram data.
For histogram data, the metrics db (a list of all available histogram metrics) or the actual data (which is stored based on the period) can become corrupted.
To fix the metrics database, run the following:
sudo /opt/circonus/sbin/snowthd -u nobody -g nobody \
  -r hist/metrics \
  -i <id of snowth node in topology> \
  -c /opt/circonus/etc/snowth.conf
To fix the actual data, run the following:
sudo /opt/circonus/sbin/snowthd -u nobody -g nobody \
  -r hist/<period> \
  -i <id of snowth node in topology> \
  -c /opt/circonus/etc/snowth.conf
3. Re-enable snowthd.
Once finished, re-enable snowthd with the following command:
sudo systemctl start circonus-snowth
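Because the repair must always be bracketed by stopping and restarting snowthd, it can help to print the full sequence for review before running anything. The sketch below only echoes the commands from the steps above and executes none of them. The function names are hypothetical, and the store names accepted are limited to the ones listed in this section.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands from the steps above; executes nothing.
# repair_cmd and repair_plan are hypothetical helper names.

SNOWTHD=/opt/circonus/sbin/snowthd
SNOWTH_CONF=/opt/circonus/etc/snowth.conf

# Print the repair command for a store ($1) and topology node id ($2).
# Accepts text/metrics, text/changelog, hist/metrics, or hist/<period>.
repair_cmd() {
  store=$1
  node_id=$2
  case "$store" in
    text/metrics|text/changelog|hist/metrics|hist/?*) ;;
    *) echo "unknown store: $store" >&2; return 1 ;;
  esac
  [ -n "$node_id" ] || { echo "missing node id" >&2; return 1; }
  echo "sudo $SNOWTHD -u nobody -g nobody -r $store -i $node_id -c $SNOWTH_CONF"
}

# Print the stop, repair, and start commands in order.
# Validates the arguments first so that no partial plan is printed.
repair_plan() {
  cmd=$(repair_cmd "$1" "$2") || return 1
  echo "sudo systemctl stop circonus-snowth"
  echo "$cmd"
  echo "sudo systemctl start circonus-snowth"
}
```

For example, `repair_plan text/changelog "$NODE_ID"` prints three lines that can be reviewed and then run by hand, where NODE_ID holds the id of the snowth node in the topology.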