Monitor memory usage and node behavior for OOM errors. One metric to watch is the memory in use by CockroachDB processes.

CockroachDB attempts to restart nodes after they crash. Nodes that frequently restart following an abrupt process exit may point to an underlying memory issue.

The Node status on the Cluster Overview page indicates whether nodes are online (LIVE) or have crashed (SUSPECT or DEAD).

When deploying on Kubernetes, the kubectl get pods output contains a RESTARTS column that tracks the number of restarts for each CockroachDB pod.

The OPS logging channel will record a node_restart event whenever a node rejoins the cluster after being offline.

A Prometheus alert can notify when a node has restarted more than once in the last 10 minutes.

If you observe nodes frequently restarting, confirm that the crashes are caused by OOM errors.
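As a rough illustration of the Kubernetes restart check, the sketch below reads pod data in the JSON shape produced by kubectl get pods -o json and reports pods whose containers have restarted more than a chosen number of times. The function name pods_with_restarts, the threshold default, and the sample pod names are assumptions made for this example; only the items, metadata.name, and status.containerStatuses[].restartCount fields come from the Kubernetes pod format. A high count shows that a pod restarted, not why, so it does not by itself confirm an OOM error.

```python
import json


def pods_with_restarts(pod_list_json: str, threshold: int = 1) -> dict:
    """Return {pod name: total restarts} for pods restarted more than `threshold` times.

    `pod_list_json` is expected to look like the output of
    `kubectl get pods -o json` (a list object with an "items" array).
    """
    pod_list = json.loads(pod_list_json)
    flagged = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        # containerStatuses can be absent while a pod is still being scheduled.
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts > threshold:
            flagged[name] = restarts
    return flagged


# Hypothetical sample data; the pod names and counts are made up for illustration.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "cockroachdb-0"},
         "status": {"containerStatuses": [{"restartCount": 0}]}},
        {"metadata": {"name": "cockroachdb-1"},
         "status": {"containerStatuses": [{"restartCount": 4}]}},
        {"metadata": {"name": "cockroachdb-2"},
         "status": {"containerStatuses": [{"restartCount": 1}]}},
    ]
})

if __name__ == "__main__":
    for pod_name, count in pods_with_restarts(SAMPLE).items():
        print(f"{pod_name}: {count} restarts")
```

In practice the JSON string would come from running kubectl get pods -o json against the namespace that holds the CockroachDB pods, and the threshold would be tuned to how often restarts are expected in that environment.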