Normal disk graph

2/18/2023

Monitor memory usage and node behavior for OOM errors. One metric to watch is the memory in use by CockroachDB processes.

CockroachDB attempts to restart nodes after they crash. Nodes that frequently restart following an abrupt process exit may point to an underlying memory issue. You can track restarts in several ways:

The Node status on the Cluster Overview page indicates whether nodes are online (LIVE) or have crashed (SUSPECT or DEAD).
When deploying on Kubernetes, the kubectl get pods output contains a RESTARTS column that tracks the number of restarts for each CockroachDB pod.
The OPS logging channel will record a node_restart event whenever a node rejoins the cluster after being offline.
A Prometheus alert can notify when a node has restarted more than once in the last 10 minutes.

If you observe nodes frequently restarting, confirm that the crashes are caused by OOM errors.
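The "restarted more than once in the last 10 minutes" rule can be sketched as a sliding-window count. This is a minimal illustration in Python, not the Prometheus alert itself. The function name restarted_too_often and the idea of feeding it a list of restart timestamps (for example, collected from node_restart log events) are assumptions made for the example.

```python
from datetime import datetime, timedelta


def restarted_too_often(restart_times, now, window=timedelta(minutes=10), max_restarts=1):
    """Return True if more than `max_restarts` restarts fall inside the trailing window.

    `restart_times` is a hypothetical list of datetimes, one per observed restart.
    """
    cutoff = now - window
    # Keep only restarts that happened inside the trailing window.
    recent = [t for t in restart_times if cutoff < t <= now]
    return len(recent) > max_restarts


# Example: two restarts in the last 10 minutes and one older restart.
now = datetime(2023, 2, 18, 12, 0)
restarts = [now - timedelta(minutes=m) for m in (2, 7, 45)]
print(restarted_too_often(restarts, now))  # prints True
```

A single restart inside the window does not trigger the check. This matches the "more than once" wording above.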

If the cluster becomes unavailable, the DB Console and Cluster API will also become unavailable. You can continue to monitor the cluster via the Prometheus endpoint and logs.

CockroachDB is resilient to node crashes. However, frequent node restarts caused by out-of-memory (OOM) crashes can impact cluster stability and performance.

Provision enough memory and allocate an appropriate portion for data caching:

Provision at least 4 GiB of RAM per vCPU.
For production deployments, set --cache to 25% or higher.
Avoid setting --cache and --max-sql-memory to a combined value of more than 75% of a machine's total RAM. Doing so increases the risk of memory-related failures. Over-allocating memory on production machines can lead to unexpected performance issues when pages have to be read back into memory. For more details, see the Production Checklist.

See additional memory recommendations in the Production Checklist.
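The memory guidelines above amount to three simple arithmetic checks. The Python sketch below shows one way to express them. The function check_memory_plan and its parameters are hypothetical names for this example. The sketch assumes --cache and --max-sql-memory are given as whole-number percentages of total RAM.

```python
def check_memory_plan(vcpus, ram_gib, cache_pct, max_sql_memory_pct):
    """Return a list of guideline violations for one machine (empty if none).

    cache_pct and max_sql_memory_pct are assumed to be integer percentages
    of the machine's total RAM, mirroring --cache and --max-sql-memory.
    """
    problems = []
    if ram_gib < 4 * vcpus:
        problems.append("less than 4 GiB of RAM per vCPU")
    if cache_pct < 25:
        problems.append("--cache is below 25%")
    if cache_pct + max_sql_memory_pct > 75:
        problems.append("--cache plus --max-sql-memory exceeds 75% of RAM")
    return problems


# Example: 8 vCPUs, 32 GiB of RAM, 35% cache, 35% SQL memory.
print(check_memory_plan(8, 32, 35, 35))  # prints []
```

Using integer percentages avoids floating-point rounding at the 75% boundary.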
In our sizing and production guidance, 1 vCPU is considered equivalent to 1 core in the underlying hardware platform.

Issues with CPU most commonly arise when there is insufficient CPU to support the scale of the workload.

Provision enough CPU to support your operational and workload concurrency requirements:

Each node should have at least 4 vCPUs.
Use larger VMs to handle temporary workload spikes and processing hot spots.
Use connection pooling to manage workload concurrency. The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount.

See additional CPU recommendations in the Production Checklist.

Monitor possible signs of CPU starvation:

Service latency: the time between when the cluster receives a query and finishes executing the query.
CPU usage: the CPU consumption by the CockroachDB node process.
Workload concurrency: the number of SQL statements being executed on the cluster at the same time.

Degradation in SQL response time is the most common symptom of CPU starvation. It can also be a symptom of insufficient disk I/O.

The Service Latency: SQL Statements, 99th percentile and Service Latency: SQL Statements, 90th percentile graphs on the SQL dashboard show the time in nanoseconds between when the cluster receives a query and finishes executing the query. This time does not include returning results to the client. If latencies are consistently high, check CPU usage and workload concurrency.

Compaction on the storage layer uses CPU to run concurrent worker threads. The CPU Percent graph on the Hardware and Overload dashboards shows the CPU consumption by the CockroachDB process, and excludes other processes on the node.

Expected values for a healthy cluster: CPU utilized by CockroachDB should not persistently exceed 80%. Because this metric does not reflect system CPU usage, values above 80% suggest that actual CPU utilization is nearing 100%.

If CPU usage is high, check whether workload concurrency is exceeding CPU resources. The number of concurrent active SQL statements should be proportionate to your provisioned CPU.

The SQL Statements graph on the Overview and SQL dashboards shows the 10-second average of SELECT, UPDATE, INSERT, and DELETE statements being executed per second on the cluster or node. The latest QPS value for the cluster is also displayed with the Queries per second counter on the Metrics page.

Expected values for a healthy cluster: At any time, the total number of actively executing SQL statements should not exceed 4 times the number of vCPUs in the cluster. For more details, see Sizing connection pools.

If workload concurrency exceeds CPU resources, you will observe, over time, an unhealthy LSM and cluster instability.
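The two CPU thresholds in this section (4 active statements per vCPU, and 80% CPU) can be turned into quick checks. The Python sketch below is illustrative only. The function names are hypothetical. Treating "persistently" as a fixed number of consecutive samples above the threshold is an assumption, not a definition from the guidance.

```python
def max_active_statements(total_vcpus, per_vcpu=4):
    """Upper bound on concurrently executing statements: 4 x vCPUs in the cluster."""
    return per_vcpu * total_vcpus


def cpu_persistently_high(samples, threshold=80.0, min_consecutive=6):
    """Return True if CPU percent stays above `threshold` for `min_consecutive`
    samples in a row. The consecutive-sample rule is an assumed reading of
    "persistently"."""
    run = 0
    for pct in samples:
        run = run + 1 if pct > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Example: a 3-node cluster with 4 vCPUs per node.
print(max_active_statements(3 * 4))  # prints 48
print(cpu_persistently_high([85, 90, 70, 95], min_consecutive=3))  # prints False
```

In the second call, the dip to 70% resets the run, so the brief spikes do not count as persistent.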
