In this blog post, I’m going to give a detailed guide on how to monitor a Cassandra cluster with Prometheus and Grafana.
For this, I’m using a new VM which I’m going to call “Monitor VM”. This post covers how to install the tools. In a second one, I’m going to go through the details of how to use and configure Grafana dashboards to get the most out of your monitoring!
High level plan
Monitor VM
- Install Prometheus
- Configure Prometheus
- Install Grafana
Cassandra VMs
- Download Prometheus JMX-Exporter
- Configure JMX-Exporter
- Configure Cassandra
- Restart Cassandra
Detailed Plan
Monitor VM
Step 1. Install Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.3.1/prometheus-2.3.1.linux-amd64.tar.gz
$ tar xvfz prometheus-*.tar.gz
$ cd prometheus-*
Step 2. Configure Prometheus
$ vim /etc/prometheus/prometheus.yaml

global:
  scrape_interval: 15s

scrape_configs:
  # Cassandra config
  - job_name: 'cassandra'
    scrape_interval: 15s
    static_configs:
      - targets: ['cassandra01:7070', 'cassandra02:7070', 'cassandra03:7070']
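To make the targets list concrete, here is a minimal Python sketch of the URLs Prometheus ends up scraping for this job. It assumes the Prometheus defaults of the http scheme and the /metrics path, since the config above overrides neither; the function name scrape_urls is only for illustration.

```python
def scrape_urls(targets, scheme="http", metrics_path="/metrics"):
    """Return the URL scraped for each static target (host:port)."""
    return [f"{scheme}://{target}{metrics_path}" for target in targets]


# The three targets from the 'cassandra' job above.
targets = ["cassandra01:7070", "cassandra02:7070", "cassandra03:7070"]
for url in scrape_urls(targets):
    print(url)
```

Port 7070 here has to match the port given to the JMX-Exporter agent on the Cassandra nodes later on.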
Step 3. Create storage and start Prometheus
$ mkdir /data
$ chown prometheus:prometheus /data
$ prometheus --config.file=/etc/prometheus/prometheus.yaml --storage.tsdb.path=/data
Step 4. Install Grafana
$ wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana_5.1.4_amd64.deb
$ sudo apt-get install -y adduser libfontconfig
$ sudo dpkg -i grafana_5.1.4_amd64.deb
Step 5. Start Grafana
$ sudo service grafana-server start
Cassandra nodes
Step 1. Download JMX-Exporter:
$ mkdir /opt/jmx_prometheus
$ cd /opt/jmx_prometheus
$ wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.0/jmx_prometheus_javaagent-0.3.0.jar
Step 2. Configure JMX-Exporter
$ vim /opt/jmx_prometheus/cassandra.yml

lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames: [
  "org.apache.cassandra.metrics:type=ColumnFamily,name=RangeLatency,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=LiveSSTableCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SSTablesPerReadHistogram,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SpeculativeRetries,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOnHeapSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableSwitchCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableLiveDataSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableColumnsCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOffHeapSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalsePositives,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalseRatio,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterDiskSpaceUsed,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterOffHeapMemoryUsed,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=TotalDiskSpaceUsed,*",
  "org.apache.cassandra.metrics:type=CQL,name=RegularStatementsExecuted,*",
  "org.apache.cassandra.metrics:type=CQL,name=PreparedStatementsExecuted,*",
  "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks,*",
  "org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks,*",
  "org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted,*",
  "org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Latency,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Unavailables,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Timeouts,*",
  "org.apache.cassandra.metrics:type=Storage,name=Exceptions,*",
  "org.apache.cassandra.metrics:type=Storage,name=TotalHints,*",
  "org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress,*",
  "org.apache.cassandra.metrics:type=Storage,name=Load,*",
  "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=CompletedTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=PendingTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=ActiveTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=TotalBlockedTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=CurrentlyBlockedTasks,*",
  "org.apache.cassandra.metrics:type=DroppedMessage,name=Dropped,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Hits,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Requests,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Entries,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size,*",
  "org.apache.cassandra.metrics:type=Client,name=connectedNativeClients,*",
  "org.apache.cassandra.metrics:type=Client,name=connectedThriftClients,*",
  "org.apache.cassandra.metrics:type=Table,name=WriteLatency,*",
  "org.apache.cassandra.metrics:type=Table,name=ReadLatency,*",
  "org.apache.cassandra.net:type=FailureDetector,*",
]
rules:
  - pattern: org.apache.cassandra.metrics<type=(Connection|Streaming), scope=(\S*), name=(\S*)><>(Count|Value)
    name: cassandra_$1_$3
    labels:
      address: "$2"
  - pattern: org.apache.cassandra.metrics<type=(ColumnFamily), name=(RangeLatency)><>(Mean)
    name: cassandra_$1_$2_$3
  - pattern: org.apache.cassandra.net<type=(FailureDetector)><>(DownEndpointCount)
    name: cassandra_$1_$2
  - pattern: org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "$1": "$2"
  - pattern: org.apache.cassandra.metrics<type=(Table), keyspace=(\S*), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$4_$5
    labels:
      "keyspace": "$2"
      "table": "$3"
  - pattern: org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "type": "$2"
  - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(\S*)><>(Count|Value)
    name: cassandra_$1_$5
    labels:
      "$1": "$4"
      "$2": "$3"
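To see what a rule does, it helps to run one by hand. The sketch below is only an approximation in Python of how a pattern, name and labels entry turn an MBean attribute into a Prometheus metric. It is not the exporter's own code: JMX-Exporter uses Java regular expressions and also sanitizes metric names, which is left out here. The helper apply_rule and the sample bean string (a Read latency counter) are assumptions made for illustration.

```python
import re


def apply_rule(pattern, name_template, label_templates, bean_string, lowercase=True):
    """Approximate one exporter rule: return (metric_name, labels) or None if no match."""
    match = re.search(pattern, bean_string)
    if match is None:
        return None

    def expand(template):
        # Replace $1, $2, ... with the matching capture group (empty if unset).
        return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))) or "", template)

    name = expand(name_template)
    labels = {expand(key): expand(value) for key, value in label_templates.items()}
    if lowercase:
        # Mirrors lowercaseOutputName / lowercaseOutputLabelNames: names only, not values.
        name = name.lower()
        labels = {key.lower(): value for key, value in labels.items()}
    return name, labels


# The ClientRequest rule from the config above.
PATTERN = (r"org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), "
           r"name=(\S*)><>(Count|Mean|95thPercentile)")
# Assumed string form of the MBean attribute, for illustration only.
BEAN = "org.apache.cassandra.metrics<type=ClientRequest, scope=Read, name=Latency><>Count"

print(apply_rule(PATTERN, "cassandra_$1_$3_$4", {"type": "$2"}, BEAN))
```

With this input, the three captured pieces (ClientRequest, Latency, Count) form the metric name cassandra_clientrequest_latency_count, and the scope (Read) becomes the value of the type label. That is the general idea behind every rule: collapse many MBeans into one metric name and move the variable part into a label.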
Step 3. Configure Cassandra
$ echo 'JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"' >> conf/cassandra-env.sh
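The agent option packs three things into one string: the jar path, the port the exporter listens on, and the path of its config file. As a small sketch of that layout, the Python helper below (javaagent_option and env_line are made-up names) assembles the line, assuming the jar and config live in the /opt/jmx_prometheus directory used in Steps 1 and 2.

```python
def javaagent_option(jar_path, port, config_path):
    """Build the JVM flag in the form -javaagent:<jar>=<port>:<config>."""
    return f"-javaagent:{jar_path}={port}:{config_path}"


def env_line(jar_path, port, config_path):
    """Build the line to append to cassandra-env.sh."""
    return f'JVM_OPTS="$JVM_OPTS {javaagent_option(jar_path, port, config_path)}"'


# Assumed locations: both files in /opt/jmx_prometheus, exporter on port 7070.
print(env_line(
    "/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar",
    7070,
    "/opt/jmx_prometheus/cassandra.yml",
))
```

Whatever paths you use, they have to point at the real location of the jar and the config file on each node, and the port has to match the one in the Prometheus targets.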
Step 4. Restart Cassandra
$ nodetool flush
$ nodetool drain
$ sudo service cassandra restart
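Once a node is back up, you can check that the exporter is serving Cassandra metrics before looking at Prometheus. The sketch below is one way to do that in Python, assuming the exporter answers on port 7070 at /metrics in the Prometheus text format. The function names and the sample text are made up for illustration, and fetch_metrics is defined but not called here because it needs a live node.

```python
import re
import urllib.request


def cassandra_metric_names(exposition_text, prefix="cassandra_"):
    """Return the sorted, de-duplicated metric names that start with the prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or whitespace (value).
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_metrics(host, port=7070, timeout=5):
    """Download the raw metrics text from one exporter (needs a running node)."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Made-up sample of exporter output, for illustration only.
SAMPLE = """\
# TYPE cassandra_clientrequest_latency_count untyped
cassandra_clientrequest_latency_count{type="Read",} 42.0
cassandra_storage_load 1.5E9
jvm_threads_current 57.0
"""
print(cassandra_metric_names(SAMPLE))
```

Against a real node you would call cassandra_metric_names(fetch_metrics("cassandra01")). An empty list suggests the agent did not load or the whitelist matched nothing.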
And now, if you have no errors (and you shouldn’t!) your Prometheus is ingesting your Cassandra metrics!
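If you want to confirm this from the Prometheus side, its HTTP API lists every scrape target and whether the last scrape worked. Here is a hedged Python sketch that reads the /api/v1/targets response and reports the health of the 'cassandra' job. It assumes Prometheus is on localhost:9090 (its default port) and that the response has the usual data.activeTargets layout; the function names are illustrative, and fetch_targets is not called here because it needs a running server.

```python
import json
import urllib.request


def target_health(targets_response, job="cassandra"):
    """Map each scrape URL of the given job to its health ('up', 'down' or 'unknown')."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        target["scrapeUrl"]: target["health"]
        for target in active
        if target.get("labels", {}).get("job") == job
    }


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """Fetch the targets listing from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Made-up response in the shape of the targets API, for illustration only.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "cassandra", "instance": "cassandra01:7070"},
         "scrapeUrl": "http://cassandra01:7070/metrics", "health": "up"},
        {"labels": {"job": "cassandra", "instance": "cassandra02:7070"},
         "scrapeUrl": "http://cassandra02:7070/metrics", "health": "down"},
    ]},
}
for url, health in target_health(sample).items():
    print(url, health)
```

On the Monitor VM, target_health(fetch_targets()) gives the same view as the Targets page in the Prometheus web UI.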
Wait for the next blog post where I will guide you through a good Grafana configuration!
14 Comments
Hi Carlos,
Nice article, I want to configure same in windows machine, Please help me out
Hi Carlos,
Thanks, My requirement is I have created Streaming pipeline from Oracle to cassandra. Is there any possible ways to monitor Both Table level daily counts using this approach.
@Sankar the Windows approach should be more straightforward. Just copy the configurations and start the applications where you have them extracted.
@Venkat, counting in Cassandra is a really, really tricky thing. You could use this approach to monitor the writes, but I would take it with a grain of salt. I might do a blog about that, it is a common problem!
Hi Carlos,
nice article. Do you think is possible to monitor Cassandra DSE using Azure?
Is it possible to export the metrics to Azure Log Analytics or Application Insights?
Thanks
Hello everyone…
I need this configuration for Cassandra monitoring with grafana dashboards . Please help me on this
Thanks in advance
Hi , I need dashboards for this configuration. Please help me on this. Thanks in advance
Hi Carlos,
thanks for this nice and easy to follow article.
I hope that “next blog post where I will guide you through a good Grafana configuration” will come soon, since this is where I’m stuck now ;-)
BR,
Marc
Hello,
Thanks for the details.
May you please help me understand the below rule
rules:
- pattern: org.apache.cassandra.metrics<type=(Connection|Streaming), scope=(\S*), name=(\S*)><>(Count|Value)
  name: cassandra_$1_$3
  labels:
    address: "$2"
How is this helping . I understand this is renaming the metrics but may you please elaborate on this.
Thanks
><> I think is the website attempting to convert angle brackets, this part of the config was broken for me so I just removed it. You might have better luck with the config on this page.
https://grafana.com/grafana/dashboards/5408
Yeah. (\S*)><>(Count|Value) Should be >
For whatever reason it converted it
Hi Carlos,
where is the grafana config ?
Hi Carlos,
Very well documented, Thanks.
Awaiting grafana dashboard.
Hi Carlos, nice post. I have my 3 cassandra nodes with 150gb each. I dont understand why you do at the end of the process:
nodetool flush
nodetool drain
Is that necessary? I dont want to flush all my data or write sstables to hard drive? What happens if I dont do that and just restart the cassandra service? Thanks
Hi Carlos,
I appreciate your effort . Nice Article.
Have a quick question.
I want to monitor the health of my cassandra cluster to know whether the endpoints are UP or DOWN.
Could you please suggest me the metric name or option I should explore for my requirement ?
Thank you in advance.