For the Apache Cassandra cluster (version C* 1.2.16) that I’ve supported, the monitoring of the cluster is through DataStax OpsCenter, version 4.1.2. As part of the effort to improve the monitoring capability for this cluster, my team decided first to upgrade OpsCenter to version 5.1.4, the latest available version of OpsCenter that is compatible with Cassandra 1.2.16. The same OpsCenter is also used to monitor another cluster of DataStax Enterprise (DSE) 4.5.2 (it corresponds to Apache Cassandra version 2.0.10).
During the upgrade we ran into an issue, and we couldn’t find a similar problem to this one on Google. We’d like to use this post to document the problems that we faced as well as the solutions and findings we found during the upgrade.
The OpsCenter upgrade procedure is as straightforward as what is described in the DataStax OpsCenter document. After OpsCenter upgrade, the OpsCenter web portal detects mismatched version of datastax-agents on all nodes. Choose the “FixMe” option from the portal to upgrade datastax-agents to version 5.1.4 on all nodes being monitored. After the datastax-agent upgrade, we addressed some datastax-agent configuration issues in “address.yaml” file to reflect the changes brought by the new version of OpsCenter/datastax-agent.
After all this was done, we double checked the log files for OpsCenter and datastax-agent. The OpsCenter log file was mostly clear, the datastax-agent log for DSE 4.5.2 cluster was also clear, but the datastax-agent log for Cassandra 1.2.16 was NOT. The corresponding OpsCenter web portal was not able to display Cassandra metrics for C* 1.2.16 cluster.
On each of the datastax-agent log files in the C* 1.2.16 cluster, we saw a lot of repeating errors like the ones below:
ERROR [async-dispatch-3] 2016-02-19 12:57:52,750 There was an error when attempting to load stored rollups.
com.datastax.driver.core.exceptions.InvalidQueryException: Undefined name key in where clause (‘key EQ ‘<… …>”)
ERROR [cassandra-processor-1] 2016-02-19 13:00:02,295 Error when proccessing cassandra callcom.datastax.driver.core.exceptions.InvalidQueryException: Unknown identifier key
Problem Analysis and Solution
The fact that the error showed up in datastax-agent log file gave me a hint that the error might be related with datastax-agent failing to write collected metrics into OpsCenter tables. So as the first step of the analysis, I compared the schema of “OpsCenter” keyspace between the two clusters monitored. Below is the example of two OpsCenter table definition comparison between the two clusters.
C* 1.2.16 Cluster
DSE 4.5.3 Cluster
|CREATE TABLE events ( “KEY” blob, column1 blob, value blob, PRIMARY KEY (“KEY”, column1)||CREATE TABLE events ( key text, action bigint, level bigint, success boolean, time bigint, PRIMARY KEY ((key))|
|CREATE TABLE events_timeline ( “KEY” blob, column1 bigint, value blob, PRIMARY KEY (“KEY”, column1)||CREATE TABLE events_timeline ( key text, column1 bigint, value blob, PRIMARY KEY ((key), column1)|
From this table, we can clearly see that the upgrade process of OpsCenter and datastax-agent to verion 5.1.4 somehow doesn’t migrate OpsCenter schema properly for C* 1.2.16 cluster. The theory for the error is that the upgraded datastax-agent in C* 1.2.16 cluster is trying to query or update Cassandra metrics from OpsCenter tables in a fashion that matches the OpsCenter schema as in the DSE 4.5.2 cluster. But the actual OpsCenter schema in C* 1.2.16 still has the old definition, thus causing the invalid query exception as presented in the log file.
Once the problem is clear, the solution is straightforward. The steps are summarized below:
In C* 1.2.16 cluster,
- Take a snapshot for OpsCenter keyspace on all nodes
- Stop DataStax agents on all nodes, so they won’t try to write metrics into OpsCenter tables.
- Use CQL to drop OpsCenter tables and re-create them, matching the OpsCenter schema for DSE 4.5.3 cluster. Make sure that all table properties are the same.
- Once OpsCenter schema is recreated. Start DataStax agents on all nodes in.
- Verify the agent log file that the error message is gone.
- Restart OpsCenter service.
After these steps, we double checked the log files for all datastax-agents, and for OpsCenter, and we can confirm that there were no errors. The OpsCenter web portal was also able to display the Cassandra metrics properly.