VMware vROPS 6.x Cluster Having Poor Performance



In VMware vROPS 6.x, the Cassandra database load on each clustered node can sometimes become very high, as shown in the screenshot below. VMware states that this issue was fixed in version 6.6. It can also cause a “Failed to Disable” HA error on the Admin UI page.


  • Make sure snapshots exist for all nodes of the cluster.
  • Make sure there is a recent, successful image-level backup of every node in the cluster.


  • In the Admin UI, take all nodes offline by clicking “Take Offline” under “Cluster Status”.
  • If this button is greyed out or not available, select each node and click Take Node Offline.
  • If you are unable to do the above step, use the following commands as an alternative.
  • Log in to the master node as the root user and run them, then repeat the process on all other nodes in the cluster.
    • cloudpandavrops1:~ # service vmware-casa stop
    • cloudpandavrops1:~ # service vmware-vcops stop
  • The nodes should be taken offline in this order: data nodes, master replica, then master node.
  • Force the Cassandra DB online so that we can work with it while no reads or writes are taking place.
    • cloudpandavrops1:~ # service vmware-vcops start cassandra force
  • Once the Cassandra DB is online, run commands against the DB to truncate the following three tables.
    • globalpersistence.activity_2_tbl
    • globalpersistence.activityresults_tbl
    • globalpersistence.queueid_tbl
  • Before executing the DB commands, check the load on each vROPS Cassandra DB node.
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cass*/bin/nodetool -p 9008 status
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activity_2_tbl" &
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activityresults_tbl" &
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.queueid_tbl" &
  • Once these tables are truncated, run a repair operation against the DB to ensure all nodes are in sync.
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password repair -par
  • Once everything is in sync, confirm that the load on the Cassandra DB has dropped (in this environment, from 18 GB to 1 GB).
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
  • Bring the cluster back online and, after some time, check that all objects are back in a “Collecting” or “Data Receiving” state.
  • If the cluster won’t come online, try forcing the Cassandra DB offline (this step is optional).
    • cloudpandavrops1:~ # service vmware-vcops stop cassandra force
  • Once that is complete, bring the nodes online in reverse order: master node, master replica, then data nodes.
    • cloudpandavrops1:~ # service vmware-casa start
    • cloudpandavrops1:~ # service vmware-vcops start
  • Then click around in the environment and confirm that the UI is as responsive as expected.
  • Once we confirm the environment is back online and behaving as expected, check the Admin UI for an HA error such as “Failed to disable HA”.
  • If this error appears, follow the steps below to correct it. The error does not affect cluster functionality.
  • This requires taking the 'casa' and vROPS services offline so that we can edit a file that casa reads at startup, which corrects the error on this page.
    • cloudpandavrops1:~ # service vmware-casa stop
    • cloudpandavrops1:~ # service vmware-vcops stop
    • cloudpanda01:~ # vi /storage/db/casa/webapp/hsqldb/casa.db.script
      • Change "is_ha_enabled":failed to disable to "is_ha_enabled":true
      • Change "initialization_state":"failed to disable" to "initialization_state":"NONE"
  • After the change, the line should look something like this sample:
INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"vROPS-Prod","is_ha_enabled":true,"ha_transition_state":"NONE","initialization_state":"NONE","remove_node_state":"NONE","document_version":84,"document_time":1515169871248,"online_state":"ONLINE","online_state_time":1515169871242,"online_state_reason":"","cluster_members":[],"admin_slices":[],"installation_state":"DONE","slices":{"a436f79c-dc0c-40ec-a915-b7e256ba6ef6":{"slice_uuid":"a436f79c-dc0c-40ec-a915-b7e256ba6ef6","is_admin_node":true,"ip_address":"","preferred_addresses":{},"slice_name":"cloudpandavrops1","membership_state":null},"0cdd8bc1-1610-411e-9c8b-fae36b46857a":{"slice_uuid":"0cdd8bc1-1610-411e-9c8b-fae36b46857a","is_admin_node":false,"ip_address":"","preferred_addresses":{},"slice_name":"cloudpandavrops2","membership_state":null}}}')
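The two substitutions above can also be applied non-interactively instead of in vi. The following is only a sketch under stated assumptions: the function name `fix_casa_ha_state` is made up for illustration, and the patterns assume the failed values appear as the literal text `failed to disable` (with or without surrounding quotes) directly after each key. It keeps a `.bak` copy of the file before rewriting it, and it should only be run while casa and vROPS are stopped, as described above.

```shell
#!/bin/sh
# fix_casa_ha_state FILE   (hypothetical helper, not part of vROPS)
# Copies FILE to FILE.bak, then rewrites FILE with the two
# "failed to disable" values replaced as described in the steps above.
fix_casa_ha_state() {
    file="$1"
    # Keep a backup so the original line can be restored if needed.
    cp -p "$file" "$file.bak" || return 1
    # "? makes the surrounding quotes optional, since the stored value
    # may or may not be quoted (an assumption; check your own file first).
    sed -E \
        -e 's/"is_ha_enabled":"?failed to disable"?/"is_ha_enabled":true/' \
        -e 's/"initialization_state":"?failed to disable"?/"initialization_state":"NONE"/' \
        "$file.bak" > "$file"
}

# Example call (commented out so that sourcing this file changes nothing):
# fix_casa_ha_state /storage/db/casa/webapp/hsqldb/casa.db.script
```

Comparing the result with `diff "$file.bak" "$file"` before starting the services again is a cheap way to confirm that only the two intended values changed.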
  • Once casa and vROPS are back online, verify that HA is reported as “Enabled”, as expected.
    • cloudpandavrops1:~ # service vmware-casa start
    • cloudpandavrops1:~ # service vmware-vcops start
  • At this point we can let the environment run as is for some time to monitor further.
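The three truncate commands in the procedure differ only in the table name, so they can be generated from a list rather than typed by hand. The sketch below only prints the command lines and runs nothing. The function name `print_truncate_cmds` and the `CASS_BIN` override variable are illustrative assumptions; the cqlsh path and flags are copied from the commands in this post.

```shell
#!/bin/sh
# Tables named in the procedure above.
TABLES="activity_2_tbl activityresults_tbl queueid_tbl"

# print_truncate_cmds   (hypothetical helper)
# Prints one cqlsh truncate command per table. Nothing is executed here.
# CASS_BIN may be set to point at a different Cassandra bin directory;
# by default the path used in this post is printed with $VCOPS_BASE unexpanded.
print_truncate_cmds() {
    default_bin='$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin'
    cass_bin="${CASS_BIN:-$default_bin}"
    for t in $TABLES; do
        printf '%s/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.%s"\n' \
            "$cass_bin" "$t"
    done
}

# Example (commented out): review the generated commands first.
# print_truncate_cmds
```

After reviewing the output, the lines can be pasted into the root shell on the node where Cassandra was forced online. Running them one at a time, rather than in the background with nohup, makes it easier to see which truncate failed if one does.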
vROPs Log Files:
  • #cd /storage/log/vcops/log/casa
    • #tail pakManager.actions.log
    • #tail casa-gc.log
    • #tail casa-performance.log
    • #tail casa-rest-calls.log
    • #tail casa.log
    • #tail casa_cassandra.log
    • #tail catalina.out
    • #tail pakManager.query.log
  • #cd /var/log/vcops_logs/ or #cd /var/log/vmware/vcops
    • #tail vcops-services-startup.log
    • #tail vcops-firstboot.log
    • #tail vcops-upgrade.log
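Rather than tailing each file in turn, the log directories above can be scanned in one pass. This is a minimal sketch: `tail_logs` is a made-up helper name, and it assumes the files of interest end in `.log` or are named `catalina.out`, which matches the file names listed above.

```shell
#!/bin/sh
# tail_logs DIR [LINES]   (hypothetical helper)
# Prints the last LINES lines (default 10) of every *.log file and of
# catalina.out in DIR, each preceded by a header showing the file name.
tail_logs() {
    dir="$1"
    lines="${2:-10}"
    for f in "$dir"/*.log "$dir"/catalina.out; do
        # Skip unmatched glob patterns and anything that is not a regular file.
        [ -f "$f" ] || continue
        printf '==> %s <==\n' "$f"
        tail -n "$lines" "$f"
    done
}

# Examples (commented out):
# tail_logs /storage/log/vcops/log/casa 20
# tail_logs /var/log/vmware/vcops
```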
