VMware vROPS 6.x Cluster Having Poor Performance


Overview:

In VMware vROPS 6.x, the Cassandra database load on each clustered node can grow very high, which degrades cluster performance. VMware states that this issue is fixed in version 6.6. The same condition can also cause a “Failed to Disable” HA error on the Admin UI page.
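
The per-node load referred to here is the `Load` column of `nodetool status`, which the procedure below runs in full. As a minimal sketch, assuming the standard `nodetool status` table layout (a two-letter state code such as `UN`, then the address, then the load value and its unit), the hypothetical helper `node_loads` keeps only the address and load of each node:

```shell
#!/bin/bash
# node_loads: hypothetical helper, not part of vROPS or Cassandra.
# Reads `nodetool status` output on stdin and prints
# "<address> <load> <unit>" for each node row.
# Node rows are recognised by their first field: a two-letter state code
# (U or D for up/down, then N, L, J or M for normal/leaving/joining/moving).
node_loads() {
    awk '$1 ~ /^[UD][NLJM]$/ { print $2, $3, $4 }'
}
```

It could be used as `$VCOPS_BASE/cassandra/apache-cass*/bin/nodetool -p 9008 status | node_loads`, once before and once after the cleanup, to compare the values.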

Prerequisites:

  • Make sure snapshots exist for all nodes of the cluster.
  • Make sure there is a recent, successful image-level backup for every node of the cluster.

Procedure:

  • In the Admin UI, take all nodes offline by clicking “Take Offline” under “Cluster Status”.
  • If this button is greyed out or not available, select each node and click “Take Node Offline”.
  • If that is not possible either, use the following alternative.
  • Log in to the master node as the root user and run the commands below, then repeat the process on all other nodes in the cluster.
    • cloudpandavrops1:~ # service vmware-casa stop
    • cloudpandavrops1:~ # service vmware-vcops stop
  • Take the nodes offline in this order: data nodes, master replica, then master node.
  • Force the Cassandra DB online so that we can work on it while no other reads or writes are taking place.
    • cloudpandavrops1:~ # service vmware-vcops start cassandra force
  • Once the Cassandra DB is online, run commands against it to truncate the following three tables.
    • globalpersistence.activity_2_tbl
    • globalpersistence.activityresults_tbl
    • globalpersistence.queueid_tbl
  • Before executing the DB commands, check the load on each vROPS Cassandra DB node.
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cass*/bin/nodetool -p 9008 status
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
  • Then truncate the three tables, one command per table.
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activity_2_tbl" &
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activityresults_tbl" &
    • cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.queueid_tbl" &
  • Once these tables are truncated, run a repair operation against the DB to ensure all nodes are in sync.
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password repair -par
  • Once everything is in sync, confirm that the load on the Cassandra DB has dropped. In our environment it went from 18GB to 1GB.
    • cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
  • Bring the cluster back online and, after some time, check that all objects are back in a “Collecting” or “Data Receiving” state.
  • If the cluster won’t come online, try forcing the Cassandra DB offline (this step is optional).
    • cloudpandavrops1:~ # service vmware-vcops stop cassandra force
  • Once that activity is complete, bring the nodes online in the reverse order.
    • cloudpandavrops1:~ # service vmware-casa start
    • cloudpandavrops1:~ # service vmware-vcops start
  • Click around in the environment and confirm that the UI is as responsive as expected.
  • Once the environment is confirmed to be back online and behaving as expected, check the Admin UI for an HA error such as “Failed to disable HA”.
  • If this error appears, follow the steps below to correct it. The error itself does not affect cluster functionality.
  • This requires taking the 'casa' and vROPS services offline so that we can edit a file that casa reads at startup.
    • cloudpandavrops1:~ # service vmware-casa stop
    • cloudpandavrops1:~ # service vmware-vcops stop
    • cloudpanda01:~ # vi /storage/db/casa/webapp/hsqldb/casa.db.script
      • Change "is_ha_enabled":failed to disable to "is_ha_enabled":true
      • Change "initialization_state":"failed to disable" to "initialization_state":"NONE"
  • After the change, the line should look something like this sample:
INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"vROPS-Prod","is_ha_enabled":true,"ha_transition_state":"NONE","initialization_state":"NONE","remove_node_state":"NONE","document_version":84,"document_time":1515169871248,"online_state":"ONLINE","online_state_time":1515169871242,"online_state_reason":"","cluster_members":[],"admin_slices":[],"installation_state":"DONE","slices":{"a436f79c-dc0c-40ec-a915-b7e256ba6ef6":{"slice_uuid":"a436f79c-dc0c-40ec-a915-b7e256ba6ef6","is_admin_node":true,"ip_address":"cloudpandavrops1.ce.corp.com","preferred_addresses":{},"slice_name":"cloudpandavrops1","membership_state":null},"0cdd8bc1-1610-411e-9c8b-fae36b46857a":{"slice_uuid":"0cdd8bc1-1610-411e-9c8b-fae36b46857a","is_admin_node":false,"ip_address":"cloudpandavrops2.ce.corp.com","preferred_addresses":{},"slice_name":"cloudpandavrops2","membership_state":null}}}')
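
Since this file is edited by hand, it can help to confirm the three state fields before restarting the services. The following is a minimal sketch, assuming the file layout shown in the sample line; `show_ha_fields` is a hypothetical helper name, and only standard `grep` options are used:

```shell
#!/bin/bash
# show_ha_fields: hypothetical helper, not a vROPS tool.
# Prints the HA-related key/value pairs found in casa.db.script
# (or in the file given as the first argument), one per line.
show_ha_fields() {
    local file="${1:-/storage/db/casa/webapp/hsqldb/casa.db.script}"
    grep -oE '"(is_ha_enabled|ha_transition_state|initialization_state)":("[^"]*"|[a-z]+)' "$file"
}
```

Taking a copy of the file before editing, for example with `cp -p`, keeps a way back if the edit goes wrong. For the sample line above, `show_ha_fields` would print `"is_ha_enabled":true`, `"ha_transition_state":"NONE"` and `"initialization_state":"NONE"`, each on its own line.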
  • Bring casa and vROPS back online, then verify that HA is reported as “Enabled”, as expected.
    • cloudpandavrops1:~ # service vmware-casa start
    • cloudpandavrops1:~ # service vmware-vcops start
  • At this point, let the environment run as is for some time and keep monitoring it.
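
The three truncate commands in the procedure differ only in the table name, so they can be generated from a list. This is a minimal sketch, not a VMware-provided script: `print_truncate_cmds` is a hypothetical helper, the default for `VCOPS_BASE` is an assumption inferred from the `/usr/lib/vmware-vcops` path used above, and the function only prints the commands so they can be reviewed before being run by hand.

```shell
#!/bin/bash
# Tables named in the procedure above.
TABLES="activity_2_tbl activityresults_tbl queueid_tbl"

# print_truncate_cmds: hypothetical helper. Prints one cqlsh command per
# table and executes nothing.
print_truncate_cmds() {
    # Assumption: VCOPS_BASE falls back to /usr/lib/vmware-vcops when unset.
    local base="${VCOPS_BASE:-/usr/lib/vmware-vcops}"
    local table
    for table in $TABLES; do
        printf '%s --ssl --cqlshrc %s -e "consistency quorum; truncate globalpersistence.%s"\n' \
            "$base/cassandra/apache-cassandra-2.1.8/bin/cqlsh" \
            "$base/user/conf/cassandra/cqlshrc" \
            "$table"
    done
}
```

Printing instead of executing is a deliberate choice here: truncation cannot be undone, so each command is best checked against the snapshot and backup prerequisites before it is run.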

vROPs Log Files:

  • # cd /storage/log/vcops/log/casa
    • # tail pakManager.actions.log
    • # tail casa-gc.log
    • # tail casa-performance.log
    • # tail casa-rest-calls.log
    • # tail casa.log
    • # tail casa_cassandra.log
    • # tail catalina.out
    • # tail pakManager.query.log
  • # cd /var/log/vcops_logs/ or # cd /var/log/vmware/vcops
    • # tail vcops-services-startup.log
    • # tail vcops-firstboot.log
    • # tail vcops-upgrade.log
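
Checking the casa logs one at a time gets repetitive, so the same `tail` calls can be wrapped in a loop. This is a minimal sketch under stated assumptions: `tail_casa_logs` is a hypothetical helper, the file names are the ones listed above, and files that do not exist on a given node are silently skipped.

```shell
#!/bin/bash
# casa log files listed in the article.
CASA_LOGS="pakManager.actions.log casa-gc.log casa-performance.log casa-rest-calls.log casa.log casa_cassandra.log catalina.out pakManager.query.log"

# tail_casa_logs: hypothetical helper.
# $1 = log directory (defaults to the casa log directory above)
# $2 = number of lines to show per file (defaults to 10)
tail_casa_logs() {
    local dir="${1:-/storage/log/vcops/log/casa}"
    local lines="${2:-10}"
    local f
    for f in $CASA_LOGS; do
        # Skip files that are missing or unreadable on this node.
        if [ -r "$dir/$f" ]; then
            echo "==> $f <=="
            tail -n "$lines" "$dir/$f"
        fi
    done
}
```

Calling `tail_casa_logs` with no arguments shows the last 10 lines of each file; `tail_casa_logs /storage/log/vcops/log/casa 50` shows the last 50.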