Solace PubSub+ Troubleshooting

This topic describes how operators and developers can troubleshoot instances of Solace PubSub+ services.

A Solace PubSub+ Service Instance is backed by a Message VPN on one or many Solace PubSub+ Software Event Broker, depending on the chosen plan.

You can discover the Solace PubSub+ Software Event Broker backing a service instance from the Solace PubSub+ Credentials which become available by binding an app to the instance or creating a service key for the instance.

Operator Troubleshooting

The operator will have access to the logs of all the components of Solace PubSub+ for VMware Tanzu:

  • Service Broker Logs
  • Service Broker Agent Logs
  • Solace PubSub+ Event Broker Logs

An installation of Solace PubSub+ for VMware Tanzu may be configured to use System Logging as a means of gathering all logs.

Accessing Logs

The following locations in Solace PubSub+ BOSH VMs will hold logs of interest that can help in troubleshooting issues:

  • All VM job logs
    • /var/vcap/sys/log
  • All Solace PubSub+ Event Broker Logs
    • /var/vcap/store/containers/pubsub/volumes/jail/logs

The service broker logs are collected in the management VM under /var/vcap/sys/log/broker-logs.

When the Management app is deployed in High Availability mode, the service broker logs are captured on both the primary and backup VMS so it does not matter from which you get the logs.

This means that you can use the bosh logs command to download the service broker logs along with the other logs from the management VM.

You can also watch service broker logs as follows:

  1. Set your API endpoint to the Cloud Controller of your deployment.

    $ cf api api.YOUR-SYSTEM-DOMAIN
    Setting api endpoint to api.YOUR-SYSTEM-DOMAIN...
    OK
    API endpoint:  https://api.YOUR-SYSTEM-DOMAIN (API version: 2.82.0)
    Not logged in. Use 'cf login' to log in.
    

  2. Log in to your deployment and select an org and a space.

    $ cf login
    API endpoint: https://api.YOUR-SYSTEM-DOMAIN
    Email> user@example.com
    Password>
    

  3. Target the solace org and solace-broker space.

    $ cf target -o solace -s solace-broker
    api endpoint:   https://api.YOUR-SYSTEM-DOMAIN
    api version:    2.82.0
    user:           admin
    org:            solace
    space:          solace-broker
    

  4. Discover the Solace PubSub+ Service Broker App name.

    $ cf apps
    Getting apps in org solace /space solace-broker as user@example.com...
    OK

    name requested state instances memory disk urls solace-broker-1.2.0 started 1/1 1G 512M solace-broker.YOUR-SYSTEM-DOMAIN

    OK

    Note: Take note of the application name. In this case ‘solace-broker-1.2.0’

  5. To watch the logs of Solace PubSub+ Service Broker

    $ cf logs solace-broker-1.2.0 | tee saved_solace-broker-1.2.0.txt
    Retrieving logs for app solace-broker-1.2.0 in org solace /space solace-broker as user@example.com...
    ....
    

Additional Diagnostics

In addition to regular logging, Solace PubSub+ Event Brokers tools can help gather diagnostics logs on demand.

For example, having identified a problem with a given Solace service instance, the operator can access the backing Solace PubSub+ Event Brokers to examine logs and gather diagnostics.

The backing Solace PubSub+ Event Brokers for a given service instance can be identified from the IPs used in the bindings or service keys, or mentioned in the service broker logs or message when an exception occurs.

The operator should look for the VM in the Solace PubSub+ bosh deployment with the matching IP.

Get additional diagnostics from “EnterpriseLarge/0” VM.

  $ bosh ssh EnterpriseLarge/0
  # sudo -i
  # /var/vcap/jobs/broker_agent/bin/gather_diagnostics.sh
  
Then you can look at the gathered logs under /var/vcap/store/diag.

In the event that a process fails and creates a core dump, the core file can be found under /var/vcap/store/cores. Core files are compressed using gzip.

Installation Issues

If you get the error

Colocated job 'solace-bosh-dns-aliases' is already added to the instance group 'compute'

then do the following:

  • Log into your bosh environment
  • Execute the command 'bosh configs’
  • Look for runtime-configs with 'solace’ in their name
  • Delete the solace runtime-configs using the bosh 'delete-config’ command
  • Redeploy

Upgrade Issues

The following issues may arise while upgrading a tile:

Error
Action Failed get_task: result: 1 of 1 drain scripts failed. Failed Jobs: containers.
ExplanationA pre-condition was not met. A high availability (HA) group was not in a healthy state at the start of the upgrade. This check is done at the shutdown of each VM in an HA group.
Possible ActionEnsure an HA Group is healthy. The user can log into the failing VM and look at the log file located at /var/vcap/sys/log/containers/drain.stdout.log to determine the underlying cause. Once an HA Group is healthy, an upgrade can be retried.


Error
Action Failed get_task: result: 1 of 1 post-start scripts failed. Failed Jobs: containers.
Explanation The upgrade was stopped due to a detected failure to ensure services remain available.
Possible ActionThe user should log into the failing VM and look at the log file located at /var/vcap/sys/log/containers/post-start.stdout.log to determine the underlying cause. Once the issue is fixed, an upgrade can be retried.


Error
Process 'mariadb_ctrl' Execution failed
ExplanationMonit reports that MariaDB failed to start. This happens when using high availability management and 2 or more MariaDB processes are shutdown at the same time. The cluster has lost quorum and MariaDB processes can’t restart automatically.
Possible ActionThe MariaDB cluster needs to be bootstrapped. This is done by running the bootstrap errand with the command: bosh -d solace_deployment run-errand bootstrap. The bootstrap errand is always safe to run as it is a noop when it is not required. If the bootstrap errand does not fix the MariaDB process, the user should look at logs under /var/vcap/sys/log/mysql/.

Note: For Shared service plans, every service on a broker must be marked for upgrade before the upgrade can happen. This prevents one tenant from upgrading a router before the other tenants are ready.

Note: If some Solace PubSub+ Event Brokers have been upgraded and others have not, it is not possible to create new bindings or service keys with the non-upgraded services. This is because features such as authentication schemes might have changed with the upgrade, and would not be compatible with the services created before the upgrade.

Developer Troubleshooting

A developer using Solace PubSub+ for VMware Tanzu may encounter errors when using Cloud Foundry Command-Line Interface (cf CLI) to perform basic operations on a Solace PubSub+ for VMware Tanzu service instance.

In general, most errors are about these types:

  • Reaching limits (plan limits, inventory fully used)
  • Communication problems between the service broker and the VM inventory it manages
  • Unexpected health state, such as a degraded high availability setup

While this list it not complete, it provides representative samples with explanations and possible resolutions. Some of the resolutions will require operator intervention.

Deployment limits reached for a given plan
Operation
cf create-service solace-pubsub enterprise-large
Error
Server error, status code: 502, error code: 10001, message: Service broker error: 
com.solace.cloudfoundry.servicebroker.exception.SolaceServiceException: No matching Solace Message VPNs available.
ExplanationThe service broker does not find any Solace Message VPNs in its inventory for the requested service plan.
Possible Action(s)The operator needs to increase the number of allocated Solace PubSub+ Event Brokers that support the given plan, in this case a enterprise-large.


Invalid Parameter
Operation
cf update-service my-large-instance -c '{"some_option": "some_value" }'
Error
Server error, status code: 502, error code: 10001, message: Service broker error: Unrecognized parameter key some_option
ExplanationThe service broker does not recognize the parameter name.
Possible Action(s)Use the correct parameters. Please see Service Specific Parameters.


Invalid Parameter: The required feature is not enabled (TCP Routes)
Operation
cf update-service my-large-instance -c '{ "mqtt_tcp_route_enabled" : "false" }'
Error
Server error, status code: 502, error code: 10001, message: Service broker error: The parameter 
mqtt_tcp_route_enabled is invalid given the current configuration. It requires [ TCP Routes Enabled ]
ExplanationAs indicated in the error, the given parameter is only valid when TCP Routes is enabled.
Possible Action(s)See Configuring TCP Routes.


Invalid Parameter: The required feature is not enabled (LDAP)
Operation
cf update-service my-large-instance -c '{ "ldapGroupAdminReadOnly" : "cn=username1,ou=groups,dc=solace,dc=com" }'
Error
Server error, status code: 502, error code: 10001, message: Service broker error: The parameter ldapGroupAdminReadOnly 
is invalid given the current configuration. It requires [ LDAP Enabled, Management Access set to LDAP ]
ExplanationAs indicated in the error, the given parameter is only valid when LDAP is enabled and Management Access is set to LDAP.
Possible Action(s)See Configuring LDAP and Management Access to use LDAP.


Communication failure: The Solace PubSub+ Event Brokers is not reachable.
Operation
cf delete-service -f my-large-instance
Error
cf service my-large-instance

Service instance: my-large-instance
Service: solace-pubsub
Bound apps:
Tags:
Plan: enterprise-large
Description: Solace PubSub+ Event Broker for real-time, multi-protocol data distribution
Documentation url: http://docs.solace.com
Dashboard: https://enterprise-large-0.YOUR-SYSTEM.DOMAIN/#/msg-vpns/djAwNQ==?token=
YWJj.eyJhY2Nlc3NfdG9rZW4iOiJ2MDA1LWlaksjdlasdjas09dasdlkansdlakslZmRmOWFlNjM3ZGYwMDcifQ%3D%3D.eHl6

Last Operation
Status: delete failed
Message: com.solace.cloudfoundry.servicebroker.exception.SolaceServiceException: 
Unable to delete Service, the associated Message VPN v001 on 10.244.0.3 is not currently available
Started: 2020-01-00T00:00:00Z
Updated: 2020-01-00T00:00:00Z
ExplanationThe service broker cannot delete this instance because the Message VPN is flagged as unavailable. This happens when the backing Solace PubSub+ Event Broker is not reachable.
Possible Action(s)The operator should examine the solace deployment and see why VM 10.244.0.3 is not available.

The service delete operation can be reattempted once the VM health is restored.


HA service degradation.
Operation
cf bind-service my-app my-ha-instance
Error
MessageRouterException: Primary VMR 10.233.0.3 HA Group Status is degraded v001 for messageVpn v001
ExplanationThe service broker will always reject an operation for an HA service when the HA status is degraded. A variety of reasons may be given in the message. Other operations on the same Message Router will fail as well.
Possible Action(s)The operator should examine the Solace deployment and see why VM 10.244.0.3 is not available.

CF operation can be reattempted once the VM health is restored and the HA status is not degraded.


Orphaned Resource Policy
Operation
cf unbind-service my-app my-service-instance
Error
Server error, status code: 502, error code: 10001, message: Service instance test-shared: Service broker error: 
Operation canceled. Orphaned Endpoints Policy was violated.
Endpoints owned by v002.cu000001 must first be deleted:
        Queues: [ someQ ]
        Topics Endpoints: [ someTPE ]
ExplanationThe service broker will reject unbinding when the client-username that was used by the application owns Endpoints such as Queues and Topic Endpoints while the current Orphaned Resource Policy is to Abort
Possible Action(s)Delete the Endpoints before unbinding. Alternatively you can adjust the Service Orphaned Resource Policy.


Standby Replication VPN Mode
Operation
cf update-service my-app -c '{"vpnOptions:"{...}}'
cf bind-service my-app my-service-instance # if LDAP is not the authentication scheme
cf unbind-service my-app my-service-instance # if LDAP is not the authentication scheme
Client Profile management in the Solace Service Dashboard
Error
The operation [...] is not supported on message-vpn v001 with its current configuration: replication admin-state [enabled] 
and config-state [standby]. Management operations may only be run on the 'active' replication group
Explanation If a service’s message VPN has been configured to be part of a replication group and its configuration state is set to 'standby’, all unsupported configuration operations are rejected. This includes creating or deleting client usernames, updating VPN options through cf update-service, or configuring client profiles through the service dashboard.
Possible Action(s)Configuration is still allowed on the 'active’ member of the replication group and will be config sync’d between the two. Managing bindings when replication configuration is set to standby and when using internal authentication is not supported.


Dashboard Troubleshooting

Dashboard 403 Error using VMware Tanzu SSO
Operation Accessing the service dashboard with Single Sign On
Error On the dashboard’s webpage,
403: Forbidden
is shown.
ExplanationThe user’s permissions are determined through a call to the CF API controller with the given service and user’s access token. CF then tells the dashboard whether the user has read and manage permissions.
Possible Action(s)Verify that the account used to access the dashboard has permission to read/manage the service instance by verifying their organization and space roles in CloudFoundry.

Note: Users with 'admin’ privileges still require an appropriate assigned role in the org and space to view and service instances and specifically require the SpaceDeveloper permission to manage the service instances. This is a limitation of VMware Tanzu/cloudfoundry UAA used in Single Sign On.

If Single Sign On access for Solace PubSub+ was revoked through the Third Party Access pane on the VMware Tanzu User Management page, logout of the dashboard by clicking the user icon in the top right corner and clicking logout and retry accessing the dashboard through the dashboard url.