
Sizing PCF Metrics for Your System

This topic describes how to configure Pivotal Cloud Foundry (PCF) Metrics for high availability. Operators can use these procedures to optimize PCF Metrics for high capacity.

For more information about PCF Metrics components, see the PCF Metrics Product Architecture topic.

Configuring the Metrics Datastore

PCF Metrics stores metrics in a MySQL cluster.

To customize PCF Metrics for high capacity, you can add memory and persistent disk to the MySQL server nodes.

Considerations for Scaling

Because apps emit logs at different volumes and frequencies, do not scale the MySQL server nodes in proportion to the number of app instances in your deployment. Because these components are easy to scale, Pivotal recommends starting with a minimal configuration, evaluating performance over time, and then scaling as needed. Scaling up persistent disk does not put existing data at risk.

To determine approximate starting memory and disk allocation for each MySQL server node, use the following example configurations:

Small: Test Use (~100 Apps)

WARNING: Do NOT attempt to size your PCF Metrics deployment any smaller than this setting. These are the minimum resources required.

Job           | Instances | Persistent Disk Type | VM Sizing
MySQL Server  | 1         | 50 GB                | CPU: 1, RAM: 1 GB, disk: 8 GB
MySQL Proxy   | 1         | None                 | CPU: 1, RAM: 1 GB, disk: 8 GB
MySQL Monitor | 0         | None                 | N/A

Medium: Production Use (~5,000 Apps)

Job           | Instances | Persistent Disk Type | VM Sizing
MySQL Server  | 2         | 500 GB               | CPU: 2, RAM: 12 GB, disk: 32 GB
MySQL Proxy   | 1         | None                 | CPU: 2, RAM: 4 GB, disk: 8 GB
MySQL Monitor | 1         | None                 | CPU: 2, RAM: 4 GB, disk: 8 GB

Large: Production Use (~15,000 Apps)

Job           | Instances | Persistent Disk Type | VM Sizing
MySQL Server  | 3         | ~1.5 TB              | CPU: 8, RAM: 64 GB, disk: 160 GB
MySQL Proxy   | 1         | None                 | CPU: 2, RAM: 8 GB, disk: 32 GB
MySQL Monitor | 1         | None                 | CPU: 2, RAM: 4 GB, disk: 8 GB

Use these configurations as guidelines. If your deployment adds app instances, consider configuring your MySQL server nodes with additional memory and persistent disk.

Procedures for Scaling

After determining the amount of memory and persistent disk required for each MySQL server node, perform the following steps:

  1. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
  2. From the Settings tab of the Metrics tile, click Resource Config.
  3. Modify the memory limit or persistent disk allocation as needed for your environment.

    WARNING: There have been issues with Ops Manager BOSH Director using persistent disks larger than 2 TB.

  4. If you modify the memory allocation for the MySQL server nodes, you must also update the MySQL InnoDB Buffer Size setting. Pivotal recommends that you set the buffer size to 80% of the memory allocated to that VM. To change the MySQL InnoDB Buffer Size:

    1. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
    2. From the Settings tab of the Metrics tile, click Data Store.
    3. Update the MySQL InnoDB Buffer Size input field.
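
For example, with the medium configuration above, a MySQL server node has 12 GB of RAM, so 80% of that memory is roughly 9.8 GB. The following calculation is an illustration only; confirm the units that the MySQL InnoDB Buffer Size field expects in your version of the tile before entering a value.

$ echo "12 * 1024 * 0.80" | bc    # 80% of 12 GB, expressed in MB
9830.40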

Configuring the Log Datastore

PCF Metrics uses Elasticsearch to store logs. Each Elasticsearch node contains multiple shards of log data, divided by time slice. To customize PCF Metrics for high capacity, you can scale the number of Elasticsearch data nodes.

Considerations for Scaling

To determine the number of Elasticsearch data nodes required for PCF Metrics, consider how many logs the apps in your deployment emit and the average size of each log.

If your average log size is 1 kilobyte, and each node has 1 terabyte of available disk space, then each node has a maximum storage capacity of 1 billion log messages. If your apps emit 3 billion logs over a 24-hour period, you need at least 3 nodes to hold the data and 3 additional nodes for high-availability replication.

This example assumes that your apps emit logs at a continuous rate over 24 hours. However, apps typically do not emit logs continuously. If your apps emit 2 billion of the 3 billion logs between 8 AM and 4 PM, you must determine the minimum node-to-shard ratio that accommodates that rate over the 8-hour period. Because your apps emit 1 billion logs during each 4-hour span of that period, each node can hold at most 4 hours' worth of shards, so you need at least 6 nodes (24 hours / 4 hours of shards per node = 6 nodes) to hold the data and an additional 6 nodes for high-availability replication.

You can also use the throughput of logs per minute to help determine how many Elasticsearch data nodes to provision. As a general rule, provision one data node for every 5000 logs received in one minute.
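
The following shell sketch works through the arithmetic above using the numbers from the first example. The variable names and the LOGS_PER_MIN value are illustrative only; substitute the log volumes, average log size, and per-node disk capacity measured in your own deployment, and account separately for peak-hour rates as described above.

$ LOGS_PER_DAY=3000000000     # total logs emitted over 24 hours (example value)
$ AVG_LOG_KB=1                # average log size in KB (example value)
$ NODE_DISK_KB=1000000000     # ~1 TB of usable disk per data node, in KB

# Messages each node can hold: 1 TB / 1 KB = ~1 billion
$ echo "$NODE_DISK_KB / $AVG_LOG_KB" | bc
1000000000

# Nodes needed to hold one day of data, doubled for high-availability replication
$ echo "$LOGS_PER_DAY / ($NODE_DISK_KB / $AVG_LOG_KB) * 2" | bc
6

# Throughput rule of thumb: one data node per 5,000 logs received per minute
$ LOGS_PER_MIN=30000          # sustained ingest rate (example value)
$ echo "$LOGS_PER_MIN / 5000" | bc
6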

Procedures for Scaling

WARNING: If you modify the number of Elasticsearch instances, the Elasticsearch cluster temporarily enters an unhealthy state while shards are reallocated, and it does not ingest any new log data during that period.

After determining the number of Elasticsearch nodes needed for your deployment, perform the following steps to scale your nodes:

  1. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
  2. From the Settings tab of the Metrics tile, click Resource Config.
  3. Locate the ElasticSearchData job and select the dropdown menu under Instances to change the number of instances.
  4. Click Save.

Configuring the Temporary Datastore

Note: PCF Metrics uses the temporary datastore only when upgrading from v1.3 or later.

During Elasticsearch downtime, including upgrades, PCF Metrics stores logs from the Loggregator Firehose in a temporary Redis datastore. When the upgrade finishes, the Elasticsearch logqueue restores the data from the temporary datastore by writing it to Elasticsearch. See PCF Metrics Product Architecture for more information.

Procedure for Scaling

This procedure describes how to size the temporary datastore for your system by calculating your requirements and setting the VM size in Ops Manager.

Calculate Storage Requirements

Calculate the storage requirements of the temporary datastore by estimating a value for each of the following variables:

AvgLogSize
  Collect a sample of logs from your apps and calculate the average size of a log message in KB.

  Note: PCF apps run on Diego cells, which do not emit individual logs greater than approximately 60 KB. See Log Message Size Constraints.

LogsPerSec
  1. Install the nozzle plugin for the cf CLI if you do not already have it.
  2. Run the following command to estimate the volume of LogMessages emitted from the Loggregator Firehose. Every 10 seconds, the command prints the average number of log messages per second over the previous 10 seconds.
    $ cf nozzle -filter LogMessage | pv -l -i10 -r >/dev/null
    [50.5 /s]

UpgradeTime
  Estimate how long your PCF Metrics upgrade will take. In tests in a large production environment, the upgrade takes 20–30 minutes.

RedisRAM
  Find the value for RedisRAM, in KB, using the following calculation, where AvgLogSize is in KB and UpgradeTime is in minutes:
    RedisRAM = AvgLogSize * LogsPerSec * UpgradeTime * 60

AdjustedRedisRAM
  Because Redis only allows VMs to use 45% of their allocated RAM, adjust the storage requirement as follows:
    AdjustedRedisRAM = RedisRAM / 0.45
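
As an illustration, the following calculation uses hypothetical values of 1 KB for AvgLogSize, 5000 for LogsPerSec, and 30 minutes for UpgradeTime; substitute the values you measured above.

$ AVG_LOG_SIZE_KB=1       # AvgLogSize (example value)
$ LOGS_PER_SEC=5000       # LogsPerSec (example value)
$ UPGRADE_TIME_MIN=30     # UpgradeTime (example value)

# RedisRAM in KB: AvgLogSize * LogsPerSec * UpgradeTime * 60
$ echo "$AVG_LOG_SIZE_KB * $LOGS_PER_SEC * $UPGRADE_TIME_MIN * 60" | bc
9000000

# AdjustedRedisRAM in KB: RedisRAM / 0.45, roughly 20 GB in this example
$ echo "9000000 / 0.45" | bc
20000000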

Set VM Size

Follow these steps to configure the size of the Redis VM for the temporary datastore based on your calculations.

Note: If the temporary datastore becomes full, Redis uses the volatile-ttl eviction policy so that it can continue storing incoming logs. For more information, see the Eviction policies section of Using Redis as an LRU cache.

  1. Navigate to the Ops Manager Installation Dashboard and click the Redis tile.

  2. From the Settings tab, click Resource Config.

  3. In the Dedicated Node row, under VM Type, select an option with at least enough RAM to support the value for AdjustedRedisRAM.

  4. Click Save.

Configuring the Ingestor

PCF Metrics deploys the Ingestor as an app within PCF. The Ingestor consumes logs and metrics from the Loggregator Firehose, sending metrics and logs to their respective Logqueue apps. To customize PCF Metrics for high capacity, you can scale the number of Ingestor app instances and increase the amount of memory per instance.

Considerations for Scaling

Because apps emit logs at different volumes and frequencies, you should not scale the Ingestor by matching the number of Ingestor instances to the number of app instances in your deployment.

Because Ingestor performance is affected by Loggregator performance, it can be difficult to determine the proper configuration in advance. Because these components are easy to scale, Pivotal recommends starting with a minimal configuration, evaluating performance over time, and then scaling as needed.

The Ingestor app can handle relatively large loads. For high availability, you must have at least two instances of the Ingestor app running. If your deployment runs fewer than 2000 app instances, two instances of the Ingestor app are sufficient.

Procedures for Scaling

WARNING: If you decrease the number of Ingestor instances, you may lose data currently being processed on the instances you eliminate.

After determining the number of Ingestor app instances needed for your deployment, perform the following steps to scale the Ingestor:

  1. Target your Cloud Controller with the Cloud Foundry Command Line Interface (cf CLI). If you have not installed the cf CLI, see the Installing the cf CLI topic.

    $ cf api api.YOUR-SYSTEM-DOMAIN
    Setting api endpoint to api.YOUR-SYSTEM-DOMAIN...
    OK
    API endpoint:   https://api.YOUR-SYSTEM-DOMAIN (API version: 2.54.0)
    Not logged in. Use 'cf login' to log in.
    

  2. Log in with your UAA administrator credentials. To retrieve these credentials, navigate to the Pivotal Elastic Runtime tile in the Ops Manager Installation Dashboard and click Credentials. Under UAA, click Link to Credential next to Admin Credentials and record the password.

    $ cf login
    API endpoint: https://api.YOUR-SYSTEM-DOMAIN

    Email> admin

    Password>
    Authenticating...
    OK

  3. When prompted, target the metrics space.

    Targeted org system

    Select a space (or press enter to skip):
    1. system
    2. notifications-with-ui
    3. autoscaling
    4. metrics

    Space> 4
    Targeted space metrics

    API endpoint: https://api.YOUR-SYSTEM-DOMAIN (API version: 2.54.0)
    User:         admin
    Org:          system
    Space:        metrics

  4. Scale your Ingestor app to the desired number of instances:

    $ cf scale metrics-ingestor -i INSTANCE-NUMBER

  5. Evaluate the CPU and memory load on your Ingestor instances:

    $ cf app metrics-ingestor
    Showing health and status for app metrics-ingestor in org system / space metrics as admin...
    OK
    
    requested state: started
    instances: 1/1
    usage: 1G x 1 instances
    urls:
    last uploaded: Sat Apr 23 16:11:29 UTC 2016
    stack: cflinuxfs2
    buildpack: binary_buildpack

         state     since                    CPU    memory        disk          details
    #0   running   2016-07-21 03:49:58 PM   2.9%   13.5M of 1G   12.9M of 1G

    If your average memory usage exceeds 50% or your CPU consistently averages over 85%, add more instances with cf scale metrics-ingestor -i INSTANCE-NUMBER. A scripted version of this CPU check appears after these steps.

    In general, you should scale the Ingestor app by adding additional instances. However, you can also scale the Ingestor app by increasing the amount of memory per instance:

    $ cf scale metrics-ingestor -m NEW-MEMORY-LIMIT

    For more information about scaling app instances, see the Scaling an Application Using cf scale topic.
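
The following sketch scripts the CPU check from step 5. It checks only the CPU threshold and assumes the cf CLI v6 output format shown above, where each instance row begins with # and the CPU column ends with a percent sign; treat it as a starting point rather than a definitive check, and verify it against the output of your CLI version.

$ cf app metrics-ingestor | awk '
    /^#[0-9]+/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /%$/) {                       # CPU column, for example "2.9%"
          cpu = $i
          sub(/%/, "", cpu)
          if (cpu + 0 > 85) print $1 " CPU is above 85% (" $i ")"
        }
    }'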

Configuring the Logqueues

PCF Metrics deploys a MySQL Logqueue and an Elasticsearch Logqueue as apps within PCF. The MySQL logqueue consumes metrics from the Ingestor and forwards them to MySQL. The Elasticsearch logqueue consumes logs from the Ingestor and forwards them to Elasticsearch. To customize PCF Metrics for high capacity, you can scale the number of Logqueue app instances and increase the amount of memory per instance.

Considerations for Scaling

The number of MySQL and Elasticsearch Logqueue instances you need depends on the volume of logs and metrics that the Ingestor forwards. As a general rule, add two Elasticsearch Logqueue instances for every 45,000 logs per minute, and one MySQL Logqueue instance for every 17,000 metrics per minute. These are general estimates, and you may need fewer instances depending on your deployment. To optimize resource allocation, provision fewer instances initially and add instances until you achieve the desired performance.
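
For example, with hypothetical rates of 90,000 logs per minute and 68,000 metrics per minute, these rules of thumb suggest the following starting instance counts; substitute the rates observed in your own deployment.

$ LOGS_PER_MIN=90000        # logs forwarded by the Ingestor per minute (example value)
$ METRICS_PER_MIN=68000     # metrics forwarded by the Ingestor per minute (example value)

# Elasticsearch Logqueue instances: 2 for every 45,000 logs per minute
$ echo "$LOGS_PER_MIN / 45000 * 2" | bc
4

# MySQL Logqueue instances: 1 for every 17,000 metrics per minute
$ echo "$METRICS_PER_MIN / 17000" | bc
4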

Procedures for Scaling

To modify your Elasticsearch Logqueue app instances, you must first target your Cloud Controller, log in with your UAA administrator credentials, and target the metrics space by following steps 1-3 in the previous section.

To scale the number of Elasticsearch Logqueue app instances, run the following command:

$ cf scale elasticsearch-logqueue -i INSTANCE-NUMBER

To scale the memory limit for each Elasticsearch Logqueue app instance, run the following command:

$ cf scale elasticsearch-logqueue -m NEW-MEMORY-LIMIT

To modify your MySQL Logqueue app instances, you must first target your Cloud Controller, log in with your UAA administrator credentials, and target the metrics space by following steps 1-3 in the previous section.

To scale the number of MySQL Logqueue app instances, run the following command:

$ cf scale mysql-logqueue -i INSTANCE-NUMBER

To scale the memory limit for each MySQL Logqueue app instance, run the following command:

$ cf scale mysql-logqueue -m NEW-MEMORY-LIMIT

WARNING: If you decrease the number of Logqueue instances, you may lose data currently being processed on the instances you eliminate.
