PCF Metrics v1.4

Sizing PCF Metrics for Your System

This topic describes how operators configure Pivotal Cloud Foundry (PCF) Metrics depending on their deployment size. Operators can use these procedures to optimize PCF Metrics for high capacity or to reduce resource usage for smaller deployment sizes.

After your deployment has been running for a while, use the information in this topic to rescale it.

If you are not familiar with the PCF Metrics components, review PCF Metrics Product Architecture before reading this topic.

To configure resources for a running deployment, see the procedures below.

Suggested Sizing by Deployment Size

Use the following tables as a guide for configuring resources for your deployment.

Estimate the size of your deployment according to how many apps are expected to be deployed.

Size     Purpose          Approximate Number of Apps
Small    Test use         100
Medium   Production use   5,000
Large    Production use   15,000

If you are using Metrics Forwarder and custom metrics, you might need to scale up the MySQL Server instance more than indicated in the tables below. Pivotal recommends you start with one of the following configurations and scale up as necessary by following the steps in Configuring the Metrics Datastore.

Deployment Resources for a Small Deployment

This table lists the resources you need to configure for a small deployment, about 100 apps.

Job                    Instances              Persistent Disk Type   VM Type
Elasticsearch Master   1                      10 GB                  small (cpu: 1, ram: 2 GB, disk: 8 GB)
Elasticsearch Data     1                      10 GB                  small (cpu: 1, ram: 2 GB, disk: 8 GB)
Redis                  1                      10 GB                  micro (cpu: 1, ram: 1 GB, disk: 8 GB)
MySQL Server           1 (not configurable)   10 GB                  small (cpu: 1, ram: 2 GB, disk: 8 GB)

Deployment Resources for a Medium Deployment

This table lists the resources you need to configure for a medium deployment, about 5,000 apps.

Job                    Instances              Persistent Disk Type   VM Type
Elasticsearch Master   1                      10 GB                  small (cpu: 1, ram: 2 GB, disk: 8 GB)
Elasticsearch Data     5                      200 GB                 small.disk (cpu: 1, ram: 2 GB, disk: 16 GB)
Redis                  1                      10 GB                  small.disk (cpu: 1, ram: 2 GB, disk: 16 GB)
MySQL Server           1 (not configurable)   500 GB                 medium (cpu: 2, ram: 4 GB, disk: 8 GB)

Deployment Resources for a Large Deployment

This table lists the resources you need to configure for a large deployment, about 15,000 apps.

Job                    Instances              Persistent Disk Type   VM Type
Elasticsearch Master   1                      10 GB                  small (cpu: 1, ram: 2 GB, disk: 8 GB)
Elasticsearch Data     10                     500 GB                 large (cpu: 2, ram: 8 GB, disk: 16 GB)
Redis                  1                      10 GB                  large (cpu: 2, ram: 8 GB, disk: 16 GB)
MySQL Server           1 (not configurable)   2 TB                   large (cpu: 2, ram: 8 GB, disk: 16 GB)

Scale the Metrics Datastore

PCF Metrics stores metrics in a single MySQL node. For PCF deployments with a high metrics load, you can add memory and persistent disk to the MySQL server node.

Considerations for Scaling the Metrics Datastore

While the default configurations in Suggested Sizing by Deployment Size above are a good starting point for your MySQL server node, they do not take into account the additional load from custom metrics. Pivotal recommends evaluating performance over a period of time and scaling up as necessary. As long as you scale persistent disk up, you do not lose any data when scaling.
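
One way to evaluate utilization before choosing new values is to check the BOSH VM vitals for the MySQL server node. The following is a minimal sketch; it assumes you have the BOSH CLI targeted at the Ops Manager BOSH Director:

    # Show VM vitals, including memory and persistent disk usage, for the
    # deployments the BOSH Director manages; the MySQL server node appears
    # under the PCF Metrics deployment.
    $ bosh vms --vitals

If persistent disk usage on the MySQL server node trends toward capacity, scale up the Persistent Disk Type before the disk fills.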

Procedure for Scaling

To scale up the MySQL server node, do the following:

  1. Determine how much memory and persistent disk are required for the MySQL server node.
  2. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
  3. From the Settings tab of the Metrics tile, click Resource Config.
  4. Enter the values for the Persistent Disk Type and VM Type.
  5. Click Save.

WARNING! If you are using PCF v1.9.x and earlier, there might be issues with the Ops Manager BOSH Director using persistent disks larger than 2 TB.

Scale the Log Datastore

PCF Metrics uses Elasticsearch to store logs. Each Elasticsearch node contains multiple shards of log data, divided by time slice.

Considerations for Scaling

Pivotal suggests starting with the default configurations for your Elasticsearch Master and Data nodes, observing the disk usage of the Data nodes, and then scaling the resources accordingly.

The following calculation attempts to measure Elasticsearch resource requirements more precisely depending on your logs load. This formula is only an approximation, and Pivotal suggests rounding the numbers up as a safety measure against undersizing Elasticsearch:

  1. Determine how many logs the apps in your deployment emit per hour (R) and the average size of each log (S).

  2. Calculate the number of instances (N) and the persistent disk size for the instances (D) you need to scale to using the following formula:

          R × S × 336 × 2 = N × D

    The formula assumes a log retention period of 336 hours (2 weeks) and that the number of Elasticsearch replica shards is 1 (the default).

    For example:

          200,000 logs/hr × 25 KB × 336 hr × 2 ≈ 7 instances × 500 GB
    or
          200,000 logs/hr × 25 KB × 336 hr × 2 ≈ 3 instances × 1 TB

As stated above, the default Elasticsearch configuration sets the number of replicas to 1, which means that every shard has 1 replica and Elasticsearch logs are HA by default.
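
The arithmetic above can also be scripted. The following is a minimal sketch; the values for R, S, and the per-node disk size are example inputs only, not recommendations:

    # Example inputs: 200,000 logs/hr at 25 KB each, 336-hr retention, 1 replica
    R=200000         # logs emitted per hour
    S=25             # average log size in KB
    HOURS=336        # retention period (2 weeks)
    COPIES=2         # primary shard plus 1 replica

    TOTAL_GB=$(( R * S * HOURS * COPIES / 1024 / 1024 ))   # KB to GB
    echo "Total Elasticsearch storage needed: ${TOTAL_GB} GB"

    DISK_GB=500      # persistent disk per Elasticsearch Data node
    INSTANCES=$(( (TOTAL_GB + DISK_GB - 1) / DISK_GB ))    # round up
    echo "Elasticsearch Data instances at ${DISK_GB} GB each: ${INSTANCES}"

With these example inputs, the script reports roughly 3,200 GB of total storage and 7 Data node instances, matching the first example above.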

Procedure for Scaling

WARNING! If you modify the number of Elasticsearch instances, the Elasticsearch cluster temporarily enters an unhealthy period during which it does not ingest any new logs data, due to shard allocation.

After determining the number of Elasticsearch nodes needed for your deployment, perform the following steps to scale your nodes:

  1. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
  2. From the Settings tab of the Metrics tile, click Resource Config.
  3. Locate the Elasticsearch Data job and select the dropdown menu under Instances to change the number of instances.
  4. Click Save.
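
After the change applies, you can confirm that shard reallocation has finished before relying on new log data. The following is a sketch that assumes you can reach an Elasticsearch node, for example over bosh ssh, and that Elasticsearch listens on its default port of 9200:

    # Query cluster health; the cluster ingests new logs again once shard
    # reallocation completes and the status is no longer "red".
    $ curl -s http://localhost:9200/_cluster/health?pretty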

Scale the Temporary Datastore (Redis)

PCF Metrics uses Redis to temporarily store ingested data from the Loggregator Firehose and to cache data queried by the Metrics API. The former prevents major loss of metrics and logs when the data stores (Elasticsearch and MySQL) are unavailable. The latter can speed up front-end queries. See PCF Metrics Product Architecture for more information.

Considerations for Scaling

The default Redis configuration for your deployment size in Suggested Sizing by Deployment Size above should work for most cases. Redis stores all data in memory, so if your deployment requires it, consider scaling up the RAM for your Redis instances. You can also increase the number of Redis instances to 2 if you need HA behavior during Redis upgrades.
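
To check whether Redis memory is under pressure before scaling, one option is to inspect its memory stats from the Redis VM. This is a sketch that assumes shell access to the Redis VM, for example over bosh ssh, and that you have the Redis password for the node (shown as a placeholder):

    # Report used memory, peak memory, and the configured maxmemory limit.
    # REDIS-PASSWORD is a placeholder for the actual node password.
    $ redis-cli -a REDIS-PASSWORD info memory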

Procedure for Scaling

Follow these steps to configure the size of the Redis VM for the temporary datastore.

Note: In the case that the temporary datastore becomes full, Redis uses the volatile-ttl eviction policy to continue storing incoming logs. For more information, see Eviction policies in Using Redis as an LRU cache.

  1. Navigate to the Ops Manager Installation Dashboard and click the Metrics tile.
  2. From the Settings tab, click Resource Config.
  3. Locate the Redis job and select the dropdown menu under Instances to scale Redis up or down.
  4. Click Save.

Scale the Ingestor, Logqueues, and Metrics API

The procedures for scaling the Ingestor, Elasticsearch logqueue, MySQL logqueue, and Metrics API instances are similar.

  • Ingestor — PCF Metrics deploys the Ingestor as an app, metrics-ingestor, within PCF. The Ingestor consumes logs and metrics from the Loggregator Firehose, sending metrics and logs to their respective Logqueue apps.

    To customize PCF Metrics for high capacity, you can scale the number of Ingestor app instances and increase the amount of memory per instance.

  • Logqueues — PCF Metrics deploys a MySQL Logqueue and an Elasticsearch Logqueue as apps, mysql-logqueue and elasticsearch-logqueue, within PCF. The MySQL logqueue consumes metrics from the Ingestor and forwards them to MySQL. The Elasticsearch logqueue consumes logs from the Ingestor and forwards them to Elasticsearch.

    To customize PCF Metrics for high capacity, you can scale the number of Logqueue app instances and increase the amount of memory per instance.

    The number of MySQL and Elasticsearch logqueues you need depends on the rate at which the Ingestor forwards logs and metrics. As a general rule:

    • For every 45,000 logs per minute, add 2 Elasticsearch logqueues.
    • For every 17,000 metrics per minute, add 1 MySQL logqueue.

    The above is a general estimate. You might need fewer instances depending on your deployment. To optimize resource allocation, provision fewer instances initially and increase instances until you achieve desired performance.

  • Metrics API — PCF Metrics deploys the app, metrics, within PCF.

Refer to this table to determine how many instances you need for each component.

Item                            Small                       Medium                      Large
Ingestor instance count         Number of Doppler servers   Number of Doppler servers   Number of Doppler servers
MySQL logqueue instance count   1                           1                           2
ES logqueue instance count      1                           2                           3
Metrics API instance count      1                           2                           2

Find the number of Doppler servers in the Resource Config pane of the Pivotal Elastic Runtime tile.
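
As a rough check on the logqueue rule of thumb above, you can plug your own observed rates into a short script. The rates below are example values only:

    LOGS_PER_MIN=90000       # observed logs per minute (example value)
    METRICS_PER_MIN=34000    # observed metrics per minute (example value)

    # 2 Elasticsearch logqueues per 45,000 logs/min, rounded up
    ES_LOGQUEUES=$(( ( (LOGS_PER_MIN + 44999) / 45000 ) * 2 ))
    # 1 MySQL logqueue per 17,000 metrics/min, rounded up
    MYSQL_LOGQUEUES=$(( (METRICS_PER_MIN + 16999) / 17000 ))

    echo "Elasticsearch logqueue instances: ${ES_LOGQUEUES}"
    echo "MySQL logqueue instances: ${MYSQL_LOGQUEUES}"

As noted above, treat the result as a starting point and provision fewer instances if your deployment does not need them.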

Considerations for Scaling

Pivotal recommends starting with the configuration for your deployment size in Suggested Sizing by Deployment Size above, evaluating performance over a period of time, and scaling up if performance degrades.

Procedure for Scaling

WARNING! If you decrease the number of instances, you might lose data currently being processed on the instances you eliminate.

After determining the number of instances needed for your deployment, perform the following steps to scale:

  1. Target your Cloud Controller with the Cloud Foundry Command Line Interface (cf CLI). If you have not installed the cf CLI, see Installing the cf CLI.

    $ cf api api.YOUR-SYSTEM-DOMAIN
    Setting api endpoint to api.YOUR-SYSTEM-DOMAIN...
    OK
    API endpoint:   https://api.YOUR-SYSTEM-DOMAIN (API version: 2.54.0)
    Not logged in. Use 'cf login' to log in.
    

  2. Log in with your UAA administrator credentials. To retrieve these credentials, navigate to the Pivotal Elastic Runtime tile in the Ops Manager Installation Dashboard and click Credentials. Under UAA, click Link to Credential next to Admin Credentials and record the password.

    $ cf login
    API endpoint: https://api.YOUR-SYSTEM-DOMAIN

    Email> admin
    Password>
    Authenticating...
    OK

  3. When prompted, target the metrics space.

    Targeted org system

    Select a space (or press enter to skip):
    1. system
    2. notifications-with-ui
    3. autoscaling
    4. metrics

    Space> 4
    Targeted space metrics

    API endpoint:   https://api.YOUR-SYSTEM-DOMAIN (API version: 2.54.0)
    User:           admin
    Org:            system
    Space:          metrics

  4. List the apps that are running in the metrics space.

    $ cf apps
    Getting apps in org system / space metrics as admin...
    OK
    name                     requested state   instances   memory   disk   urls
    elasticsearch-logqueue   started           1/1         256M     1G
    metrics                  started           1/1         1G       2G     metrics.YOUR-SYSTEM-DOMAIN/api/v1
    metrics-ingestor         started           1/1         256M     1G
    metrics-ui               started           1/1         64M      1G     metrics.YOUR-SYSTEM-DOMAIN
    mysql-logqueue           started           1/1         512M     1G

  5. Scale the app to the desired number of instances:

    cf scale APP-NAME -i INSTANCE-NUMBER

    where APP-NAME is elasticsearch-logqueue, metrics, metrics-ingestor, or mysql-logqueue.
    For example, to scale all the apps:

    $ cf scale elasticsearch-logqueue -i 2
    $ cf scale metrics -i 2
    $ cf scale metrics-ingestor -i 2
    $ cf scale mysql-logqueue -i 2

  6. Evaluate the CPU and memory load on the instances:

    cf app APP-NAME

    For example,

    $ cf app metrics-ingestor
    Showing health and status for app metrics-ingestor in org system / space metrics as admin...
    OK
    
    requested state: started
    instances: 1/1
    usage: 1G x 1 instances
    urls:
    last uploaded: Sat Apr 23 16:11:29 UTC 2016
    stack: cflinuxfs2
    buildpack: binary_buildpack

         state     since                    cpu    memory        disk          details
    #0   running   2016-07-21 03:49:58 PM   2.9%   13.5M of 1G   12.9M of 1G

  7. If your average memory usage exceeds 50% or your CPU consistently averages over 85%, add more instances with cf scale APP-NAME -i INSTANCE-NUMBER.

    In general, you should scale the app by adding additional instances. However, you can also scale the app by increasing the amount of memory per instance:

    cf scale APP-NAME -m NEW-MEMORY-LIMIT
    

    For example,

    $ cf scale metrics-ingestor -m 2G

    For more information about scaling app instances, see Scaling an Application Using cf scale.
