Context

Collecting application metrics is the first step towards gaining insights into the Policy Fwk services and infrastructure from the point of view of Availability, Performance, Reliability and Scalability.

The goal of monitoring is to meet the following operational needs:

  • Monitoring via dashboards: Provide a visual aid that displays health and key metrics for use by OPS.
  • Alerting: Something is broken and must be addressed immediately, or something might break soon and proactive measures are taken to avoid that situation.
  • Conducting retrospective analysis: Rich information that is readily available to better troubleshoot issues.
  • Analyzing trends: How fast is usage growing? What does the incoming traffic look like? This helps assess the scaling needed to meet forecasted demands.

Policy Framework Key Metrics

The principles outlined in the Four Golden Signals, developed by Google Site Reliability Engineers, have been adopted to define the key metrics for the Policy Fwk components: API, PAP, Policy-Distribution, Policy-DB and the PDPs (APEX, Drools, XACML). Example PromQL sketches illustrating these signals are given after the list below.

  • Request Rate - The number of requests per second served by the Policy services (API, PAP), and the number of requests/events per second processed by the PDPs.
  • Errors - The number of those requests/events that fail.
  • Latency/Duration (expressed as a time interval) - The amount of time those requests take; for the PDPs, the relevant metrics denote event processing times.
  • Saturation - Measures the degree of fullness or % utilization of a service, emphasizing the most constrained resources: CPU, memory, I/O, or custom metrics per domain.
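As an illustration of how these signals can be expressed, the PromQL sketches below use the standard JVM/process metrics listed in the next section together with hypothetical request metrics (policy_http_requests_total, policy_http_request_seconds_*); the actual metric names exposed by each component may differ.

# Request rate: requests per second (hypothetical counter)
sum(rate(policy_http_requests_total[1m]))

# Errors: fraction of requests with a non-2xx status (hypothetical counter)
sum(rate(policy_http_requests_total{status!~"2.."}[1m])) / sum(rate(policy_http_requests_total[1m]))

# Latency: average request duration (hypothetical summary/histogram)
rate(policy_http_request_seconds_sum[1m]) / rate(policy_http_request_seconds_count[1m])

# Saturation: CPU consumed as a fraction of one core, and heap usage ratio
rate(process_cpu_seconds_total[1m])
jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"}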

System Metrics that apply to all Policy components

These metrics have been available and exposed via a Prometheus endpoint since the Istanbul release.

Note: Standard metrics are already exposed for Policy DB (MariaDB) via common charts.

Metric and corresponding Prometheus query:

  • Memory usage: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} * 100
  • CPU usage: rate(process_cpu_seconds_total[30s]) * 100
  • JVM threads: jvm_threads_current, jvm_threads_daemon
  • Process uptime: process_start_time_seconds
  • Garbage collectors:
      GCs per second: rate(jvm_gc_collection_seconds_count[1m])
      Average GC time: rate(jvm_gc_collection_seconds_sum[1m]) / rate(jvm_gc_collection_seconds_count[1m])
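For illustration, the metrics above can be combined into simple alert-style expressions; the thresholds below are placeholders to be tuned per deployment.

# Process restarted recently (uptime under 5 minutes)
time() - process_start_time_seconds < 300

# Heap usage above 90% of the configured maximum
jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.9

# More than 10% of wall-clock time spent in garbage collection over the last 5 minutes
rate(jvm_gc_collection_seconds_sum[5m]) > 0.1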

Note: SSL certificate expiry is a key metric to alert on; however, it can be handled outside the scope of the Policy Fwk.

Key metrics for Policy API

  • Availability of the policy-api service (metric available: Yes; exposed via Prometheus: Yes): Exposed by the policy-api healthcheck and the policy-pap consolidated healthcheck.
  • Latency (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all CRUD endpoints exposed by policy-api. Sample s3p numbers are available from the policy-api stress tests.
  • Successful API request counter (metric available: Yes; exposed via Prometheus: Yes): Prometheus query for the number of successful API calls per minute.
  • Failed API request counter (metric available: Yes; exposed via Prometheus: Yes): Prometheus query for the number of API calls with a non-2xx status code per minute.
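A possible shape for these per-minute queries, assuming the success/failure counts are exported as a single counter with a status label (the metric name policy_api_http_requests_total is illustrative, not the actual exported name):

# Successful API calls per minute
sum(rate(policy_api_http_requests_total{status=~"2.."}[1m])) * 60

# Failed API calls (non-2xx) per minute
sum(rate(policy_api_http_requests_total{status!~"2.."}[1m])) * 60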

Key metrics for Policy PAP

  • Availability of the policy-pap service (metric available: Yes; exposed via Prometheus: Yes): Exposed by the policy-pap healthcheck API.
  • Successful API request counter (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-pap. Sample s3p numbers are available from the policy-pap stress tests.
  • Failed API request counter (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-pap. Number of API calls with a non-2xx status code per minute.
  • Latency (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-pap.
  • Policy deployment statistics (metric available: Yes; exposed via Prometheus: Yes): policyDeployFailureCount, policyDeploySuccessCount, totalPolicyDeployCount.

Sample:

GET /policy/pap/v1/statistics
{
    "code": 200,
    "policyDeployFailureCount": 0,
    "policyDeploySuccessCount": 0,
    "policyDownloadFailureCount": 0,
    "policyDownloadSuccessCount": 0,
    "totalPdpCount": 0,
    "totalPdpGroupCount": 0,
    "totalPolicyDeployCount": 0,
    "totalPolicyDownloadCount": 0
}
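If the deployment counters above are exported to Prometheus under the same names (an assumption; the exported names may differ), a deployment failure ratio could be derived as follows:

# Share of policy deployments that failed over the last 15 minutes
increase(policyDeployFailureCount[15m]) / increase(totalPolicyDeployCount[15m])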

Key metrics for Policy Distribution

  • Availability of the policy-distribution service (metric available: Yes; exposed via Prometheus: Yes): Exposed by the policy-distribution healthcheck and the consolidated policy-pap healthcheck.
  • Successful API request counter (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-distribution. Sample s3p numbers are available from the policy-distribution stress tests.
  • Failed API request counter (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-distribution. Number of API calls with a non-2xx status code per minute.
  • Latency (metric available: Yes; exposed via Prometheus: Yes): To be implemented for all the endpoints exposed by policy-distribution.
  • Policy distribution statistics (metric available: Yes; exposed via Prometheus: Yes): distributions, distribution_complete_ok, distribution_complete_fail, downloads, downloads_ok, downloads_error.
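Assuming the distribution counters listed above are exposed under these names on the Prometheus endpoint, failure ratios can be derived with queries such as:

# Share of distributions that did not complete successfully over the last hour
increase(distribution_complete_fail[1h]) / increase(distributions[1h])

# Download error ratio over the last hour
increase(downloads_error[1h]) / increase(downloads[1h])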

Key metrics for Policy APEX PDP

  • Availability of policy-apex-pdp (metric available: Yes; exposed via Prometheus: Yes): Exposed by the policy-apex-pdp healthcheck and the policy-pap consolidated healthcheck.
  • TOSCA policy deployment counter, per apex-pdp instance (metric available: Yes; exposed via Prometheus: Yes): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount. Exposed by the policy-pap statistics API. Sample:
GET /policy/pap/v1/statistics/defaultGroup/apex
{
  "defaultGroup": {
    "apex": [
      {
        "pdpInstanceId": "devdev-policy-apex-pdp-0",
        "timeStamp": "2021-09-07T20:10:52.242Z",
        "pdpGroupName": "defaultGroup",
        "pdpSubGroupName": "apex",
        "policyDeployCount": 2,
        "policyDeploySuccessCount": 2,
        "policyDeployFailCount": 0,
        "policyExecutedCount": 0,
        "policyExecutedSuccessCount": 0,
        "policyExecutedFailCount": 0,
        "engineStats": [
          {
            "engineId": "NSOApexEngine-0:0.0.1",
            "engineWorkerState": "READY",
            "engineTimeStamp": 1630550345549,
            "eventCount": 0,
            "lastExecutionTime": 0,
            "averageExecutionTime": 0,
            "upTime": 0,
            "lastEnterTime": 0,
            "lastStart": 1630550345549
          },
          ......
        ]
      }
    ]
  }
}

  • TOSCA policy execution counter, per apex-pdp instance (metric available: Yes; exposed via Prometheus: Yes): number of policies executed, number executed with a success status, number executed with a failure status.
  • Engine stats, by engineId per apex-pdp instance (metric available: Yes; exposed via Prometheus: Yes):
      eventCount: number of APEX events processed
      engineWorkerState: possible values defined in AxEngineState
      averageExecutionTime: average time taken to process an APEX policy
      lastExecutionTime: time taken to process the last APEX policy
      lastStart: time at which the policy engine was last started; uptime is derived from this metric
  • Latency (metric available: Yes; exposed via Prometheus: Yes): Time taken to process an incoming APEX event. Note: the statistics currently expose the execution time for processing an APEX policy, which serves as a measure of system saturation and is sufficient.
  • Kafka consumer lag (metric available: No; exposed via Prometheus: No): Can be implemented outside of the Policy Fwk. Monitor consumer lag increase for the kafka/dmaap-message-router topics related to apex-pdp.
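Kafka consumer lag is not exposed by the Policy Fwk itself; if a Kafka exporter such as kafka_exporter is deployed alongside the cluster (an assumption), the lag of the apex-pdp consumer group could be watched with a query along these lines (the consumer group value is a placeholder):

# Consumer lag for the apex-pdp consumer group, per topic
sum by (topic) (kafka_consumergroup_lag{consumergroup="apex-pdp-group"})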

Key metrics for Policy Drools PDP

*Note: Drools PDP counters are exposed on a per-control-loop implementation basis.

  • Availability of policy-drools-pdp (metric available: Yes; exposed via Prometheus: No): Exposed by the policy-drools-pdp healthcheck and the policy-pap consolidated healthcheck, and also via the telemetry feature-lifecycle status API:
http://localhost:9696/policy/pdp/engine> get /policy/pdp/engine/lifecycle/state
HTTP/1.1 200 OK
Content-Length: 8
Content-Type: application/json
Date: Thu, 11 Nov 2021 16:36:13 GMT
Server: Jetty(9.4.33.v20201020)

"ACTIVE"

  • Policy deployment counter, per drools-pdp instance (metric available: Yes; exposed via Prometheus: No): policyDeployCount, policyDeploySuccessCount, policyDeployFailCount.
Sample:

GET /policy/pap/v1/statistics/defaultGroup/drools
{
   "defaultGroup":{
      "drools":[
         {
            "pdpInstanceId":"dev-policy-drools-pdp-0",
            "timeStamp":"2021-09-07T20:09:34.160Z",
            "pdpGroupName":"defaultGroup",
            "pdpSubGroupName":"drools",
            "policyDeployCount":54,
            "policyDeploySuccessCount":54,
            "policyDeployFailCount":0,
            "policyExecutedCount":1,
            "policyExecutedSuccessCount":1,
            "policyExecutedFailCount":0,
            "engineStats":[

            ]
         }
      ]
   }
}

  • Policy execution counter, per drools-pdp instance (metric available: Yes; exposed via Prometheus: No): policyExecutedCount, policyExecutedSuccessCount, policyExecutedFailCount.
  • Latency (metric available: No; exposed via Prometheus: No): Time taken for an incoming event to be processed by the drools controller.
  • Count of Drools facts (metric available: No; exposed via Prometheus: No): An ever-increasing number of drools facts can lead to an out-of-memory condition.
  • Kafka consumer lag (metric available: No; exposed via Prometheus: No): Can be implemented external to the Policy Fwk. Monitor consumer lag increase for the kafka/dmaap-message-router topics related to drools.
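Neither latency nor the fact count is exposed today. As a sketch, if a gauge for the fact count were added per controller (the metric name drools_controller_fact_count below is purely hypothetical), unbounded growth could be alerted on with a query like:

# Fact count large and still growing over the last 6 hours (hypothetical gauge, placeholder threshold)
drools_controller_fact_count > 100000 and deriv(drools_controller_fact_count[6h]) > 0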

Key metrics for Policy XACML PDP

TODO: The statistics exposed can be more granular

  • Availability of policy-xacml-pdp (metric available: Yes; exposed via Prometheus: No): Exposed by the policy-pap consolidated healthcheck; also exposed by the XACML healthcheck API:
GET /policy/pdpx/v1/healthcheck
~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/healthcheck' --header 'Authorization: Basic ****'
{
  "name": "Policy Xacml PDP",
  "url": "self",
  "healthy": true,
  "code": 200,
  "message": "alive"
}

  • Policy deployment counter (metric available: Yes; exposed via Prometheus: No): totalPoliciesCount, totalPolicyTypesCount. Exposed by the XACML PDP statistics API:

GET /policy/pdpx/v1/statistics
~ $ curl --location --request GET 'http://policy-xacml-pdp.fra-fireants-dev.svc.cluster.local:6969/policy/pdpx/v1/statistics' --header 'Authorization: Basic ****'
{
  "code": 200,
  "totalPolicyTypesCount": 18,
  "totalPoliciesCount": 1,
  "totalErrorCount": 0,
  "permitDecisionsCount": 0,
  "denyDecisionsCount": 0,
  "indeterminantDecisionsCount": 0,
  "notApplicableDecisionsCount": 1
}

  • Policy execution error counter (metric available: Yes; exposed via Prometheus: No): totalErrorCount.
  • Policy execution success counter by type (metric available: Yes; exposed via Prometheus: No): permitDecisionsCount, denyDecisionsCount, indeterminantDecisionsCount, notApplicableDecisionsCount.
  • Latency (metric available: No; exposed via Prometheus: No): Time taken for an incoming event to be processed via the XACML policies.
  • Kafka consumer lag (metric available: No; exposed via Prometheus: No): Can be implemented external to the Policy Fwk. Monitor consumer lag increase for the kafka/dmaap-message-router topics related to XACML.
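Assuming the decision and error counters above are exported to Prometheus as counters under the same names (the exported names may differ), error and indeterminant decision rates could be tracked with:

# XACML application errors per minute
rate(totalErrorCount[1m]) * 60

# Indeterminant decisions per minute
rate(indeterminantDecisionsCount[1m]) * 60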

2 Comments

  1. Nice job Rashmi - a few comments:

    • For the JVM stats (CPU, memory, etc.): can we use them in terms of percentage rather than absolute values? They will vary install by install depending on the resource configuration allocated to the pods. There is another level of monitoring at the Kubernetes level that should be watching pod resources as well.
    • For the API: similar to the PDPs, instead of keeping track of rates, it can simply count successes and failures in absolute terms. This will translate into a Prometheus counter, and rate queries against Prometheus can then derive the rate for any given time interval. It would also be good to have a distinction between the types of queries and operations against the api component, since we know that some of them are very "expensive" to run.
    • For apex: I have an observation about the engineStats substructure of the statistics message; the PdpStatistics message is a global data structure, but "EngineStats" is very apex oriented, so reuse across drools or xacml is not very applicable.
    • For drools: we could keep counts (success/failure/latency) on a granular, per-control-loop basis.
    • For xacml: we could have more granularity in terms of the xacml application.
    1. Thank you so much for your feedback, Jorge Hernandez.

      1. System metrics (JVM stats, CPU, memory): Indeed, I will correct the Prometheus query; I actually forgot to change it to % for some of those metrics.
      2. API: That is true. It seems I have mixed up the Metric column between the metrics exposed by the application and the intended Prometheus query. As you mention, if we have the counters, a rate query can easily be built using them. Will clean it up.
      3. APEX: Yes engineStats are very specific to APEX and do not apply to the other PDPs. I also think some of those attributes can be used to update the global PdpStatistics data structure correctly. I have retained it only for APEX because those metrics are exposed today and maybe some of those are useful to monitor.
      4. Drools: If we can have counters per control loop instance, that would be great. I will add that detail in. And if it makes sense, we can simply include a sum of all the counters as applicable using Prometheus.
      5. XACML: I am not very familiar with XACML, so my approach towards this was to look at what is currently exposed and look for gaps if any. If you see that something could be added or removed, we can work together to define those as metrics. This will be a live page anyway and can be updated.