Baseline Measure of Failed Requests for a Component Failure within a Site

The rule management module of Holmes is mainly used for user interactions, especially for the lifecycle control of rules. Since this module is not used very frequently and rule deployment is usually done manually, there are not many requests sent to it. Besides, all requests could be re-sent once the module recovers. Hence, for this module, the request failures during a component failure can reasonably be ignored.

As for the engine management module, since it is the core module of Holmes and is mainly responsible for alarm processing, its request loss can be considered the same as data loss. This will be evaluated in the next section.

Baseline of Data Loss for a Component Failure within a Site

The average start-up time of Holmes is around 20 seconds. All the baseline estimates below are based on this approximation.

Because the performance of Holmes differs greatly with and without A&AI, both scenarios should be taken into account when identifying the baseline.

Data Analysis with AAI

According to the testing report provided on the Testing Results page, when AAI is involved, the correlation analysis of alarms is extremely slow. Even if we take the best performance data as a reference, the data loss during the start-up of Holmes is

1.73 alarms/s * 20s = 34.6 alarms


Data Analysis without AAI

Now, let's move on to the scenario in which AAI is not involved in the correlation analysis.

According to the Testing Results, the peak rate of alarm processing is around 350 alarms per second. Based on this approximation, the data loss under such a scenario is

350 alarms/s * 20s = 7000 alarms
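
For reference, the two estimates above can be reproduced with the simple calculation sketched below (illustrative Python only; the rates and the start-up time are just the approximations quoted on this page, not measured values from the code base):

    # Illustrative sketch only -- not part of the Holmes code base.
    STARTUP_TIME_S = 20       # average start-up time of Holmes (seconds)
    RATE_WITH_AAI = 1.73      # alarms/s, best observed rate when AAI is involved
    RATE_WITHOUT_AAI = 350    # alarms/s, peak rate when AAI is not involved

    def estimated_loss(rate_alarms_per_s, downtime_s=STARTUP_TIME_S):
        # Alarms arriving while the engine management module is restarting are lost.
        return rate_alarms_per_s * downtime_s

    print("With AAI:    ~%.1f alarms lost" % estimated_loss(RATE_WITH_AAI))     # ~34.6
    print("Without AAI: ~%.0f alarms lost" % estimated_loss(RATE_WITHOUT_AAI))  # ~7000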

From the figures above, we can see that the data loss could be far beyond what users would accept. So, in the future, the Holmes team has to optimize the component to reduce data loss.

Here are two suggestions:

  1. Cache the AAI data and refresh them periodically so that Holmes won't have to make an HTTP call to AAI every time it tries to correlate one alarm to another (a rough sketch is given after this list).
  2. DCAE and its subordinate systems (e.g. PNFs, VNFs, etc.) should provide a data synchronization mechanism to ensure that components can fetch the data automatically after they are deployed or restarted.
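
A minimal sketch of what the cache in suggestion 1 might look like is shown below. The class and method names are hypothetical and are not part of the Holmes code base; note that the comment further down also points out the staleness risk such a cache would introduce.

    # Hypothetical sketch of suggestion 1 -- not part of the Holmes code base.
    import threading
    import time

    class AaiCache:
        """Keeps a local copy of AAI topology data and refreshes it periodically."""

        def __init__(self, fetch_from_aai, refresh_interval_s=300):
            self._fetch_from_aai = fetch_from_aai   # callable performing the real HTTP call to AAI
            self._refresh_interval_s = refresh_interval_s
            self._lock = threading.Lock()
            self._data = {}

        def start(self):
            # Load the data once at start-up, then keep refreshing in the background.
            self._refresh()
            threading.Thread(target=self._refresh_loop, daemon=True).start()

        def _refresh_loop(self):
            while True:
                time.sleep(self._refresh_interval_s)
                self._refresh()

        def _refresh(self):
            data = self._fetch_from_aai()
            with self._lock:
                self._data = data

        def get(self, resource_id):
            # The correlation engine reads from the local copy instead of calling AAI per alarm.
            with self._lock:
                return self._data.get(resource_id)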

1 Comment

    1. Cache the AAI data and refresh them periodically so that Holmes won't have to make an HTTP call to AAI every time it tries to correlate one alarm to another.

    Guangrong Fu,

    The problem with caching is knowing when to update the cached data. Even though the access time may be fast for Holmes, the risk is using out-of-date data, in which case the correlations will be wrong anyway. Also, duplicating the AAI data outside of AAI is probably a bad architectural decision. Making AAI faster for these use cases would be better.

    Has there been a performance analysis of where the time is spent? Could it help to use ElasticSearch (e.g. as in sparky)? Should Holmes have a batch interface to get more AAI data in fewer calls? Or a better correlation API that results in fewer calls?

    Would it be worthwhile to discuss at the AAI meeting, James Forsyth?