This page documents the current draft of requirements that are of interest to ONAP operators as they move ONAP into production.

Inputs

Notes from the Tuesday September 26th discussion at ONAP meeting

R2 Proposed Non-Functional Requirements

Draft Architecture Principles

ATT Review of ONAP Carrier Grade Requirements.pptx

Presentations

ONAP-Carrier Grade for TSC 19October2017.pptx

Software Architecture 11December2017.pdf

Approved Platform Maturity Requirements for Beijing

Platform Maturity Level proposal 13Dec2017v2.pdf

(Approved by the TSC at the Santa Clara meeting)


General Approach

  • The goal of this effort is to define requirements that enable ONAP for carrier implementations.  It is not to deliver a specified carrier-grade configuration of ONAP, but to build all the software hooks necessary for an operator to deliver a five-9's carrier-grade environment at their own expense.
  • Process
    • For each category of carrier-grade requirements, multiple levels of requirements will be established and presented to the TSC.
    • The Architecture Committee, in cooperation with the project teams, will establish guidelines for requirement levels that must be met by each project for each release.  The required level may be influenced by: MVP project status, desired project maturity level, release inclusion, component criticality (run-time vs. design time).

Performance

  • Level 0: no performance testing done
  • Level 1: baseline performance criteria identified and measured (such as response time, transaction/message rate, latency, footprint, etc., to be defined per component)
  • Level 2: performance improvement plan created & implemented for 1 release (improvement measured for equivalent functionality & equivalent hardware)
  • Level 3: performance improvement plan implemented for 2 consecutive releases (improvements in each release)
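
For illustration only, the sketch below (Python) shows one way a project could capture a Level 1 baseline: it measures response-time percentiles and transaction rate against a component API. The endpoint URL and sample count are placeholders, not part of the requirement.

  import statistics
  import time
  import urllib.request

  # Hypothetical component endpoint; replace with the API under test.
  ENDPOINT = "http://localhost:8080/healthcheck"
  SAMPLES = 1000

  def measure_baseline():
      """Collect per-request latency and derive simple baseline metrics."""
      latencies = []
      start = time.monotonic()
      for _ in range(SAMPLES):
          t0 = time.monotonic()
          with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
              resp.read()
          latencies.append(time.monotonic() - t0)
      elapsed = time.monotonic() - start
      latencies.sort()
      return {
          "requests_per_second": SAMPLES / elapsed,
          "latency_p50_ms": 1000 * statistics.median(latencies),
          "latency_p95_ms": 1000 * latencies[int(0.95 * SAMPLES) - 1],
          "latency_max_ms": 1000 * latencies[-1],
      }

  if __name__ == "__main__":
      print(measure_baseline())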

Stability

  • Level 0: none beyond release requirements
  • Level 1: 72 hour component-level soak test (random test transactions with 80% code coverage; steady load)
  • Level 2: 72 hour platform-level soak test (random test transactions with 80% code coverage; steady load)
  • Level 3: track record over 6 months of reduced defect rate
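
A minimal sketch of a soak-test driver in the spirit of the 72-hour tests above: it issues randomly chosen test transactions at a steady rate for a fixed duration and counts failures. The transaction URLs, rate and duration are placeholders; code-coverage measurement would come from the component's own test tooling.

  import random
  import time
  import urllib.request

  # Hypothetical test transactions; in practice these would be drawn from the
  # component's API test suite so that ~80% code coverage is exercised.
  TRANSACTIONS = [
      "http://localhost:8080/healthcheck",
      "http://localhost:8080/api/v1/resource/1",
      "http://localhost:8080/api/v1/resource/2",
  ]
  DURATION_S = 72 * 3600   # 72-hour soak
  RATE_PER_S = 10          # steady load

  def soak():
      failures = 0
      total = 0
      end = time.monotonic() + DURATION_S
      while time.monotonic() < end:
          url = random.choice(TRANSACTIONS)   # random transaction mix
          total += 1
          try:
              with urllib.request.urlopen(url, timeout=10) as resp:
                  if resp.status >= 400:
                      failures += 1
          except Exception:
              failures += 1
          time.sleep(1.0 / RATE_PER_S)        # keep the load steady
      print(f"soak finished: {total} transactions, {failures} failures")

  if __name__ == "__main__":
      soak()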

Resiliency

  • Level 0: no redundancy
  • Level 1: support manual failure detection & rerouting or recovery within a single site; tested to complete in 30 minutes
  • Level 2: support automated failure detection & rerouting 
    • within a single geographic site
    • stateless components: establish baseline measure of failed requests for a component failure within a site 
    • stateful components: establish baseline of data loss for a component failure within a site
  • Level 3: support automated failover detection & rerouting 
    • across multiple sites
    • stateless components
      • improve on # of failed requests for component failure within a site
      • establish baseline for failed requests for site failure
    • stateful components
      • improve on data loss metrics for component failure within a site
      • establish baseline for data loss for site failure

  • These levels may drive the need for a common platform for resiliency & approaches to consistently provide resiliency across ONAP. Such a platform might contain: 
    1. a geo-distributed database that supports both within-site and cross-site state replication,
    2. a failover mechanism that performs failure detection, request rerouting and the actual failover, and 
    3. a site/replica selection service that picks among the appropriate replicas during request rerouting (see the sketch below).
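
To make this concrete, here is a minimal sketch (assuming hypothetical replica endpoints and a /healthcheck path) of the kind of failure detection and site/replica selection such a platform could provide: probe replicas, prefer the local site, and fail over to another site only when no local replica is healthy.

  import urllib.request

  # Hypothetical replica endpoints of one ONAP component across two sites.
  REPLICAS = {
      "site-a": ["http://a1.example:8080", "http://a2.example:8080"],
      "site-b": ["http://b1.example:8080"],
  }
  PREFERRED_SITE = "site-a"

  def is_healthy(base_url):
      """Simple failure detection: probe a health endpoint with a short timeout."""
      try:
          with urllib.request.urlopen(base_url + "/healthcheck", timeout=2) as r:
              return r.status == 200
      except Exception:
          return False

  def select_replica():
      """Pick a healthy replica, preferring the local site and failing over
      to another site only when no local replica responds."""
      sites = [PREFERRED_SITE] + [s for s in REPLICAS if s != PREFERRED_SITE]
      for site in sites:
          for replica in REPLICAS[site]:
              if is_healthy(replica):
                  return site, replica
      raise RuntimeError("no healthy replica found in any site")

  if __name__ == "__main__":
      print(select_replica())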

Security

Project-level requirements

  • Level 0: None
  • Level 1: CII Passing badge
  • Level 2: CII Silver badge, plus:
    • All internal/external system communications shall be able to be encrypted.
    • All internal/external service calls shall have common role-based access control and authorization.
  • Level 3: CII Gold badge 
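
As a rough illustration of the two Level 2 bullets, the sketch below combines an encrypted (TLS-verified) service call with a common role-based authorization check. The role table, CA bundle path and URL are placeholders; in ONAP the authorization decision would come from a common AAF-style authorization service rather than a local table.

  import ssl
  import urllib.request

  # Hypothetical role-to-permission mapping; a real deployment would query a
  # common authorization service instead of a local table.
  ROLE_PERMISSIONS = {
      "operator": {"read", "write"},
      "viewer": {"read"},
  }

  def authorize(role, permission):
      """Common role-based access check applied before any service call."""
      return permission in ROLE_PERMISSIONS.get(role, set())

  def call_service(url, role):
      if not authorize(role, "read"):
          raise PermissionError(f"role '{role}' not allowed to read")
      # Encrypted transport: verify the server certificate against a CA bundle
      # (path shown is a placeholder).
      ctx = ssl.create_default_context(cafile="/etc/onap/ca-bundle.pem")
      with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
          return resp.read()

  if __name__ == "__main__":
      print(call_service("https://so.example:8443/api/v1/status", "viewer")[:100])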

ONAP Platform-level requirements per release 

  • Level 1: 70% of projects passing level 1 (CII Passing badge)
    • with the non-passing projects at least 80% of the way to the passing level
    • Non-passing projects MUST pass the specific cryptography criteria outlined by the Security Subcommittee*
  • Level 2: 70% of projects passing the silver level
    • with non-silver projects having completed the passing level and being at least 80% of the way to silver
  • Level 3: 70% of projects passing the gold level
    • with non-gold projects having achieved the silver level and being at least 80% of the way to gold
  • Level 4: 100% of projects passing gold
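
A small sketch making the per-release arithmetic concrete: given each project's badge progress, it checks whether the platform meets Level 1. The project names and fractions are purely illustrative.

  # Illustrative only: per-project CII badge progress, expressed as the fraction
  # of "passing", "silver" and "gold" criteria met (1.0 = badge achieved).
  projects = {
      "so":   {"passing": 1.0, "silver": 0.6, "gold": 0.1},
      "sdnc": {"passing": 1.0, "silver": 0.9, "gold": 0.2},
      "aai":  {"passing": 0.85, "silver": 0.3, "gold": 0.0},
  }

  def platform_meets_level_1(projects, crypto_ok=lambda name: True):
      """Level 1: >=70% of projects hold the passing badge; every other project
      is >=80% of the way to passing and meets the cryptography criteria."""
      passing = [p for p, s in projects.items() if s["passing"] >= 1.0]
      others = [p for p in projects if p not in passing]
      return (len(passing) / len(projects) >= 0.7
              and all(projects[p]["passing"] >= 0.8 and crypto_ok(p) for p in others))

  # False for the sample data above: only 2 of 3 projects hold the passing badge.
  print(platform_meets_level_1(projects))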


Scalability

  • Level 0: no ability to scale
  • Level 1: supports single site horizontal scale out and scale in, independent of other components
  • Level 2: supports geographic scaling, independent of other components
  • Level 3: support scaling (interoperability) across multiple ONAP instances
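
As an illustration of Level 1, the sketch below sizes the replica count of a single stateless component from its own measured load, independent of any other component (the same ceil(current * metric / target) formula the Kubernetes Horizontal Pod Autoscaler uses). The thresholds and limits are placeholders.

  import math

  def desired_replicas(current_replicas, avg_load, target_load=0.6,
                       min_replicas=1, max_replicas=10):
      """Horizontal scale-out/in decision for one stateless component,
      sized independently of other ONAP components."""
      if avg_load <= 0:
          return min_replicas
      wanted = math.ceil(current_replicas * avg_load / target_load)
      return max(min_replicas, min(max_replicas, wanted))

  print(desired_replicas(current_replicas=3, avg_load=0.9))   # scale out to 5
  print(desired_replicas(current_replicas=5, avg_load=0.2))   # scale in to 2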


Manageability

  • Level 1:
    • All ONAP components will use a single logging system.
    • Instantiation of a simple ONAP system should be accomplished in <1 hour with a minimal footprint
  • Level 2:
    • A component can be independently upgraded without impacting the operation of interacting components
    • Transaction tracing across components
    • Component configuration to be externalized in a common fashion across ONAP projects
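
A minimal sketch of the Level 2 "externalized configuration" idea: defaults are overridden first by a mounted configuration file (e.g. a Kubernetes ConfigMap) and then by environment variables, so the same image can be reconfigured without a rebuild. The file path, variable names and defaults are placeholders.

  import json
  import os

  # Hypothetical defaults; overridden first by a mounted config file, then by
  # environment variables, so the same image works unchanged in any deployment.
  DEFAULTS = {"log_level": "INFO", "db_host": "localhost", "db_port": 5432}
  CONFIG_FILE = os.environ.get("ONAP_CONFIG_FILE", "/etc/onap/component.json")

  def load_config():
      cfg = dict(DEFAULTS)
      if os.path.exists(CONFIG_FILE):            # e.g. a ConfigMap mount
          with open(CONFIG_FILE) as f:
              cfg.update(json.load(f))
      for key in DEFAULTS:                        # env vars win, e.g. ONAP_DB_HOST
          env_value = os.environ.get("ONAP_" + key.upper())
          if env_value is not None:
              cfg[key] = type(DEFAULTS[key])(env_value)
      return cfg

  print(load_config())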


Usability

  • Level 1
    • User guide created
    • Deployment documentation
    • API documentation
    • Adherence to coding guidelines
  • Level 2
    • Consistent UI across ONAP projects
    • Usability testing conducted
    • Tutorial documented


*Specific cryptography requirements for security level 1:

  • The software produced by the project MUST use, by default, only cryptographic protocols and algorithms that are publicly published and reviewed by experts (if cryptographic protocols and algorithms are used).
  • If the software produced by the project is an application or library, and its primary purpose is not to implement cryptography, then it SHOULD only call on software specifically designed to implement cryptographic functions; it SHOULD NOT re-implement its own.
  • The security mechanisms within the software produced by the project MUST use default keylengths that at least meet the NIST minimum requirements through the year 2030 (as stated in 2012). It MUST be possible to configure the software so that smaller keylengths are completely disabled.
  • The default security mechanisms within the software produced by the project MUST NOT depend on broken cryptographic algorithms (e.g., MD4, MD5, single DES, RC4, Dual_EC_DRBG) or use cipher modes that are inappropriate to the context (e.g., ECB mode is almost never appropriate because it reveals identical blocks within the ciphertext as demonstrated by the ECB penguin, and CTR mode is often inappropriate because it does not perform authentication and causes duplicates if the input state is repeated).
  • The default security mechanisms within the software produced by the project SHOULD NOT depend on cryptographic algorithms or modes with known serious weaknesses (e.g., the SHA-1 cryptographic hash algorithm or the CBC mode in SSH).
  • If the software produced by the project causes the storing of passwords for authentication of external users, the passwords MUST be stored as iterated hashes with a per-user salt by using a key stretching (iterated) algorithm (e.g., PBKDF2, Bcrypt or Scrypt).
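
A minimal sketch of the last requirement (per-user salt plus an iterated key-stretching algorithm), using PBKDF2 from the Python standard library; the iteration count shown is illustrative and should follow current guidance.

  import hashlib
  import hmac
  import os

  ITERATIONS = 600_000   # illustrative work factor; tune to current guidance

  def hash_password(password):
      """Store only (salt, derived_key), never the plaintext password."""
      salt = os.urandom(16)                      # per-user random salt
      key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
      return salt, key

  def verify_password(password, salt, stored_key):
      candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
      return hmac.compare_digest(candidate, stored_key)

  salt, key = hash_password("correct horse battery staple")
  print(verify_password("correct horse battery staple", salt, key))   # True
  print(verify_password("wrong password", salt, key))                 # False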



45 Comments

  1. These are the notes I captured from the session on 9/27/17.  These are NOT my comments, but what was discussed by the attendees:

    • Don’t use Carrier-Grade term; use S3P?
    • Need to better define soak test; not just idling for that time
    • Would like to see some sequential order and common base, consolidated infrastructure and footprint; prioritize the efforts; would like to see stability as a high priority
    • How do you measure a reduced defect rate when there is new functionality rolling out
    • Availability: doesn’t like the failover time; has to be some sort of software requirement around these;
    • Availability: Source of 100ms?  None
    • Availability: Don’t assume implementation in the requirements; remove implementation wording
    • Performance: what performance is good enough?  A 5G use case might have different performance requirements
    • Availability: differentiate about server vs site failures
    • Availability: do we need a reference implementation for performance measurements?  Do we need a service-level metric?  Failure scenarios: software instance, hardware, etc.
    • Availability: multiple ways to achieve availability, not limited to failover; these aren’t availability requirements
    • Stability: need it now for Amsterdam Release
    • Availability: Five 9s is very expensive and up to the operators to implement; need the hooks in the software to allow operators to implement it, such as data replication
  2. Now, my comments, particularly on availability.  I agree with most of the comments made about the availability section.  As discussed, it will not be up to the ONAP project to have or prove 5 9's availability (as it is very expensive for infrastructure and time consuming to do availability proofs) but we need to make progress in the implementation of components to allow operators to implement ONAP to the degree of availability that they require.  So, the intention was not to test that component X could do 100ms failure detection & recovery, but that component X has the design/architecture that would support someone doing failure detection & recovery in that timeframe with the right infrastructure.  

    Ultimately, I think this means that certain ONAP components will need to implement techniques like cloud-native guidelines so that something (like Kubernetes) can do the failure detection and recovery for them.  And for stateful components, they should leverage existing distributed data stores that can support the right replication to handle site & geographic failures.

    So, how do we write the requirements to drive the projects to implementing a resilient design?
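
    As one concrete illustration of the cloud-native point above (a sketch, not a mandated approach): a component that exposes a simple health endpoint lets an orchestrator such as Kubernetes do the failure detection and restart for it. The paths and port below are placeholders.

      from http.server import BaseHTTPRequestHandler, HTTPServer

      class HealthHandler(BaseHTTPRequestHandler):
          """Minimal liveness/readiness endpoints an orchestrator (e.g.
          Kubernetes probes) can poll to detect failure and restart or
          reroute automatically."""
          def do_GET(self):
              if self.path in ("/healthz", "/readyz"):
                  # A real component would check its DB connections, queues, etc.
                  self.send_response(200)
                  self.end_headers()
                  self.wfile.write(b"ok")
              else:
                  self.send_response(404)
                  self.end_headers()

      if __name__ == "__main__":
          HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()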

    1. Jason, I could not agree more. Indeed we have requirements for cloud-native designs for VNFs. ONAP is a set of management VNFs after all! So, we can embrace some of the same cloud-native design requirements as for data/control plane VNFs. Then scalability and availability can be configured flexibly by an operator depending on their goals and affordability.

    2. It may be more helpful to think in terms of tests than requirements. Test-driven design can bake in the conformance through the CI/CD toolchain. So, for resiliency improvement as an example, a set of tests that fails every component one by one would seem to require the system to be resilient to single points of failure. The performance during failure recovery (e.g. MTTR) is then a secondary measure, assuming that the system does recover from this set of failures. We need to define other tests for the other dimensions of carrier grade - performance, stability, security, etc. 

      As the set of failures becomes large, a random element (e.g. a chaos Monkey) can provide statistical coverage of a large field of tests.
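
      A sketch of what such a fail-every-component test could look like, assuming a Docker-based lab, hypothetical container names and a placeholder platform health URL: each component is killed in random order and the time until the platform passes its health check again is recorded as a rough MTTR.

        import random
        import subprocess
        import time
        import urllib.request

        # Hypothetical container names of ONAP components deployed in the test lab.
        COMPONENTS = ["onap-so", "onap-sdnc", "onap-aai"]
        PLATFORM_HEALTH = "http://localhost:30080/healthcheck"   # placeholder URL

        def platform_healthy():
            try:
                with urllib.request.urlopen(PLATFORM_HEALTH, timeout=5) as r:
                    return r.status == 200
            except Exception:
                return False

        def fail_and_measure(container):
            """Kill one component, then measure how long the platform takes to
            pass its health check again (a rough MTTR for that single failure)."""
            subprocess.run(["docker", "kill", container], check=True)
            start = time.monotonic()
            while not platform_healthy():
                if time.monotonic() - start > 600:
                    raise AssertionError(f"platform did not recover after killing {container}")
                time.sleep(5)
            return time.monotonic() - start

        if __name__ == "__main__":
            random.shuffle(COMPONENTS)          # chaos-monkey style random order
            for name in COMPONENTS:
                print(name, "recovered in", round(fail_and_measure(name), 1), "s")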

  3. A few thoughts that might help the discussion.

    Seems like we need to make sure we also cover three things: Resiliency, Serviceability and Operability, and have a test program appropriate for each.

    We also need to assess for each component what level of availability is needed, since not everything is of the same criticality to service providers. Design-time components may have a different availability investment choice than run-time components.

    Resiliency should be both internal clustering/HA and site diversity. We can and should embrace K8 and docker for this, but it is not sufficient, since state-sensitive applications need to be engineered to take advantage of K8 without breaking.

    A testing program should be implemented to systematically fail sub-components at the docker container level and confirm resilient reaction and recovery. This would also be the program for a 72-hour, 50%-load stability test (running 72 hours with no load does not really provide much information). Stability runs generally look for memory leaks and items that are triggered in cron jobs. Finally, once HA and geo redundancy are configured and deployed in a set of test labs, we need to exercise the failover for complete sites as well.

    Serviceability is about upgrading ONAP itself by the distribution provider or service provider

    Operability should be where we address issues like logging and FCAPS for ONAP components.


    1. Thanks for the comments, Brian.  Agree that each component may have different levels of availability (or performance or other requirements) which is the motivation for having these various levels.  Then the TSC, Architecture subcommittee and the projects themselves can agree as to the appropriate level for each project.  So, VVF will not be held to the same standards as a controller, for instance.

      You bring up good points about stateful (or state sensitive) components.  It'd be great to delve into this a little further so projects can make the right decisions on how to implement their state.  I see that Kubernetes is doing work on StatefulSets, but it's still in beta.

      Love the idea of a Chaos Monkey approach to the stability testing!

      1. Jason/Brian,

        We have been working on a common resiliency platform for ONAP components, and your thoughts on the need for ONAP components to implement resiliency techniques completely resonate with us. Like Brian mentioned, the state management techniques can be a bit tricky because of the inter- and intra-site clustering requirements to satisfy certain availability needs. Even failover often involves complex distributed system concepts (failure detection, leader election, site selection), especially across multiple sites. 

        To prevent each component from hand-crafting its own solution, we believe that there is a need for a common resiliency platform that provides the basic building blocks (a multi-site geo distributed database/datastore,  a load-distribution service, failover mechanisms etc) that different ONAP components can use according to their own availability and resiliency needs. 

        I have brought this up as a non-functional requirement under R2 and hope to talk about it during the Oct 9, use-case subcommittee meeting. 

        1. Bharath, I agree with all of your comments, and that is part of the reason for the above notes about need for common tooling in the "availability" category.

          Could you share more of what you are thinking/working on?

          As for process, we're still working out how we will go from these requirements to what projects have to implement.  I believe the architecture subcommittee will be involved in some fashion, probably making a recommendation for any common resiliency platform.  I imagine that decision would look at what capabilities might already exist in the technology world or are already being worked on in ONAP (i.e. OOM).

  4. For Availability

    We should express availability in 9s (three 9s, or four 9s); that is more commonly used.  Not sure how to measure / verify it before the release.  Maybe use 9s.

    For Manageability

    • Level 2: Should be able to view a chronological sequence of all log messages for a given request or transaction ID from all modules.

    I think we should select desired platform-level goals for the next release (say scalability level 2, stability level 3, performance level 1, security level 2, etc.) and then translate them into requirements for each module.  E.g. to meet the overall platform goal, we need availability of x, stability of y, etc. for App-C.  Those values could be different for different modules. 

    In soak tests, we should run some set of test transactions in random order for the duration of the soak period. 

    1. Vimal,

      I think proving any level of availability (three 9s +) will be difficult, as those proofs usually require detailed failure mode analysis across the combination of components.  So, I wonder if we shouldn't rename the availability requirement category to resiliency and provide some requirements on levels of resiliency for the project?  Then, as operators implement ONAP, they can do the work to deliver the availability that they require.

      Agree with the approach of platform level goals for Beijing and translating to module-specific goals based on gating/non-gating and design/runtime (or other criteria).

      Good input on manageability and soak tests.  Thank you!  

    2. With so many ONAP services, traceability becomes very critical in troubleshooting - both at development time and at deployment time. I can't agree more on the need for request-ID tracing across microservices.  I feel that the request-ID chain needs to be preserved as much as possible. If a request from S1 to S2 results in requests from S2 to S3 and S2 to S4, then it should be possible to see the request path at S3 and S4.  It is normally a debate whether the request path should be maintained in HTTP request headers or as part of the payload. Payload-based request paths are getting popular due to other protocols (publish-subscribe model) and others.  It requires guidelines and standardization, and even an abstract API to append a request ID to the request-ID path, etc., across ONAP projects.
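
      A minimal sketch of the header-based variant of this idea (the header names are illustrative, not an agreed ONAP convention): the original request ID is propagated unchanged and each service appends itself to a request-path header before calling downstream.

        import urllib.request
        import uuid

        # Illustrative header names; the real names would come from the ONAP
        # logging and tracing guidelines.
        REQUEST_ID_HEADER = "X-ONAP-RequestID"
        REQUEST_PATH_HEADER = "X-ONAP-RequestPath"

        def forward(url, incoming_headers, service_name):
            """Propagate the original request ID unchanged and append this
            service to the request path, so downstream services (S3, S4) can
            see the full chain."""
            request_id = incoming_headers.get(REQUEST_ID_HEADER, str(uuid.uuid4()))
            path = incoming_headers.get(REQUEST_PATH_HEADER, "")
            path = f"{path}>{service_name}" if path else service_name
            req = urllib.request.Request(url, headers={
                REQUEST_ID_HEADER: request_id,
                REQUEST_PATH_HEADER: path,
            })
            return urllib.request.urlopen(req, timeout=5)

        # Example: S2 forwarding a request it received from S1 on to S3; S3 will
        # see the request path "S1>S2" and the original request ID "abc-123".
        # forward("http://s3.example/api", {"X-ONAP-RequestID": "abc-123",
        #                                   "X-ONAP-RequestPath": "S1"}, "S2")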

      1. The logging project is working on achieving this level of end to end trace-ability.  To achieve this they have published ONAP Application Logging Guidelines v1.1.  

      2. Perhaps something like this would turn out to be helpful here & beat reinvention - this is not necessarily easy to do?  

        https://www.cncf.io/blog/2016/10/11/opentracing-joins-the-cloud-native-computing-foundation/

        https://github.com/jaegertracing/jaeger

        1. Thanks for forwarding these links.  At a high level, they look very relevant.  Will dig deeper.

  5. Usability may also include some metrics about documentation completeness / consistency etc. The various categories of users ( operators, VNF developers etc.)  all need to figure out how to use ONAP to achieve their objectives 

  6. I am not sure if this exactly belongs here, but I would like to bring up modularity as a carrier-grade requirement. The goal would be, in the long term, to be able to substitute any component of ONAP with another open source or commercial component. This was raised in the meeting during the architecture project presentation, and Chris Donley indicated that this is a long-term goal, but I am not sure this is captured as an explicit north-star requirement anywhere or that conscious design choices are being made to enable it.

  7. A few thoughts to add here as well.

    One aspect that would be beneficial to capture as well is the independence between the modules.  That is, they can scale independently without impacting the operation of the others, and they can be upgraded independently without impacting the operation of the others.  We can capture this as one chapter, or spread it out; i.e. we could add upgradability, for example, and include the scaling independence under scaling.

    Geographic aspects slip into both scaling and availability, which opens up for confusion.  For scaling we could consider discussing making what is deployed "bigger" versus creating a new instance as different levels.  I know the boundary is a little blurry though.  For availability, we could consider describing continued operation in the case of 1, 2, x failure zones.  I don't think that the x 9s approach is fruitful either.

    While we look at high availability and high scalability, under manageability there would also be advantages to having a minimum-footprint deployment as well; this is to encourage adoption and feedback by making it easy to download and install with few resources.

  8. I'm looking at this slightly from the outside and I'm not sure I understand the performance criteria.   Level 2 seems harder to comply with than level 3, in that it requires a performance objective to be defined and delivered in the space of a single release.   In my experience, once a performance plan is defined and testing built, it tends to form part of a CI/CD framework and is automatically rolled into each subsequent release.    Perhaps it's better to think about coverage of plans, where level 2 covers a subset of ONAP functions and level 3 is a full suite-level performance plan.  

  9. On availability, do not prescribe solutions but performance metrics (e.g. an increasing number of NINES per level) - in any case, the solutions to accomplish the target performance levels are diverse, and there is no "one right solution"; it depends on the specifics of the component/subsystem implementation, and as such should be left to implementors to pick a solution best fit to their specific case. Also, the availability targets (even if expressed as NINES) can be either: a) time based (availability performance is expressed as the fraction of uptime over total time, over some specified period; if the period is not specified then the general expectation would be a rolling metric over a 12-month period, but for SLAs monthly periods are also commonly used in cloud environments), or b) request-fraction based (the metric is expressed as the fraction of successfully completed API operations for valid requests). We believe that the latter target is more applicable to most ONAP cases, and it is also the basis for the associated metrics in TL 9000 ALMA, which was developed alongside the ETSI REL group to start addressing these cases: http://www.tl9000.org/resources/documents/QuEST_Forum_ALMA_Quality_Measurement_150819.pdf (we are not proposing that the above would have to be used in a normative way, but it is a good overview of how request-based metrics could be specified).
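
    For illustration, the request-fraction metric described above boils down to a very simple calculation (the numbers below are made up):

      import math

      def availability_from_requests(successful, total_valid):
          """Request-based availability: the fraction of valid API requests
          that completed successfully, rather than a fraction of uptime."""
          return successful / total_valid

      # Illustrative numbers only: 9,999,000 of 10,000,000 valid requests succeeded.
      a = availability_from_requests(9_999_000, 10_000_000)
      print(a)                                  # 0.9999
      print(round(-math.log10(1 - a), 1))       # 4.0 -> "four nines" on a request basis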

    ONAP should also consider self-monitoring capabilities for availability performance (especially since API-transaction-based monitoring is hard to do without a collective measurement point in the path), and also hierarchical monitoring (ONAP API requests can fail for many reasons, including reasons related to ONAP internal microservice/component availability as well as external reasons - e.g. a call to an external system such as a VIM fails, which would not be an ONAP-attributable failure); monitoring at multiple levels helps to determine the cause and attribution.  

  10. Availability (class) targets should be associated with each subsystem based on its potential impact - for example, control processes and associated controllers should have higher availability targets than the components dealing with new service requests, which themselves should have higher targets than pre-deployment services such as SDC, etc. The target setting should be based on an analysis of the service impact / service criticality (risk = impact * probability). We have done this for the OpenStack services, using the number of instances at risk as well as the activation probability of service-affecting faults as a proxy for determining the priorities for improvements, and the same sort of process could be used here.


  11. Numbers are important but difficult to get from test runs that last a few days at best. We all know the problem that to prove X 9's you need many days of operation. I think focusing on the types of tests we want to be able to pass will be more useful to the developers.

  12. Example SDNC/CCSDK Approach (based on ECOMP setup)

    1. High Availability
      1. Implement an OOM/K8/docker based high availability configuration for a single location/data center
        1. Opendaylight in a 3 node cluster
        2. database in a 2 node cluster (may need to complete refactoring from mysqldb to mariadb)
        3. admin server in a 2 node cluster
        4. dgbuilder does not need to be included, since it is normally a desktop design tool and not on the production server
        5. Use HAProxy or equivalent load balancer in front of the components
      2. Test of High Availability - conduct health check and API tests after each failure and recovery (see the sketch after this list)
        1. fail and recover each odl node
        2. fail and recover each database server 
        3. fail and recover each admin server
        4. fail and recover the HAProxy function
    2. Geo Redundancy
      1. Implement a docker based geo-redundancy configuration for a dual location/data center
        1. This can be two tenants in the same cloud environment or any mechanism where there are no shared components (e.g. separate K8 and networking) in the test lab to show separation
        2. Same HA cluster as above for components
        3. ODL configured for non-voting members between sites for Active/Standby geo redundancy
          1. An active/active configuration is a stretch goal in this release but it is not clear if ODL will have that support in this timeframe
        4. database configured for cross site replication through Galera Cluster or equivalent
        5. Global Load Balancing across site configured in ONAP common architecture 
      2. Test of Geo Redundancy
        1. Fail site A and fail over to site B
        2. Run SDNC regression tests on site B including MACD and new order
        3. Restore Site A
        4. Fail back to site A
        5. Run SDNC regression tests on site A including MAC on orders before and after the failover and new order
        6. Time interval for fail over and fail back should be measured (expected to be the same)
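
    A sketch of how the 1.b failure-injection loop above could be automated, assuming a Docker-based deployment; the container names and health-check URL are placeholders for the real SDNC cluster members and regression tests.

      import subprocess
      import time
      import urllib.request

      # Hypothetical container names matching the single-site cluster described above.
      CLUSTER = ["sdnc-odl-1", "sdnc-odl-2", "sdnc-odl-3",
                 "sdnc-db-1", "sdnc-db-2", "sdnc-admin-1", "sdnc-admin-2", "sdnc-haproxy"]
      HEALTHCHECK = "http://localhost:8282/healthcheck"   # placeholder for the SDNC health/API tests

      def health_and_api_tests():
          """Stand-in for the health check and API regression tests that must
          pass after every failure and every recovery."""
          with urllib.request.urlopen(HEALTHCHECK, timeout=10) as resp:
              assert resp.status == 200

      def fail_and_recover(container):
          subprocess.run(["docker", "stop", container], check=True)
          time.sleep(30)                 # give the cluster time to detect the failure
          health_and_api_tests()         # service must survive the single failure
          subprocess.run(["docker", "start", container], check=True)
          time.sleep(30)
          health_and_api_tests()         # and be fully healthy again after recovery

      if __name__ == "__main__":
          for node in CLUSTER:
              fail_and_recover(node)
              print("passed:", node)
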
    1. Brian,

      I had presented some work on a common high-availability platform (CHAP) in the use-case subcommittee discussion on October 9th. Specifically, our tools have been designed with some care for point 2.b that you make regarding geo-redundancy. In our view, using Galera across different geo-distributed sites may not scale very well because of cross-site transactionality. We propose a solution combination called MUSIC and mdbc that is optimized for geo-redundant replication with flexible consistency. This can be used by the SDN-C controller to maintain state. On another note, we also propose a distributed solution for failover called HAL that can do the job of tracking which requests belong to which cluster and performing failover of the requests when one of them fails. 

  13. Agreed; as the NINES numbers tend to be statistical in nature and cover substantial periods of time and/or a substantial number of requests, they are hard to test. This is why the typical engineering-level requirements and testing metrics have usually been expressed in more immediately quantifiable terms, such as failover time in ms and requests lost/failed, and tested using fault-injection processes (kill/disable components, connections, etc.). However, the two have been and can be used in combination, but tying them together was usually done with modeling (prescriptive models to establish targets and descriptive models to validate that the goals are met, followed by in-service performance monitoring to validate the actuals – i.e. lots of work and lots of hard-to-get parameters, although this does not need to be a full Markov treatment of everything to be useful - a simple pocket calculator with RBDs is quite a useful and simple tool as well). In non-telco domains, folks like Google establish service availability targets for each service, monitor the performance in production, and utilize processes such as "stop changing stuff until..." they get back to the target area - just an example of a different way of thinking, although it is not clear if it can be applied here.  A useful first step - before establishing numeric targets - could be to classify functionality and related components into "priority" buckets (whatever the exact definition of them is or will be), combined with tests for outage lengths and remediation times at the component / subsystem level and request monitoring at the higher level (in case the service supports auto-remediation processes, which should likely be the assumed baseline case for all services).     

  14. Nice, I would add some measurement requirements for the time intervals of the HA test scenarios as well, i.e. using some representative exercise of the controller through the API(s)/interfaces during the simulated failure, not just "after" (1 b.).  

  15. Before you can define performance, security, and availability requirements, you have to define the underlying cloud platform against which they are run.  Are these intended to apply to the standard ONAP lab environments, or all possible cloud infrastructures on which ONAP may be run?

    1. I am not sure what the issue is exactly; this has been done successfully in many other projects - if we establish and document representative reference configuration(s) that include the ONAP components along with what is used as the "underlying platform", then what is missing? This does not mean that all possible platforms and configurations need to be included in the reference config. Obviously, it would be desirable to have the reference platform composed of widely utilized open / open source technologies to the extent possible, to maximize its availability for everyone.

      1. That's what I am asking for; to establish and document the reference configurations against which performance and security will be judged.  If that has been done already, then maybe I just missed it here.

        1. Christopher – this has not been done to my knowledge, and it will take some discussion about how much we want to do in the community.  The expectation is that if you need extreme performance and availability, the operator will need to build out the appropriate hardware and network to provide that.  I'm not sure the ONAP community wants to build out one of those environments and provide it as a reference platform.  Of course, if we do performance testing, we need to describe the platform that it was tested upon.

  16. My previous experience on some other projects with 'Consul' worked out very well for resiliency (load sharing as well as high availability).  Consul is 

    • a distributed DB with master selection - Supports strict consistency as well as lazy consistency models.
    • KV pair database (non-SQL)
    • Supports Service registration and service discovery.
    • Also supports HTTP-based, DNS-like queries returning all active application nodes 
    • Does health checks to find out active and inactive application nodes.
    • Can be modified to respond with nodes that are more idle as part of HTTPS based DNS query.

    As I understand it, Consul is being used for service discovery in MSB, so it is not new to the ONAP community.  By also using Consul as a DB, you get all the benefits of Consul - service discovery, a distributed DB and hence resiliency, health checks of applications/services, etc... 

    In my view, this is a serious candidate. 

    Srini
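
    For illustration, a minimal sketch against Consul's standard HTTP API on a local agent (localhost:8500): register a service with an HTTP health check, store and read a configuration value in the KV store, and query the healthy instances. The service name, ports and keys are placeholders.

      import json
      import urllib.request

      CONSUL = "http://localhost:8500"   # default local Consul agent address

      def call(method, path, body=None):
          req = urllib.request.Request(CONSUL + path, data=body, method=method)
          with urllib.request.urlopen(req, timeout=5) as resp:
              return resp.read()

      # Register this component with an HTTP health check that Consul will poll.
      service = {"Name": "my-onap-component", "Port": 8080,
                 "Check": {"HTTP": "http://localhost:8080/healthcheck", "Interval": "10s"}}
      call("PUT", "/v1/agent/service/register", json.dumps(service).encode())

      # Store and read back a critical configuration value in the KV store.
      call("PUT", "/v1/kv/onap/my-component/log_level", b"DEBUG")
      print(call("GET", "/v1/kv/onap/my-component/log_level?raw").decode())

      # Discover the currently healthy instances of the component.
      entries = json.loads(call("GET", "/v1/health/service/my-onap-component?passing"))
      print([(e["Service"]["Address"], e["Service"]["Port"]) for e in entries])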




    1. Srini, as part of the OOM project Consul is already being deployed - a set of three servers and an agent (we'll likely create more) which are monitoring the health of the ONAP components.  This is content for the Beijing release but with Consul we're ahead of schedule.

    2. Srini,

      I agree this is a good candidate.  The Architecture Subcommittee will likely be the one to be looking at common platform capability for resiliency across ONAP.  Bharath Balasubramanian has also made some proposals and presented to the Use Case Subcommitte yesterday on this topic.

    3. Srini,

      My only concern with Consul is regarding its geo-distributed performance. From what I understand, Consul supports strong consistency within data centers (using Raft-based consensus) and weak consistency across data centers. This could mean that component replicas across data centers can read stale state. While this may be acceptable to some, it may not be suitable for others. Based on this idea, we have prototyped a tool called MUSIC in which we maintain state in a geo-distributed key-value store but provide a companion locking service that can be built using ZooKeeper or Consul (currently ZooKeeper), which gives you strong consistency when you need it and eventual consistency as the default. Further, Consul may not work off the bat with an ONAP component that uses a SQL db. MUSIC provides a SQL plugin called mdbc for that purpose. I can talk about this in more detail wherever appropriate. 

      1. Hi Bharath,

        Consul (or etcd or any other distributed KV database) supports strong consistency within a LAN for performance reasons. By the way, there is always a way to make a LAN over a WAN (using VXLAN-type technologies), thereby making Consul work across a WAN, but there could be performance issues.

        My intention in introducing Consul is for storing microservice configuration (which is normally provided via resource properties).  If any resource property is modified, then it should be known to all microservice instances.  These databases are not meant to replace MySQL, NoSQL, time-series databases such as InfluxDB, or graph databases such as Titan/JanusGraph. Each type of database has its own purpose.

        MySQL or NoSQL databases are needed where there is a requirement for a large amount of data or where the transaction rate is very high.

        Time-series databases are needed where there is a time component to the records.

        Graph databases are required where there are multiple relationships (parent, child, sibling, etc...)

        Distributed KV databases are needed where strong consistency and high availability of data are required.  It is expected that these databases do not exceed 1 or 2 gigabytes in size. That is why these are normally used for critical configuration data. They are also used when there is a need for leader election among distributed application instances, where there is a need for a distributed semaphore, and where there is a need for notification of data that is modified/added/deleted.

        In summary, I am not suggesting that Consul or etcd will replace the existing Maria/Cassandra/Influx/JanusGraph databases that are being used in ONAP.  It is only meant to ensure that all instances of an application have the latest resource property values and some critical configuration data (typically configured via Portal).

        Thanks

        Srini


  17. Thanks for all the insightful comments from the community!  Working with a small group, I've tried to incorporate the comments into an update to the wiki that I just made.  Major updates:

    • Updated the general approach to better describe the purpose and process of this effort.  Of note, at this point we're not making platform or project-level decisions, though the comments above will be helpful when it comes to that.
    • Tried to clarify the performance requirements and how level 3 is an incremental improvement upon level 2
    • Stability - provided more input into what soak tests should be and how they should be run
    • Availability - changed this to "Resiliency", which better describes the requirements.  Removed implementation descriptions from the requirements.  Also removed hard measurement times, but looking for improvement in request failure rates and data loss rates.
    • Added additional requirements to manageability

    Work that still remains:

    • reconciling this work with the NFR work
    • getting a definition of a minimal footprint (for the manageability requirement)

    These will be presented to the TSC soon, and with their blessing will move to the Architecture Subcommittee to work with project teams on defining appropriate levels for each project for Beijing.  Thanks!

    1. There seems to be a distinction in the soak tests between platform level soak and component level soak. I'm assuming the component here refers to platform components?

      I expect there would be similar tests for VNFs and services; though performing these is beyond the scope of this carrier-grade platform initiative, documenting how to use the platform to develop robust VNFs and services may be useful. We may want to bring these into the Integration test activities in Beijing to harden some non-platform use-case VNFs/services as examples. 

      1. Yes, component level refers to platform components.  Each project will have a specified level to achieve and all software components within that project will be expected to conduct the soak test.

        I agree that VNFs and services would also benefit from many of the same techniques that will be defined as part of this effort.  My first reaction is that the VNF Requirements project would provide resiliency and stability recommendations for the VNFs.  Perhaps VNF SDK or VVP would be in a position to provide supporting tooling to test those recommendations.

        1. Please include OOM in this list of supporting tools.  The OOM team is actively working on recoverability of ONAP components and will progress to clustering and even geo-redundancy over the next release.

          1. Yes, Roger Maitland, OOM would be in the candidate tool list for resiliency of the ONAP components.  The second part of my comment above was about Steven's question on how/if we provide tools or recommendations to the network vendors on how to make their VNFs resilient.  Would OOM cover that?

            1. If the VNFs use containers then possibly yes but this would be highly dependent on the VNF vendor.

  18. Jason Hunt, the OOM team is actively working on high availability.  We've been testing for recoverability (basically Brian's 1b list) and enhancing the existing OOM k8s deployment specifications where required to ensure automatic operation.  We'll be able to demo an interesting case in a week or two.  Let me know if you'd like more information.  Brian Freeman, once you're ready it would be great to deploy a highly available configuration of SDNC with OOM as a model of how to go forward with ONAP HA.

    1. Roger Maitland, this is great!  I know Bharath Balasubramanian has made some proposals to the Use Case Subcommittee on common resiliency platform capabilities as well.  And, of course, some of the capabilities in MSB can help with load balancing, etc.  We'll want to come together to make sure we have a solution that makes it as easy for the projects to implement.  I'd also like to see us leveraging common capabilities in the cloud native world, so that we aren't reinventing the wheel.  Perhaps it would make sense to have all of these presented to the architecture subcommittee for discussion?  cc: Chris Donley

  19. In order to help understand potential scope for R2, I did a compare/contrast of the R2 proposals for Non-functional requirements against the Carrier Grade requirements list above.  Hopefully I captured things accurately:

    NFRs covered or mostly covered by Carrier Grade Requirements

    • Support ONAP Platform Upgrade – covered under manageability
    • Support ONAP Reliability – covered under resiliency
    • Support ONAP Scalability – covered under scalability
    • Support ONAP Monitoring – mostly covered under manageability (common logging, transaction tracing)
    • Support a Common ONAP security framework for authorization and authentication – implementation detail of security requirements
    • Secure communication among the ONAP Services – covered under security requirements
    • PKCS11 support for private keys – could be implementation detail under security requirements
    • Software Quality Maturity – possible implementation of stability requirement level 3
    • Support for a common resiliency platform – covered under resiliency requirements
    • Programmable level of platform quality – I believe this is the spirit under which the carrier grade requirements have been developed

     

    NFRs partially covered by Carrier Grade Requirements

    • Secure all Secrets and Keys (Identity or otherwise) while they are in persistent memory or while they are in use – not required but could be an implementation detail of the security requirements
    • Support for ensuring that all ONAP infrastructure VMs/containers are brought up with the intended software – not required but could be an implementation detail of the security requirements
    • API Versioning AND Consistent  API pattern – partially covered by manageability requirement to allow for independent component upgrades
    • Provide powerful test and debug environment/tools - partially covered by manageability; also expect more testing tools/techniques to come from this effort (testing for resiliency and security for example) 


    NFRs not covered by Carrier Grade Requirements

    • CA Service for VNFs certificate enrollment – security requirements specify need for certificates, but CA Service not currently in scope to be provided by ONAP
    • Support VNF CSAR validation – probably more of a functional requirement
    • Support ONAP Portal Platform enhancements
    • Documentation and some toolkit to use ONAP and adapt it
    • Unified Design Environment
  20. Stability, resiliency and scalability dimensions all have (i) a close relationship with, or are impacted by, the load on the system and (ii) can be hard to separate between the platform and the related components of the service. ONAP is not a static service environment - on the contrary, the purpose is to enable services to become much more dynamic through new service instances (scaling) and new service types (VNF onboarding & service design). Whichever category you classify it under, a normal expectation of ONAP operation is the scaling of VNFs under dynamic loads.   This is not new functionality for ONAP, though it was not stressed in Amsterdam. 

    From the definitions above, though, this seems to be missed - scaling and resiliency above seem to be defined in terms of platform scaling rather than service or VNF scaling, and stability is focused on static loads.  Perhaps it might be useful to think in terms of applying the attributes above to the platform operations provided via the various consoles etc., rather than just to platform components.   

  21. Three points I'd like to offer for consideration in regards to usability.

    1. Consider expanding usability requirement by defining at least two categories of "users."

    Category 1: Users that see ONAP as a platform: operations teams in telecom operators, VARs and system integrators

    Category 2: ONAP developers

    2. Expand usability metrics as follows

    • ONAP User (operator, VARs, integrators)
      • Level 1
        • Deployment and platform administration
          • Documentation is available
          • Deployment tutorial available
        • Service design and deployment
          • Documentation available
          • Service design and deployment tutorial available
      • Level 2
        • ONAP Platform can be deployed on different platforms (OS, CPU architecture)
        • ONAP can be deployed in less than x hours
        • External API documentation available
        • Service discovery and registration available (to add and use external controllers and applications)
    • ONAP Developer (developer, tester, technology vendors)
      • Level 1
        • API documentation
        • Adherence to coding guidelines
        • Consistent UI across ONAP components
      • Level 2
        • Adherence to API design guidelines
        • Adherence to standard data model (when applicable)
        • Usability testing conducted
        • Tutorial documented

    You can think of ONAP as a platform or as a set of projects/components/services. Other partitions are certainly possible but these address two key categories of user personas.

    "Users" of ONAP as a platform: concerned with the deployment and management (activation, updates, monitoring, etc) of ONAP as a solution. What they do is in way similar to what the OOM team does today. These users are not concerned, for example, with development practices or coding guidelines.

    ONAP developers: although some developers/projects will be interested in deploying a complete ONAP instance, most projects focus on a few components; they do care about code quality. They may be more interested in adherence to recommended practices for API design, API documentation, code development and testing, and standard models.

    3. Tie usability to Casablanca's theme of deployability

    Usability is closely related to, and should be aligned with, the theme of the Casablanca release: deployability. By definition ONAP must be vendor agnostic. By implication, ONAP must be infrastructure and compute platform agnostic (different cloud providers, different OSs, different CPU architectures, etc.).