

This is a potential draft of a project proposal template. It is not final and is not to be used until the TSC approves it.

Link to Project Proposal training materials

Project Name:

  • Proposed name for the project: DataLake
  • Proposed name for the repository: datalake

Project

...

Goal

Build permanent storage to persist the data that flows through ONAP, and provide ready-to-use data analytics applications built on the data.

Project description:

Background

A large amount of data flows among ONAP components, mostly via DMaaP and Web Services. For example, all field events collected by DCAE collectors go through DMaaP. DMaaP data is read and processed by many ONAP components. DMaaP is backed by Kafka, a publish-subscribe system that is not suitable for data query and data analytics. Additionally, Kafka is not meant to be permanent storage, and data gets deleted after a certain retention period. Thus it is useful to persist the data that flows through DMaaP to databases, with the following benefits:

...

Data is stored in permanent storage as a historical record, so DMaaP is free to set its message retention period without taking the historical record into account.

...

With database table schemas, it is convenient to query and retrieve the data.

...

Though some components may store processed results in their local databases, most of the raw data will eventually be lost. We should store this data, which could provide insight into network operation with the help of data analytics and machine learning technologies.

Project Description

In this project, we will:

  1. Provide a systematic way to ingest DMaaP data in real time into a few selected Big Data storage systems, such as, but not limited to: Couchbase, a distributed document-oriented database; Druid, a data store designed for low-latency OLAP analytics; and HBase, a Hadoop database for mass batch processing. What data goes to which databases is configurable, depending on the problems we try to solve and the results we want to achieve. For example, by storing data in Druid, an OLAP store, we can integrate it with OLAP tools like Superset and time series tools like Grafana. In the future, new requirements may require supporting additional storage systems.
  2. Provide sophisticated and ready-to-use interactive analytics tools that are built on the data.
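As a rough illustration of the configurable topic-to-store routing described above, the mapping could look like the following sketch (the `TOPIC_CONFIG` shape, the topic names, and the `stores_for_topic` helper are hypothetical illustrations, not part of the actual design):

```python
# Hypothetical sketch of DataLake's configurable routing: which data
# stores each DMaaP topic is exported to. Names are illustrative only.

DEFAULT_STORES = ["couchbase"]

# Per-topic routing configuration, editable by the admin.
TOPIC_CONFIG = {
    "unauthenticated.SEC_FAULT_OUTPUT": ["couchbase", "druid"],
    "unauthenticated.VES_MEASUREMENT_OUTPUT": ["druid", "hbase"],
}

def stores_for_topic(topic):
    """Return the list of stores a topic's data should be persisted to."""
    return TOPIC_CONFIG.get(topic, DEFAULT_STORES)
```

Topics without an explicit entry fall back to a default store, so new topics can be persisted without any configuration change.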

...

DataLake's goals are:

  1. Provide a systematic way to ingest DMaaP data in real time into Couchbase, a distributed document-oriented database, and Druid, a data store designed for real-time OLAP analytics.
  2. Serve as common data storage for other ONAP components, with easy access.
  3. Provide sophisticated and ready-to-use data analytics tools built on the data.
  4. These tools fall into two categories: integrated third-party data analytics tools, such as Superset and Grafana, and custom applications developed by us. Custom applications include ETL applications, Big Data analytics programs developed in the Spark framework, and Machine Learning models. While integrated third-party tools are mostly for system operators (human beings) with GUI interfaces, custom applications' results are consumed both by system operators and by programs such as ONAP components and external systems (e.g. OSS/BSS).


Architecture

[Architecture diagram]

The data stores and associated tools are infrastructure external to ONAP, to be installed only once initially or drawn from existing infrastructure. Since custom settings and applications will be deployed to and run on them, they are integral parts of DataLake.

Scope

...

Data Sources

  • Monitor all or selected DMaaP topics, read the data in real time, and persist it.

  • Other ONAP components can leverage DataLake’s rich analytics features by publishing application-specific data to DMaaP.

  • Data sources other than DMaaP will be supported if needed.

Dispatcher

  • Provide an admin REST API for configuration and topic management. Each topic can be configured with the data stores it is exported to (Couchbase and Druid are supported initially) and its TTL (Time To Live) in those stores. We will support more distributed databases in the future if needed.
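The per-topic settings managed by the admin REST API might be modeled along these lines (a minimal sketch; the `TopicConfig` record and its field names are illustrative assumptions, not the project's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shape of a per-topic configuration record managed by the
# dispatcher's admin REST API; field names are illustrative only.
@dataclass
class TopicConfig:
    name: str
    # Stores this topic is exported to; Couchbase is assumed as the default.
    enabled_stores: List[str] = field(default_factory=lambda: ["couchbase"])
    # Time To Live of the topic's data in the stores, in days.
    ttl_days: int = 30

    def validate(self):
        # Couchbase and Druid are the initially supported stores.
        supported = {"couchbase", "druid"}
        unknown = set(self.enabled_stores) - supported
        if unknown:
            raise ValueError(f"unsupported stores: {sorted(unknown)}")
        return self
```

Validation rejects stores that are not yet supported, so adding a new store later only means extending the `supported` set.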

  • Provide an Admin GUI to manage the dispatcher, making use of the above admin REST API. It also manages the analytics tools and applications.

Document Store

  • Monitor selected topics, pull the data in real time, and insert it into Couchbase, one table per topic, with the table named after the topic.

  • Data in JSON, XML, or YAML is automatically converted into the native store schema. We may support additional formats. Data not in these formats is stored as a single string.
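The format auto-conversion described above could be sketched as a simple fallback chain (an assumption for illustration; `to_document` is a hypothetical helper, the XML flattening is deliberately simplistic, and YAML handling is omitted since it requires a third-party parser):

```python
import json
import xml.etree.ElementTree as ET

def to_document(raw: str):
    """Convert a raw DMaaP message into a store-ready document.

    JSON and XML are parsed into a dict (YAML would be handled the same
    way with a YAML parser); anything else is kept as a single string,
    mirroring the behaviour described above.
    """
    # Try JSON first, since most DMaaP payloads are JSON.
    try:
        return json.loads(raw)
    except ValueError:
        pass
    # Then try XML; only one level of children is flattened here.
    try:
        root = ET.fromstring(raw)
        return {root.tag: {child.tag: child.text for child in root}}
    except ET.ParseError:
        pass
    # Unrecognized format: store the payload as a single string.
    return {"value": raw}
```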

  • Provide a REST API for data query; applications can also access the data through the store's native API.

  • Couchbase supports Spark running directly on it, which allows complex analytics tools to be built. We will develop Spark analytics applications if needed.

  • Other ONAP components can take advantage of this to store their operational data. If we need to run heavy analytics jobs on historical data, we should separate operational data from historical data; otherwise, the two can coexist, thanks to Couchbase's scalability.

OLAP Store

  • Monitor selected topics, pull the data in real time, and insert it into Druid, one datasource per topic, with the datasource named after the topic.

  • Extract the dimensions and metrics from the JSON data, and pre-configure Druid settings for each datasource; these are customizable through a web interface.
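A first cut of this dimension/metric extraction might look like the sketch below (the numeric-vs-string heuristic and the `split_dimensions_metrics` helper are illustrative assumptions; the real pre-configured settings would be customizable through the web interface):

```python
from numbers import Number

def split_dimensions_metrics(event: dict):
    """Split a flat JSON event into Druid dimensions and metrics.

    Simplistic heuristic for illustration: numeric fields become metrics,
    everything else becomes a dimension.
    """
    dimensions, metrics = [], []
    for key, value in event.items():
        # bool is a subclass of Number in Python, so exclude it explicitly.
        if isinstance(value, Number) and not isinstance(value, bool):
            metrics.append(key)
        else:
            dimensions.append(key)
    return sorted(dimensions), sorted(metrics)
```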

  • Integrate Apache Superset for data exploration and visualization, and provide pre-built interactive dashboards.

  • Integrate Grafana for time series analytics.

Other Stores

  • Based on future requirements, other storage systems may be supported. DataLake is very open and flexible in this area, and tries to pick the right Big Data technologies for each task.

  • Example 1: if in the future we need to support unstructured data, we will consider including search engine technologies such as the Elastic Stack (ELK).

  • Example 2: for mass batch data processing, we may want to store data in HBase, which can be obtained from PNDA or from existing Hadoop infrastructure.

Architecture Alignment

...

  • How does this project fit into the rest of the ONAP Architecture?
    DataLake provides both API and UI interfaces. The UI is for analysts to analyze the data, while the API is for other ONAP (and external) components to query the data. For example, UUI can use the API to retrieve historical events. Some DCAE service applications may also make use of the APIs.
    • What other ONAP projects does this project depend on?
      DataLake depends on DMaaP for data ingestion, and on some other common services: OOM, SDC, and MSB.

  • In Relation to Other ONAP Components
    • DCAE focuses on being part of the automated closed control loop on VNFs; storing collected data for archiving is not covered by DCAE's scope (see the ONAP wiki forum). We envision that some DCAE analytics applications may use the data in DataLake.
    • PNDA is an infrastructure that bundles a wide variety of big data technologies for data processing; applications are developed on the technologies PNDA provides. The goal of DataLake is to store DMaaP and other data, and to build ready-to-use applications around the data, using suitable technologies whether or not they are provided by PNDA. Currently Couchbase, Druid, and Superset are not included in PNDA. We may make use of the HDFS and HBase provided by PNDA.
    • The Logging project's data source is logs, which are unstructured, and their content is at the mercy of developers, who usually output only a portion of the information. Besides, applications need to follow the ONAP Application Logging Specification v1.2 (Casablanca). DataLake's data source, on the other hand, is DMaaP. Since data from DMaaP is meant to be consumed by ONAP components, it is, as far as I know, all structured. Since DataLake is unintrusive, other components do not need to make any changes to reap the benefits of DataLake.
    • POMBA uses agents (Context Builders) to collect configuration data (service model / service instance / VNF instance) from certain modules, and uses rules to validate it. Results are stored in Elasticsearch. Its wiki does not state (I couldn't find) whether the collected raw data is stored as well. The data is passed among POMBA components via DMaaP, so DataLake can keep a copy of the data, and POMBA can leverage DataLake's rich features for historical data query and analytics.
    • ONAPARC-233, "Platform data management layer (consolidate multiple DBs)", appears to have some overlap with DataLake's scope, but there is no detail in that proposal. We will comment when more details are revealed.
  • How does this align with external standards/specifications?
    • APIs/Interfaces  - REST, JSON, XML, YAML
    • Information/data models - Swagger JSON
  • Are there dependencies with other open source projects?
    • Couchbase
    • Apache Druid
    • Apache Superset
    • Grafana
    • Apache Spark
    All use Apache 2.0 License.

Other Information

...

  • link to seed code (if applicable)
    Initial versions of the Document Store and OLAP Store and their associated tools were tested in the China Mobile CCVPN use case lab. The seed code is in China Mobile's GitLab and is ready to be contributed.
  • Vendor Neutral
    • Yes
  • Meets Board policy (including IPR)
  • This proposal was presented at the 2018-10-29 Dublin Architecture Planning F2F Meeting, and a JIRA ticket was created.

Use the above information to create a key project facts section on your project page

Key Project Facts

...

Facts | Info
PTL (first and last name) | Guobiao Mo
Jira Project Name | DataLake
Jira Key | DATALAKE
Project ID | datalake
Link to Wiki Space |

Release Components Name

...

Note: refer to existing project for details on how to fill out this table

Components Name | Components Repository name | Maven Group ID | Components Description
datalake | datalake | org.onap.datalake | Data stores for ONAP data, with a data access API and GUI data analytics tools.




Resources committed to the Release

...

Note 1: No more than 5 committers per project. Balance the committers list and avoid members representing only one company. Ensure there are at least 3 companies supporting your proposal.

...

Role | First Name Last Name | Linux Foundation ID | Email Address | Location
PTL | Guobiao Mo | guobiaomo | guobiaomo@chinamobile.com | Milpitas, CA, USA. UTC -7
Committers | Guobiao Mo | guobiaomo | guobiaomo@chinamobile.com | Milpitas, CA, USA. UTC -7
Committers | Xin Miao | xinmiao2013 | xin.miao@huawei.com | Texas, USA. CST
Committers | Zhaoxing Meng | Zhaoxing | meng.zhaoxing1@zte.com.cn | Chengdu, China. UTC +8
Committers | Tao Shen | shentao999 | shentao@chinamobile.com | Beijing, China. UTC +8
Committers | Xinhui Li | | lxinhui@vmware.com |
Contributors | Ekko Chang | ekko.chang | ekko.chang@qct.io | Taipei, UTC +8
Contributors | Kate Hsuan | mizunoami123 | kate.hsuan@qct.io | Taipei, UTC +8
Contributors | May Lin | | may.lin@qct.io | Taipei, UTC +8