我要评分
获取效率
正确性
完整性
易理解

Introduction to OmniStream

What's New

  • [2026-03-30]: Released OmniStream 1.2.0. Add the header file installation content for the dependencies used by the UDF translation tool.
  • [2025-12-30]: Released OmniStream 1.1.0. Added the task-level operator rollback mechanism in SQL scenarios, and added checkpoint and restore support for KeyedCoProcess in DataStream scenarios.
  • [2025-06-30]: Released OmniStream 1.0.0. SQL: Implemented acceleration for the Calc, GroupAgg, Join, Deduplicate, Rank, Window, and Kafka Source/Sink operators; implemented the efficient data organization method OmniVec; added support for memory and RocksDB state backends. DataStream: Implemented acceleration for the Kafka Source, Kafka Sink, Map, FlatMap, Reduce, and filter operators; implemented the UDF basic framework and UDF translation basic library; added support for the UDF automatic native framework to run stateful and stateless cases such as DataStream Wordcount; added support for the memory state backend.

Introduction to the Project

Overview

The big data features of OmniRuntime are presented in the form of plugins to improve the performance of data loading, computing, and exchange from end to end.

Data volumes generated from Internet services have been growing much faster than CPUs' computing power. The open-source big data ecosystem is also developing on a fast track. However, diversified computing engines and open source components make it difficult to improve data processing performance throughout the lifecycle. Different big data engines use their own unique tuning policies and technologies to improve performance and efficiency. Some tuning items may be applied across multiple engines, which may cause resource contention and conflicts, reducing overall computing performance.

OmniRuntime consists of a series of features provided by Kunpeng BoostKit for Big Data in terms of application acceleration. It aims to improve the performance of end-to-end data loading, computing, and exchange through plugins, thereby improving the performance of big data analytics.

As a subfeature of OmniRuntime, OmniStream uses native code (C/C++) to implement Flink SQL operators to improve query performance. The Flink engine is reconstructed natively for enhanced performance.

It has been adapted to Flink 1.16.3.

Architecture

The OmniStream Flink Native feature uses native code (C/C++) to reconstruct the logic of Flink SQL and DataStream operators, improving query performance.

  • For SQL, OmniStream uses C++ and vectorized instructions to implement operators, leveraging vectorization to enhance SQL computing performance.
  • For DataStream, OmniStream uses C++ and vectorized instructions to implement operators, fully leveraging the performance advantages of native code to improve performance in DataStream scenarios.

SQL

OmniStream uses an architecture consisting of the Java Adapter layer and the CPP Core layer.

  • The Java Adapter layer is implemented in Java and is responsible for generating native execution plans and falling back to the Java runtime in unsupported scenarios.
  • The CPP Core layer is implemented in C++ and is responsible for implementing operator logic and data transmission.

An SQL query submitted through SQL or Table API is parsed, and an execution plan is then generated. The Java Adapter layer obtains the execution plan, initializes related tasks in the CPP, and generates an operator chain. After the initialization process is complete, the task runs to read data from the source. After a series of operator processing operations, Sink outputs the result.

Figure 1 shows the architecture of OmniStream Flink SQL Native.

Figure 1 Architecture of OmniStream Flink SQL Native

DataStream

After receiving an input, the DataStream API parses it into an execution plan. The Java adapter layer parses the plan, initializes related tasks on the C++ side, and builds the corresponding operator chain. After the initialization process is complete, the task runs to read data from the source. After a series of operator processing operations, Sink outputs the result.

Figure 2 shows the architecture of OmniStream Flink DataStream Native.

Figure 2 Architecture of OmniStream Flink DataStream Native

Application Scenarios

OmniStream improves the processing performance of the Flink engine while preserving existing development practices and architectural compatibility. In large-scale real-time data analysis scenarios, it significantly enhances processing capacity and execution efficiency.

Apache Flink is an open source real-time stream processing engine. As services rapidly evolve and data volumes surge, Flink's performance bottlenecks have started to surface in certain high-load scenarios. This is particularly evident in Internet-related use cases, where Flink trails some peer offerings in terms of performance. OmniStream uses native code (C/C++) to implement Flink SQL operators, improving query execution efficiency. It natively reconstructs the Flink engine to enhance performance.

OmniStream supports Flink 1.16.3. It parses user-submitted SQL statements into a series of operators and executes them using native operators provided by it. These native operators replace open source Flink operators, significantly improving performance.

Nexmark: It is a benchmark suite designed for evaluating queries on continuous data streams. It offers fair and comprehensive performance tests for stream processing systems, serving as a tool for both optimization guidance and performance comparison.

Constraints

OmniStream has restrictions on supported data types, operators, and state backends. Plan your tasks accordingly and avoid unsupported scenarios.

SQL

  • OmniStream Flink Native supports the Nexmark benchmarking suite, including Nexmark data types and built-in functions.
  • Supported data types: BIGINT, TIMESTAMP(3), and VARCHAR.
  • Supported expressions: + - * /
  • Supported built-in functions: LOWER, SPLIT_INDEX, DATE_FORMAT, MOD, and COUNT_CHAR.
  • Supported GroupAggregate functions: SUM(BIGINT), COUNT(BIGINT), AVG (BIGINT), MIN(BIGINT), MIN(VARCHAR), MAX (BIGINT), and MAX(VARCHAR).
  • The join key of the join operator must be of the BIGINT type, and the operation type must be Inner Join.
  • The Deduplicate and Rank operators allow only Partition By BIGINT. All fields in the query table must be of the supported data type and only the ROW_NUMBER function is supported. Partition By of Rank supports only one field of the BIGINT type. Order By of TOPN supports only one BIGINT field, whose sorting rule is DESC. Order By of TOP1 supports a maximum of two fields. The type can be BIGINT TIMESTAMP(3), and the sorting rule can be DESC or ASC.
  • The Group By column of the Aggregate operator allows only BIGINT.
  • The aggregate function of the LocalWindowAGG and GlobalWindowAGG operators is COUNT or MAX. The aggregate function of the GroupWindowAGG operator is COUNT.
  • The LocalWindowAGG and GlobalWindowAGG operators support only the TUMBLE and HOP windows.
  • The GroupWindowAGG operator supports only the SESSION window.
  • The external table data source of the Lookup Join operator supports only CSV files.
  • The state backend supports only the memory and RocksDB.
  • Flink stores states in the memory state backend, and the memory usage grows over time as the volume of data increases. In comparison, OmniStream uses the columnar vectorized architecture to optimize performance. Its state storage behaves the same as the native Flink while delivering a higher processing speed and consuming the memory space faster. Therefore, the Nexmark benchmark test cases support a maximum of 50 million data records.

DataStream

For details, see Supported DataStream Operators and UDFs.

  • Source and Sink support only Kafka data sources.
  • These operators are supported: Map, FlatMap, GroupReduce, Filter, Source, and Sink.
  • The Filter operator must be RichFilterFunction.
  • The state backend supports only memory and does not support checkpoints.

Directory Structure

The full project directory structure is as follows:

├─cpp
│  ├─conf
│  ├─connector
│  ├─core
│  ├─datagen
│  ├─include
│  ├─jni
│  ├─runtime
│  ├─streaming
│  ├─tabler
│  ├─test
│  ├─third_party
│  ├─translate
│  └─zemo
├─docs
│   └── en                                                   # English document directory
│       ├── figures                                          # Directory of images in documents
│       ├── public_sys-resources                             # Directory of images in documents
│       ├── faq.md                                           # OmniStream FAQs
│       ├── installation_guide.md                            # OmniStream Installation Guide
│       ├── quick_start.md                                   # Quick Start
│       ├── release_notes.md                                 # OmniStream Release Notes
│       ├── user_guide.md                                    # OmniStream User Guide
├─README_en.md
└─scriptss

Release Notes

For details about feature changes in each version, see Release Notes

Environment Deployment

For details about the environment dependencies and installation methods of OmniStream, see Installation Guide

Quick Start

For instructions on quickly verifying whether OmniStream is active and its performance improvements, see Quick Start

Name Overview
Quick Start Provides guidance on how to quickly enable and verify the OmniStream feature.
Release Notes Provides basic information and feature updates of each OmniStream version.
Installation Guide Describes how to install OmniStream.
User Guide Provides details about how to use OmniStream.
FAQs Provides answers to frequently asked questions (FAQs) about installing and using OmniStream.

Security Statement

Routine Antivirus Software Check

Periodically scan clusters and Spark components for viruses. This protects clusters from viruses, malicious code, spyware, and malicious programs, reducing risks such as system breakdown and information leakage. Mainstream antivirus software can be recommended for antivirus check.

Log Control

  • Check whether the system can limit the size of a single log file.
  • Check whether there is a mechanism for clearing logs when the log space is used up.

Vulnerability Fixing

To ensure the security of the production environment and reduce the risk of attacks, enable the firewall and periodically fix the following vulnerabilities:

  • OS vulnerabilities

  • JDK vulnerabilities

  • Hadoop and Spark vulnerabilities

  • ZooKeeper vulnerabilities

  • Kerberos vulnerabilities

  • OpenSSL vulnerabilities

  • Vulnerabilities in other components

    The following uses CVE-2021-37137 as an example.

    Vulnerability description:

    Netty 4.1.17 has two Content-Length HTTP headers that may be confused. The vulnerability ID is CVE-2021-37137.

    The system uses the hdfs-ceph (version 3.2.0) service as the storage object with decoupled storage and compute. This service depends on aws-java-sdk-bundle-1.11.375.jar and involves this vulnerability. You are advised to update the vulnerability patch in a timely manner to prevent hacker attacks.

    Impact:

    Netty 4.1.68 and earlier versions

    Handling suggestion:

    Currently, the vendor has released an upgrade patch to fix the vulnerability. For details, visit GitHub.

SSH Hardening

During the installation and deployment, you need to connect to the server through SSH. The root user has all the operation permissions. Logging in to the server as the root user may pose security risks. You are advised to log in to the server as a common user for installation and deployment and disable root user login using SSH to improve system security.

Check the PermitRootLogin configuration item in /etc/ssh/sshd_config.

  • If the value is no, root user login using SSH is disabled.
  • If the value is yes, change it to no.

Public Network Address Statement

Table 1 Public network address statement

Open Source/Third-Party Software

GCC

Type

Open source software

Public IP Address/Public URL/Domain Name/Email Address

https://gcc.gnu.org/bugs/

File Type

Binary

File Name

libboostkit-omniop-vector-2.0.0-aarch64.so

Usage

This email address is the official address of the open source component GCC, and is used only to compile the open source component. This email address is not used inside this product.

Software Package

BoostKit-omniop_2.0.0.zip

boostkit-omniop-operator-2.0.0-aarch64-centos.tar.gz

Disclaimer

To OmniStream users

  • This tool is intended solely for debugging and development. You are responsible for any risks and should carefully review the following information:

    • Data processing and deletion: Users are responsible for managing and deleting any data generated while using this tool. You are advised to promptly delete any related data after use to prevent information leaks.
    • Data confidentiality and transmission: Users understand and agree not to share or transmit any data generated by this tool. Neither the tool nor its developers are responsible for any information leaks, data breaches, or other negative consequences.
    • User input security: Users are responsible for the security of any commands they enter and for any risks or losses resulting from improper input. The tool and its developers are not liable for issues caused by incorrect command usage.
  • Disclaimer scope: This disclaimer applies to all individuals and entities using this tool. By using the tool, you acknowledge and accept this statement and assume all risks and responsibilities arising from its use. If you do not agree, please stop using the tool immediately.

  • Before using this tool, please read and understand the preceding disclaimer. If you have any questions, contact the developer.

To data owners

If you do not want your model or dataset to be mentioned in OmniStream, or if you wish to update its description, please submit an issue on GitCode. We will delete or update your description according to your request. Thank you for your understanding and contribution to OmniStream.

License

For details, see the License file.

The documents in the xxxdocs directory are licensed under CC BY 4.0. For details, see the License file.

Contribution Statement

  1. Submit an error report: If you discover a vulnerability in OmniStream that is not a security issue, first search the Issues in the OmniStream repository to avoid submitting duplicates. If the vulnerability is not listed, create a new issue. If you discover a security-related problem, do not disclose it publicly. Please refer to the security handling guidelines for details. All error reports must include complete information about the issue.
  2. Security issue handling: For guidance on handling security issues in this project, please contact the core team via email for instructions.
  3. Resolving existing issues: Review the repository's issue list to identify issues that need attention, and attempt to resolve them.
  4. How to propose new functions: Use the Feature tag when creating an issue for a new function. We will review and confirm proposals periodically.
  5. How to contribute:
    1. Fork the repository of the project.
    2. Clone it to your local machine.
    3. Create a development branch.
    4. Local testing: All unit tests, including any new test cases, must pass before submission.
    5. Commit your code.
    6. Create a pull request (PR).
    7. Code review: Modify the code according to review comments and resubmit your changes. This process may involve multiple rounds of iterations.
    8. After your PR is approved by the required number of reviewers, the committer will conduct the final review.
    9. After your PR is approved and all tests pass, the CI system will merge it into the project's main branch.

Suggestions and Feedback

You are welcome to contribute to the community. If you have any questions or suggestions, please submit an issue. We will respond as soon as possible. Thank you for your support.

Acknowledgments

OmniStream is jointly developed by the following Huawei departments:

Kunpeng Computing BoostKit Development Dept

Thank you to everyone in the community for your PRs. We warmly welcome contributions to OmniStream!