Ultrascan

Ultrascan is an open-source, high-performance regular expression matching library. The Kunpeng BoostKit platform has adapted and optimized it for the Arm instruction set and has introduced new algorithms (for example, a hybrid model with a short-rule bypass) to address the performance weaknesses exposed by rule sets used in real data distribution services.

Ultrascan operates in two main phases: compilation and runtime. During the compilation phase, one or more rules are compiled into a read-only database. The runtime phase does not require loading the rules; instead, it uses this pre-compiled database for pattern matching.
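The two-phase split can be illustrated with a minimal Python sketch. It does not use the Ultrascan API: Python's standard `re` module stands in for the matching engine, and `Database`, `compile_rules`, and `scan` are hypothetical names chosen for this illustration only.

```python
import re
from typing import NamedTuple


class Database(NamedTuple):
    """Hypothetical stand-in for a pre-compiled, read-only rule database."""
    patterns: tuple  # compiled patterns, indexed by rule ID


def compile_rules(rules):
    # Compilation phase: every rule is parsed and compiled once, up front.
    return Database(tuple(re.compile(rule.encode()) for rule in rules))


def scan(db, data):
    # Runtime phase: only the database is consulted; rule text is not reparsed.
    hits = []
    for rule_id, pattern in enumerate(db.patterns):
        for match in pattern.finditer(data):
            hits.append((rule_id, match.end()))
    return sorted(hits, key=lambda hit: hit[1])


db = compile_rules([r"foo\d+", r"bar"])
print(scan(db, b"xx foo42 bar"))  # (rule ID, end offset) pairs
```

The same `db` object can be reused for any number of `scan` calls, which is the property the runtime phase relies on.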

Ultrascan is the fastest in rule set matching scenarios with fewer than 100,000 rules. It is widely used in scenarios such as application classification, IDP, and web application firewall (WAF). In typical open-source solutions such as Suricata, the detection engine is a top computing hotspot. Optimizing Ultrascan and integrating it into such solutions can effectively improve the end-to-end performance of data distribution services.

Key technologies:

Efficient pre-filtering algorithm, multi-pattern matching algorithm, and automata algorithm.

Applicable scenarios:

Information security data distribution, fine-grained data distribution for carriers, public safety data distribution, IDPS, WAF, foundation model application firewalls, etc.

Short-Rule Bypass Technology

A rule set containing short rules can cause performance bottlenecks in Ultrascan rule matching. The rule sets deployed on customers' live networks usually have such bottlenecks, which come either from manually written short rules or from short strings that the graph partitioning algorithm extracts from regular rules and turns into short rules. Once short rules enter multi-pattern matching, their high match rate frequently interrupts the fast frontend instruction pipeline and triggers the relatively slow backend exact verification. Processing short and common rules together introduces redundant operations, resulting in performance bottlenecks.

Huawei Ultrascan provides a hybrid model with a short-rule bypass, which significantly improves the matching performance for rule sets containing short rules. The model separates short rules from common rules and uses a few vector instructions to implement a high-speed bypass algorithm. This eliminates redundant operations and maintains the instruction pipeline parallel efficiency of the main algorithm, greatly improving the overall matching performance.

Short rules lead to poor performance in Suricata because they cause excessive vector calculations and exact checks during matching. Research has shown that simply removing eight single-byte rules from one subset can improve overall matching performance by 50%. To address this, the latest version introduces the hybrid model with a short-rule bypass. The model excludes short rules from the main algorithm and handles them with an efficient bypass algorithm built on TBL table lookup instructions. While this approach adds 20% more overhead for short rule matching, it ultimately boosts overall performance by over 30%.

The high-speed short-rule bypass algorithm, shuffle-and, plays a key role in efficiently processing short rules in a rule set.

Single-byte short rules
The following describes how the shuffle-and algorithm works. Assume that a rule set contains eight 1-byte rules expressed as {A, B, C, D, c, d, e, f}. Each rule is assigned a group, and the group ID of each rule is represented by one bit. The group IDs of the rules are {0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80}. The entire ASCII table is expressed as a 16 x 16 grid, where characters with the same high-order 4-bit value are placed in the same row, and characters with the same low-order 4-bit value are placed in the same column. Then, enter the group ID of each rule in the corresponding cell. See the following figure.

Perform a bitwise OR operation on the values in each row of the table to obtain the vector shown in the second column of the preceding figure, which is recorded as maskHigh. Perform a bitwise OR operation on the values in each column of the table to obtain the vector shown in the second row of the preceding figure, which is recorded as maskLow. The following figure shows maskLow and maskHigh.
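The mask construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's code: `build_masks`, `mask_low`, and `mask_high` are names chosen here, and the 16 x 16 grid is never materialized because OR-ing a group ID directly into its row and column entries gives the same result.

```python
RULES = ["A", "B", "C", "D", "c", "d", "e", "f"]  # rule set from the text


def build_masks(rules):
    mask_low = [0] * 16   # indexed by low nibble (column of the grid)
    mask_high = [0] * 16  # indexed by high nibble (row of the grid)
    for index, char in enumerate(rules):
        group_id = 1 << index          # one bit per rule: 0x1, 0x2, ..., 0x80
        byte = ord(char)
        mask_low[byte & 0x0F] |= group_id
        mask_high[byte >> 4] |= group_id
    return mask_low, mask_high


mask_low, mask_high = build_masks(RULES)
print("maskLow :", [hex(v) for v in mask_low])
print("maskHigh:", [hex(v) for v in mask_high])
```

For this rule set, only entries 4 and 6 of `mask_high` are non-zero (0x0f for the uppercase rules, 0xf0 for the lowercase ones), and entries 3 and 4 of `mask_low` each combine two rules (0x14 and 0x28).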

After the preceding preparations are complete, maskLow and maskHigh are used as source vectors for the shuffling operation, and the data to be matched is used as control vectors.

Assume that a 16-byte data block in the data to be matched is AgggDd3366666666. You can obtain the matching results of all positions with five instructions. First, express the lower four bits and upper four bits of each byte in the data block as two control vectors. This requires an AND operation to erase the upper four bits of each byte and a shift operation to move the upper four bits of each byte to the lower four bits. Then, you can obtain two vectors inputLow and inputHigh.

Then, two shuffling operations are performed to obtain the results shufLow and shufHigh as follows:

Perform an AND operation on shufLow and shufHigh to obtain the matching result of each position.

As shown in the preceding figure, offsets 0, 4, and 5 match group 0x01 (A), group 0x08 (D), and group 0x20 (d), respectively.
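The five steps (AND, shift, two shuffles, AND) can be emulated in plain Python to check the result for this data block. This is a scalar sketch under stated assumptions: the `shuffle` helper only imitates a byte-wise table lookup such as the Arm TBL instruction, and all function names are chosen for this illustration.

```python
RULES = ["A", "B", "C", "D", "c", "d", "e", "f"]


def build_masks(rules):
    # One bit per rule, OR-ed into the entry for its low and high nibble.
    mask_low, mask_high = [0] * 16, [0] * 16
    for index, char in enumerate(rules):
        mask_low[ord(char) & 0x0F] |= 1 << index
        mask_high[ord(char) >> 4] |= 1 << index
    return mask_low, mask_high


def shuffle(table, control):
    # Emulates a byte shuffle: output[i] = table[control[i]].
    return [table[c] for c in control]


def shuffle_and(block, mask_low, mask_high):
    input_low = [b & 0x0F for b in block]       # 1. AND: keep the low nibble
    input_high = [b >> 4 for b in block]        # 2. shift: move the high nibble down
    shuf_low = shuffle(mask_low, input_low)     # 3. shuffle with maskLow
    shuf_high = shuffle(mask_high, input_high)  # 4. shuffle with maskHigh
    return [lo & hi for lo, hi in zip(shuf_low, shuf_high)]  # 5. AND


mask_low, mask_high = build_masks(RULES)
result = shuffle_and(b"AgggDd3366666666", mask_low, mask_high)
matches = {offset: hex(group) for offset, group in enumerate(result) if group}
print(matches)  # {0: '0x1', 4: '0x8', 5: '0x20'}
```

Because each group ID bit belongs to exactly one rule in this example, a bit survives the final AND only when both nibbles of the input byte equal those of that rule.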

2- to 4-byte short rules
The 2- to 4-byte rule bypass algorithm of the Ultrascan short-rule bypass engine can process a maximum of eight 2- to 4-byte rules. Rule membership is encoded in 1 byte, with each bit representing one rule. The following uses a 3-byte rule as an example to describe the matching algorithm.

Assume that there is a 3-byte rule ddy. Each byte is divided into higher four bits and lower four bits and filled in a table with 6 rows and 16 columns, as shown in the following figure. Each cell of the table represents a byte.

Each bit of a cell carries the information for one rule. In this example, there is only one rule, so each cell uses only one bit. Each row in the table corresponds to one half-byte of the rule, with a bit value of 0 indicating a hit and 1 indicating a miss.

The preceding figure shows an example of the 3-byte rule bypass algorithm matching the 16-byte input yourdaddyisbravo. The input is read into a 128-bit vector register through the LDR instruction. Each byte is split into its lower four bits and higher four bits. TBL parallel table lookup is performed on rows 0, 2, and 4 and on rows 1, 3, and 5 of the table, respectively, to obtain a result of six rows. Then, the OR operation is performed on rows 0 and 1, rows 2 and 3, and rows 4 and 5, respectively, to obtain a result of three rows, one for each byte of the rule. Because 0 indicates a hit, a position remains 0 after the OR operation only if both half-bytes hit. The result of the first byte is shifted left by two bytes, and the result of the second byte is shifted left by one byte (that is, toward higher byte offsets), so that the results for all three rule bytes line up at the offset of the last byte. Then, the OR operation is performed on the result of the third byte and the two shifted results to obtain the accurate sequence matching result. In this example, the rule ddy is matched at position 8.
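The 3-byte procedure can be emulated with a scalar Python sketch. It rests on assumptions that the text does not state: even rows of the table hold the low-nibble lookups and odd rows the high-nibble lookups, unused rule bits are set to 1 (miss), and the bytes shifted in at the start of the block are filled with misses. All names are chosen for this illustration.

```python
RULE = b"ddy"
RULE_BIT = 0x01  # bit that represents this rule in every cell
MISS = 0xFF      # all bits set: no rule hits


def build_table(rule, rule_bit):
    # Rows 2*i and 2*i+1 describe the low and high nibble of rule byte i.
    table = [[MISS] * 16 for _ in range(2 * len(rule))]
    for i, byte in enumerate(rule):
        table[2 * i][byte & 0x0F] &= ~rule_bit & 0xFF    # 0 marks a hit
        table[2 * i + 1][byte >> 4] &= ~rule_bit & 0xFF
    return table


def shift_up(vector, count):
    # Moves results toward higher offsets; vacated positions become misses.
    return [MISS] * count + vector[:len(vector) - count]


def match_block(block, table, rule_len):
    low = [b & 0x0F for b in block]
    high = [b >> 4 for b in block]
    combined = [0] * len(block)
    for i in range(rule_len):
        # Emulated table lookups on the two rows, then OR of both halves.
        per_byte = [table[2 * i][lo] | table[2 * i + 1][hi]
                    for lo, hi in zip(low, high)]
        aligned = shift_up(per_byte, rule_len - 1 - i)
        combined = [c | a for c, a in zip(combined, aligned)]
    return combined


result = match_block(b"yourdaddyisbravo", build_table(RULE, RULE_BIT), len(RULE))
hits = [offset for offset, cell in enumerate(result) if (cell & RULE_BIT) == 0]
print(hits)  # [8]
```

Filling the vacated positions with misses matters for this input: offset 0 holds the byte y, so a zero fill would report a false hit there.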

False-Positive Blocking Technology

The rule sets deployed on the customer's live network sometimes contain a few rules with special fragments, causing excessive false positives in multi-pattern matching. This triggers a large number of interpreter calls and inefficient long-rule verification, but yields zero true matches. These unnecessary interpreter calls become computing hotspots, undermining the pre-filtering capability of multi-pattern matching.

In a real customer traffic environment, a few rules with bad string fragments in the core rule set generated over 5 million false hits in multi-pattern matching. Interpreter calls and complete long-rule verification became the computing hotspot, yet no actual matches were found. After these rules were removed, the hotspot was eliminated and the matching performance improved by a factor of 20.

Figure 1 False-positive blocking

Ultrascan provides a false-positive blocking model, which greatly improves the matching performance for rule sets that contain bad string fragments.

The false-positive blocking model controls the behavior of the graph partitioning algorithm during the compilation phase. It checks candidate partitions in the NFA graph for bad fragments and intercepts the partitions that belong to bad fragments. This prevents bad fragments from entering multi-pattern matching, thereby reducing false hits and filtering out meaningless verifications. In addition, it leverages vector instructions to optimize the inefficient long-rule verification algorithms within the interpreter during the runtime phase. Together, these two measures significantly improve the overall performance.

Universal Bytecode Technology

In complex information security and data distribution applications, heterogeneous and distributed device deployment has become the norm. In traditional mode, the bytecode generated during rule compilation is often deeply bound to specific hardware micro-architectures (such as the x86 instruction sets or Arm architecture features). As a result, repeated compilation is required for different devices on a hybrid-architecture network, greatly increasing the complexity of rule distribution and O&M costs.

To address this, a universal database function has been developed. It decouples the compilation product from the underlying hardware architecture. Through standardized bytecode encapsulation and adaptive loading technologies, compilation databases generated on the x86 or Arm platform can run across platforms. This function addresses the deployment challenges in scenarios where devices of multiple generations and architectures coexist: (1) users only need to compile rules once on the central control end, and the generated universal bytecode can then be seamlessly delivered to network-wide edge nodes of different architectures; (2) there is no need to maintain architecture-specific rule libraries, which significantly reduces storage and version management costs; (3) the path from rule update to rule validation is shortened, ensuring the consistency and efficiency of data distribution services in a heterogeneous distributed environment.

Figure 2 Universal bytecode technology

As shown in the figure above, during compilation, optimized bytecode is generated for each preset platform and packaged into the same database file. An offset field is also set in the database file to record the relative positions of the bytecode for the x86 and Arm platforms. During runtime, the CPU features of the current platform are detected, and then the bytecode supported by the platform is loaded from the database for execution. This design implements one-time compilation for cross-platform execution. With layered storage optimization, it also improves the universality of the solution.
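The packaging and loading scheme can be sketched as follows. The layout is purely hypothetical, since the actual Ultrascan database format is not described here: the `MAGIC` value, the two-offset header, and the function names are assumptions made for illustration, and the machine architecture name stands in for real CPU feature detection.

```python
import platform
import struct

MAGIC = b"UDB0"                  # hypothetical file magic
HEADER = struct.Struct("<4sII")  # magic, x86 bytecode offset, Arm bytecode offset


def pack_database(x86_code: bytes, arm_code: bytes) -> bytes:
    # Compilation: store both bytecode variants and record where each begins.
    x86_offset = HEADER.size
    arm_offset = x86_offset + len(x86_code)
    return HEADER.pack(MAGIC, x86_offset, arm_offset) + x86_code + arm_code


def load_bytecode(database: bytes, machine: str) -> bytes:
    # Runtime: read the offset fields and return the variant for this platform.
    magic, x86_offset, arm_offset = HEADER.unpack_from(database, 0)
    if magic != MAGIC:
        raise ValueError("not a universal database")
    if machine in ("x86_64", "AMD64", "i386", "i686"):
        return database[x86_offset:arm_offset]
    if machine in ("aarch64", "arm64"):
        return database[arm_offset:]
    raise ValueError(f"unsupported platform: {machine}")


db = pack_database(b"<x86 bytecode>", b"<arm bytecode>")
print(load_bytecode(db, "aarch64"))
```

On a deployed node, the second argument would come from the running system, for example `load_bytecode(db, platform.machine())`, so the same database file serves every supported architecture.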
