Examples
The following uses the enwiki-latest-pages-articles dataset as an example to describe how to test the performance of the KNewPfordelta interfaces.
Configuring the Dataset Generation Environment
- Obtain the source code of the KNewPfordelta interfaces as instructed in Obtaining the KNewPfordelta Source Code. The test framework code is included in the source code.
- Compile the source code as instructed in Compiling KNewPfordelta.
- Install Python 3.
yum install python3 python3-devel python3-pip
- Install Conda.
- Obtain Conda.
- Upload the downloaded .sh file to the server and run the following command (using Anaconda3-2025.06-1-Linux-aarch64.sh as an example):
bash Anaconda3-2025.06-1-Linux-aarch64.sh
During installation, the license terms will be displayed. Press Enter repeatedly to scroll through the content. When prompted with "Do you accept the license terms?", type yes to proceed. When prompted with "Do you wish to initialize Anaconda?", type yes.
- When the installation is complete, run the following command to activate the Anaconda environment:
source ~/.bashrc
- Run the following command to verify the installation:
conda --version
- Install the Python modules.
conda install nltk numpy
- Install nltk_data.
- Obtain nltk_data.
git clone https://github.com/nltk/nltk_data.git
- Rename the packages directory in the cloned repository to nltk_data.
cd /path/to/nltk_data
mv packages nltk_data
- Place nltk_data in a searchable directory.
Assume that nltk_data is stored in /path/to/nltk_data. Decompress punkt.zip and punkt_tab.zip in /path/to/nltk_data/tokenizers/.
cd /path/to/nltk_data/tokenizers/
unzip punkt.zip
unzip punkt_tab.zip
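Before generating datasets, it can help to confirm that the unpacked tokenizer directories sit where NLTK is likely to look. The following is a minimal sketch using only the Python standard library. The helper names candidate_nltk_dirs and find_tokenizer are hypothetical, and the directory list is an assumed subset of the NLTK search path (the NLTK_DATA environment variable, ~/nltk_data, and two system-wide locations), not the complete list.

```python
import os
from pathlib import Path


def candidate_nltk_dirs(env=None):
    """Return directories to check, in order.

    Assumption: this is only a subset of the locations NLTK searches.
    """
    env = os.environ if env is None else env
    dirs = []
    # NLTK_DATA may list several directories separated by os.pathsep.
    for entry in env.get("NLTK_DATA", "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    dirs.append(Path.home() / "nltk_data")
    dirs.append(Path("/usr/share/nltk_data"))
    dirs.append(Path("/usr/local/share/nltk_data"))
    return dirs


def find_tokenizer(name, dirs):
    """Return the first base directory that contains an unpacked
    tokenizers/<name> directory, or None if no directory has it."""
    for base in dirs:
        if (base / "tokenizers" / name).is_dir():
            return base
    return None


if __name__ == "__main__":
    search_dirs = candidate_nltk_dirs()
    for tokenizer in ("punkt", "punkt_tab"):
        hit = find_tokenizer(tokenizer, search_dirs)
        print(f"{tokenizer}: {hit if hit else 'not found'}")
```

If a tokenizer is reported as not found, one option is to set NLTK_DATA to the directory that contains the tokenizers folder before running the generation scripts.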
Generating Datasets
The following describes how to generate an inverted index dataset for article classification.
- Download the publicly available Wikipedia article dumps in bz2 format, for example, enwiki-latest-pages-articles1.xml-p1p41242.bz2.
cd /path/to/knewpfordelta/test/gen_data
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2 --no-check-certificate
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles2.xml-p41243p151573.bz2 --no-check-certificate
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles3.xml-p151574p311329.bz2 --no-check-certificate
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles4.xml-p311330p558391.bz2 --no-check-certificate
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles5.xml-p558392p958045.bz2 --no-check-certificate
- Convert the downloaded data files into a JSON file.
cd /path/to/knewpfordelta/test/gen_data
python parse_articles_xml.py enwiki-latest-pages-articles1.xml-p1p41242.bz2 enwiki-latest-pages-articles2.xml-p41243p151573.bz2 enwiki-latest-pages-articles3.xml-p151574p311329.bz2 enwiki-latest-pages-articles4.xml-p311330p558391.bz2 enwiki-latest-pages-articles5.xml-p558392p958045.bz2
- Convert the JSON file into a word frequency file.
python export_terms_and_freq.py inverted_index_articles.json
- Convert the JSON file to a binary data file containing document IDs.
python rebuild_inverted_index_for_chunk.py inverted_index_articles.json
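To make the data flow of the steps above concrete, the following sketch reads a bz2-compressed MediaWiki-style XML dump, builds a small inverted index (term to sorted document IDs), and derives per-term document counts. It is an illustration only and does not reproduce parse_articles_xml.py, export_terms_and_freq.py, or rebuild_inverted_index_for_chunk.py. The function names, the regular-expression tokenizer (used here in place of NLTK), and the in-memory sample are all assumptions.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, text) pairs from a bz2-compressed dump.

    `source` is a file name or a binary file object.
    """
    with bz2.open(source, "rb") as stream:
        for _, elem in ET.iterparse(stream, events=("end",)):
            if local(elem.tag) != "page":
                continue
            page_id, text = None, ""
            for child in elem.iter():
                name = local(child.tag)
                if name == "id" and page_id is None:
                    page_id = int(child.text)  # first <id> is the page ID
                elif name == "text":
                    text = child.text or ""
            yield page_id, text
            elem.clear()  # release parsed content; real dumps are large


def build_inverted_index(pages):
    """Map each lowercase term to the sorted IDs of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def term_frequencies(index):
    """Return (term, document count) pairs, most frequent first."""
    pairs = ((term, len(ids)) for term, ids in index.items())
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


# Hypothetical two-page sample standing in for a downloaded dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
<page><title>A</title><id>12</id><revision><id>900</id><text>Political philosophy</text></revision></page>
<page><title>B</title><id>25</id><revision><id>901</id><text>Philosophy of science</text></revision></page>
</mediawiki>"""

index = build_inverted_index(iter_pages(io.BytesIO(bz2.compress(SAMPLE))))
print(index["philosophy"])         # [12, 25]
print(term_frequencies(index)[0])  # ('philosophy', 2)
```

Each term's sorted list of document IDs is the kind of integer sequence that an integer compression scheme such as PForDelta operates on.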
Test
- Perform the decompression test.
- Go to the test directory.
cd /path/to/knewpfordelta
- Run the performance test. The input parameters are the .txt word frequency file and the .bin data file.
numactl -m 0 -c 0 ./newpfordelta_perf /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
The performance test result of KNewPfordelta is as follows:

In the output, speed indicates the decompression performance.
- Run the function test.
numactl -m 0 -c 0 ./newpfordelta_ut /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
The function test result is as follows:

The command output displays the message indicating successful decompression.
- Test the performance of unpack functions in handling exceptions.
- Delete the generated files.
make clean
- Compile KNewPfordelta.
make perf=1
- Run the performance test.
numactl -m 0 -c 0 ./newpfordelta_perf /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
The performance test result of KNewPfordelta is as follows:

The output shows the running performance of the different unpack functions, broken down into the time spent on normal values and on exception values. It also shows their respective proportions of the total running time for each function. Calls indicates how many times each unpack function is invoked.
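For readers unfamiliar with the distinction between normal values and exception values, the following sketch illustrates the general PForDelta idea: document IDs are stored as gaps, gaps that fit in a chosen bit width are normal values, and larger gaps are kept separately as exceptions. This is a generic illustration under stated assumptions, not the KNewPfordelta storage format or API. The function names and the 3-bit width are hypothetical.

```python
def deltas(doc_ids):
    """Return gaps between consecutive sorted document IDs.

    The first gap is measured from 0.
    """
    prev, out = 0, []
    for doc_id in doc_ids:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def split_exceptions(values, bits):
    """Separate values that fit in `bits` bits from those that do not.

    Returns the normal slots (exceptions replaced by 0) and a list of
    (position, value) pairs for the exceptions.
    """
    limit = 1 << bits
    normal = [v if v < limit else 0 for v in values]
    exceptions = [(i, v) for i, v in enumerate(values) if v >= limit]
    return normal, exceptions


def restore(normal, exceptions):
    """Patch exceptions back in and undo the delta step."""
    values = list(normal)
    for position, value in exceptions:
        values[position] = value
    doc_ids, total = [], 0
    for gap in values:
        total += gap
        doc_ids.append(total)
    return doc_ids


doc_ids = [3, 7, 8, 300, 302, 305]
gaps = deltas(doc_ids)                          # [3, 4, 1, 292, 2, 3]
normal, exceptions = split_exceptions(gaps, bits=3)
print(normal)      # [3, 4, 1, 0, 2, 3]
print(exceptions)  # [(3, 292)]
assert restore(normal, exceptions) == doc_ids
```

In this sketch, unpacking the normal slots and patching the exceptions are separate passes, which matches the way the report above splits each function's time between normal values and exception values.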