Examples

The following uses the enwiki-latest-pages-articles dataset as an example to describe how to test the performance of the KNewPfordelta APIs.

Configuring the Dataset Generation Environment

Obtain the source code of the KNewPfordelta APIs as instructed in Compiling KNewPfordelta. The source code includes the test framework code.
Compile the source code as instructed in Compiling KNewPfordelta.

Install Python 3.

yum install python3 python3-devel python3-pip

Install conda.
1. Obtain conda.
2. Upload the downloaded .sh file to the server and run the following command (using Anaconda3-2025.06-1-Linux-aarch64.sh as an example):
```
bash Anaconda3-2025.06-1-Linux-aarch64.sh
```
  During installation, the license terms will be displayed. Press Enter repeatedly to scroll through the content. When prompted with "Do you accept the license terms?", type yes to proceed. When prompted with "Do you wish to initialize Anaconda?", type yes.
3. When the installation is complete, run the following command to activate the Anaconda environment:
```
source ~/.bashrc
```
4. Run the following command to verify the installation:
```
conda --version
```
Install the required Python modules (nltk and numpy).
```
conda install nltk numpy
```

Install nltk_data.

Obtain nltk_data.

git clone https://github.com/nltk/nltk_data.git

Rename the packages directory in the root of the cloned repository to nltk_data.
```
cd /path/to/nltk_data
mv packages nltk_data
```
Place nltk_data in a directory that NLTK searches.
This example stores nltk_data in /usr/local/nltk_data. After copying it, unzip punkt.zip and punkt_tab.zip in /usr/local/nltk_data/tokenizers/.
```
cp -r ./nltk_data /usr/local
cd /usr/local/nltk_data/tokenizers
unzip punkt.zip
unzip punkt_tab.zip
```

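Download punkt_tab and stopwords from the Python interpreter. The ssl lines in the session disable HTTPS certificate verification for the download.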

python
>>> import nltk
>>> import ssl
>>> try:
...     _create_unverified_https_context = ssl._create_unverified_context
... except AttributeError:
...     pass
... else:
...     ssl._create_default_https_context = _create_unverified_https_context
... 
>>> nltk.download('punkt_tab')
>>> nltk.download('stopwords')
>>> exit()
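
To confirm that the resources ended up where expected, the data directories can be inspected with a short script. The following is a minimal sketch that uses only the standard library; `find_nltk_resource` is a hypothetical helper, the two directories are assumptions based on the steps above, and the check only tests for a directory or .zip file rather than reproducing NLTK's full lookup logic.

```python
import os


def find_nltk_resource(resource, search_dirs):
    """Return the first path where `resource` exists as a directory or .zip file, else None."""
    for base in search_dirs:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.isdir(candidate):
            return candidate
        if os.path.isfile(candidate + ".zip"):
            return candidate + ".zip"
    return None


# Assumed locations: the directory used in this example and the per-user directory.
dirs = ["/usr/local/nltk_data", os.path.expanduser("~/nltk_data")]
for res in ("tokenizers/punkt_tab", "corpora/stopwords"):
    print(res, "->", find_nltk_resource(res, dirs) or "NOT FOUND")
```

If a resource is reported as missing even though it was copied, NLTK may not be searching that directory in your environment; setting the NLTK_DATA environment variable to the directory is one way to address this.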

Generating Datasets

The following describes how to generate an inverted index dataset for articles.

Go to the dataset generation directory and download the publicly available wiki articles dump in .bz2 format, for example, enwiki-latest-pages-articles1.xml-p1p41242.bz2.
```
cd /path/to/knewpfordelta/test/gen_data
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2 --no-check-certificate
```
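
Before converting the dump, it can be worth checking that the download is a readable bzip2 archive. The sketch below decompresses only the first few lines; `peek_dump` is a hypothetical helper and is not part of the KNewPfordelta test framework.

```python
import bz2


def peek_dump(path, max_lines=5):
    """Return the first `max_lines` decompressed lines of a .bz2 file as text."""
    lines = []
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    return lines
```

For example, calling `peek_dump("enwiki-latest-pages-articles1.xml-p1p41242.bz2")` should return lines beginning with the `<mediawiki` root element of the XML dump. A truncated or corrupted download typically raises an error during decompression instead.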

Convert the downloaded data file into a JSON file.

python parse_articles_xml.py enwiki-latest-pages-articles1.xml-p1p41242.bz2

Convert the JSON file (inverted_index_articles.json) into a term-frequency file (term_list_articles.txt).
```
python export_terms_and_freq.py inverted_index_articles.json
```
Convert the JSON file to a binary file containing document IDs (doc_id_chunk_articles.bin).
```
python rebuild_inverted_index_for_chunk.py inverted_index_articles.json
```
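
For readers unfamiliar with inverted indexes, the idea behind the generated data can be sketched as follows. This is a toy illustration only: the function names and the sample documents are made up, the tokenization is a plain whitespace split rather than the NLTK-based processing, and the file formats written by the scripts above are not reproduced. PForDelta-style codecs commonly operate on the gaps between sorted document IDs; whether the generated .bin file stores gaps or raw IDs is not stated here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def to_gaps(doc_ids):
    """Delta-encode a sorted ID list: the first ID, then differences between neighbours."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


# Hypothetical sample documents keyed by document ID.
docs = {1: "data compression", 4: "integer compression", 9: "index compression"}
index = build_inverted_index(docs)
print(index["compression"])           # [1, 4, 9]
print(to_gaps(index["compression"]))  # [1, 3, 5]
```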

Performing Tests

Perform a decompression test.
Go to the test directory.
```
cd /path/to/knewpfordelta
```
- Perform a performance test. The input parameters are the term-frequency .txt file and the binary .bin file.
```
numactl -m 0 -c 0 ./newpfordelta_perf /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
```
  The command prints the KNewPfordelta performance test result. In the output, speed indicates the decompression performance.
- Perform a function test.
```
numactl -m 0 -c 0 ./newpfordelta_ut /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
```
  The command output displays a message indicating that decompression succeeded.
Test the performance of the unpack functions when handling exceptions.
1. Delete the files generated by the previous build.
```
make clean
```
2. Recompile KNewPfordelta with the perf=1 option.
```
make perf=1
```
3. Perform a performance test.
```
numactl -m 0 -c 0 ./newpfordelta_perf /path/to/term_list_articles.txt /path/to/doc_id_chunk_articles.bin
```
  The command prints the KNewPfordelta performance test result. For each unpack function, the output shows the running time split into time spent on normal values and time spent on exception values, together with the proportion of that function's total running time that each part accounts for. Calls indicates how many times each unpack function is called.
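
The distinction between normal values and exception values comes from how PForDelta-style codecs pack integers: values that fit in a chosen bit width b are packed directly, and larger values are handled separately as exceptions. The sketch below illustrates the idea in the style of NewPFD, where the low b bits stay in place and the high bits are stored with their positions. It is a generic illustration under that assumption; the function names are hypothetical and the actual KNewPfordelta data layout is not described here.

```python
def split_exceptions(values, b):
    """Split values into b-bit low parts and (position, high bits) exceptions."""
    mask = (1 << b) - 1
    low_bits = [v & mask for v in values]
    exceptions = [(i, v >> b) for i, v in enumerate(values) if v > mask]
    return low_bits, exceptions


def merge_exceptions(low_bits, exceptions, b):
    """Inverse of split_exceptions: patch the high bits back into place."""
    values = list(low_bits)
    for pos, high in exceptions:
        values[pos] |= high << b
    return values


gaps = [3, 1, 7, 130, 2, 5]  # made-up sample values
low, exc = split_exceptions(gaps, b=3)
print(low)  # [3, 1, 7, 2, 2, 5]
print(exc)  # [(3, 16)]
assert merge_exceptions(low, exc, 3) == gaps
```

In a layout like this, decoding the low bits is a uniform operation over the whole block, while patching exceptions is extra per-value work, which is why timing the two parts separately can be informative.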
