Adding the KScaNN Algorithm

ANN-Benchmarks contains many algorithms. You can also use KScaNN, an algorithm developed by Huawei, to search datasets. To add the KScaNN algorithm, perform the following steps:

  1. Add the implementation of the KScaNN algorithm.
    1. Open the module.py file.
      vim /data/ann-benchmarks-main/ann_benchmarks/algorithms/milvus/module.py 
      
    2. Add the following content at the end of the file:
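      class MilvusKSCANN(Milvus):
          """Milvus wrapper that builds and queries a KSCANN index."""

          def __init__(self, metric, dim, index_param):
              super().__init__(metric, dim, index_param)
              # Index build parameters; Table 1 describes each of them.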
              self._index_n_leaves = index_param.get("n_leaves", None)
              self._index_dims_per_block = index_param.get("dims_per_block", None)
              self._index_avq_threshold = index_param.get("avq_threshold", None)
              self._index_soar_lambda = index_param.get("soar_lambda", None)
              self._index_overretrieve_factor = index_param.get("overretrieve_factor", None)
              self._index_train_thread = index_param.get("train_thread", None)
      
          def get_index_param(self):
              return {
                  "index_type": "KSCANN",
                  "params": {
                      "n_leaves": self._index_n_leaves,
                      "dims_per_block": self._index_dims_per_block,
                      "avq_threshold": self._index_avq_threshold,
                      "soar_lambda": self._index_soar_lambda,
                      "overretrieve_factor": self._index_overretrieve_factor,
                      "train_thread": self._index_train_thread
                  },
                  "metric_type": self._metric_type
              }
      
      
          def set_query_arguments(self, query_args):
              # Each query_args entry in config.yaml lists its values in this order.
              nprobe, reorder, adp_threshold, adp_refined, num_thread, batch_size = query_args
              self.search_params = {
                  "metric_type": self._metric_type,
                  "params": {
                      "nprobe": nprobe,
                      "reorder": reorder,
                      "adp_threshold": adp_threshold,
                      "adp_refined": adp_refined,
                      "num_thread": num_thread,
                      "batch_size": batch_size,
                      "k": 10
                  }
              }
              self.name = f"MilvusKScaNN metric:{self._metric}, index_n_leaves:{self._index_n_leaves}, index_dims_per_block:{self._index_dims_per_block}, index_avq_threshold:{self._index_avq_threshold}, index_soar_lambda:{self._index_soar_lambda}, index_overretrieve_factor:{self._index_overretrieve_factor}, index_train_thread:{self._index_train_thread}, search_nprobe:{nprobe}, search_reorder:{reorder}, search_adp_threshold={adp_threshold}, search_adp_refined={adp_refined}, search_num_thread={num_thread}, search_batch_size={batch_size}"
      
  2. Add the KScaNN algorithm configuration.
    1. Open the config.yaml file.
      vim /data/ann-benchmarks-main/ann_benchmarks/algorithms/milvus/config.yaml
      
    2. Add the following content at the end of the file:
      - base_args: ["@metric", "@dimension"]
        constructor: MilvusKSCANN
        disabled: false
        docker_tag: ann-benchmarks-milvus
        module: ann_benchmarks.algorithms.milvus
        name: milvus-kscann
        run_groups:
          KScaNN1:
            args:
              n_leaves: [2000]
              dims_per_block: [4]
              avq_threshold:
              soar_lambda: [-1]
              overretrieve_factor: [-1]
              train_thread: [16]
            query_args:
              # [
              #   [210],  # nprobe
              #   [900],  # reorder
              #   [0.2],  # adp_threshold
              #   [0],  # adp_refined
              #   [1], # num_thread
              #   [1] # batch_size
              # ]
              [[[10, 100, 0, 0, 1, 1], [15, 140, 0, 0, 1, 1], [25, 160, 0, 0, 1, 1], [35, 190, 0, 0, 1, 1], [40, 200, 0, 0, 1, 1], [45, 220, 0, 0, 1, 1], [50, 240, 0, 0, 1, 1], [60, 250, 0, 0, 1, 1], [70, 300, 0, 0, 1, 1], [80, 400, 0, 0, 1, 1], [100, 500, 0, 0, 1, 1], [120, 600, 0, 0, 1, 1], [150, 800, 0, 0, 1, 1], [200, 900, 0, 0, 1, 1], [250, 900, 0, 0, 1, 1]]]
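
Each inner list under query_args holds six values in the order that set_query_arguments unpacks them: nprobe, reorder, adp_threshold, adp_refined, num_thread, and batch_size. As a minimal sketch, the following standalone Python snippet mirrors that unpacking without a Milvus connection. The helper name to_search_params, the QUERY_ARG_NAMES tuple, and the default metric type "L2" are assumptions made for illustration only; they are not part of ANN-Benchmarks or Milvus.

```python
# Illustrative only: to_search_params and QUERY_ARG_NAMES are hypothetical
# names, not part of ANN-Benchmarks or Milvus.
QUERY_ARG_NAMES = (
    "nprobe", "reorder", "adp_threshold",
    "adp_refined", "num_thread", "batch_size",
)


def to_search_params(query_args, metric_type="L2", k=10):
    """Map one six-element query_args entry to a search-parameter dict."""
    if len(query_args) != len(QUERY_ARG_NAMES):
        raise ValueError(
            f"expected {len(QUERY_ARG_NAMES)} values, got {len(query_args)}"
        )
    params = dict(zip(QUERY_ARG_NAMES, query_args))
    params["k"] = k  # MilvusKSCANN.set_query_arguments hard-codes k to 10
    return {"metric_type": metric_type, "params": params}


# First entry of the KScaNN1 run group in config.yaml.
search_params = to_search_params([10, 100, 0, 0, 1, 1])
print(search_params["params"]["nprobe"], search_params["params"]["reorder"])
```

An entry with fewer or more than six values raises ValueError here, which is roughly how the tuple unpacking in set_query_arguments would also fail.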
      

Table 1 describes the KScaNN parameters. The reference values are determined based on query precision, memory consumption, and time consumption. You can set the parameters as required.

Table 1 KScaNN parameters

| Parameter | Description | Value Type and Range | Configuration Reference | Configuration Principle |
| --- | --- | --- | --- | --- |
| n_leaves | Number of leaf nodes. | Integer | [2000] | This parameter affects the graph construction time and final index quality. If the value is too large, the construction time may be too long and the search performance may deteriorate. If the value is too small, the search precision may be affected. |
| dims_per_block | Number of dimensions that form a sub-vector block in the product quantization (PQ) phase during graph construction. | Integer | [4] | The value 4 is recommended. You may adjust the value as required. |
| avq_threshold | AVQ threshold during graph construction. | Float | None | This parameter affects the pruning policy. Generally, the value 0.2 is used for the IP dataset. For the L2 dataset, this parameter is left empty. |
| soar_lambda | Orthogonality configuration. This parameter takes effect only for the IP dataset. | Float, greater than 0 | [-1] | -1 indicates that this parameter is not used. When using the IP dataset, you may adjust the value as required. |
| overretrieve_factor | Used together with soar_lambda to specify the over-retrieval factor. This parameter takes effect only for the IP dataset. | Float, [1, 2] | [-1] | -1 indicates that this parameter is not used. When using the IP dataset, you may adjust the value as required. |
| train_thread | Number of threads during graph construction. | Integer | [Number of CPU cores] | Set this parameter to the number of CPU cores unless otherwise specified. |
| nprobe | Number of subspaces that a complex query will search within. | Integer, [1, n_leaves] | [200] | You may adjust the value as required. |
| reorder | Number of results saved before reordering. | Integer, [k, Number of records in the dataset] | [900] | k indicates the number of final results returned. You may adjust the value as required. |
| adp_threshold | Adaptive truncation threshold during query. Reserved. | Float, [0, 0.8] | [0] | You may adjust the value as required. |
| adp_refined | Number of subspaces that a simple query will search within. Reserved. | Integer, [0, nprobe] | [0] | The typical value is 0. However, the value range is [1, nprobe] for search and recommendation scenarios. In this document, the value range is [0, nprobe]. You may adjust the value as required. |
| num_thread | Number of threads during query. | Integer, greater than or equal to 1 | [1] | Set it to 1 unless otherwise specified. |
| batch_size | Size of the preferred batch during automatic parallel batching. | Integer, greater than or equal to 1 | [1] | Set it to 1 unless otherwise specified. |
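
The query-parameter ranges in Table 1 can be checked before a benchmark run. The following is a minimal sketch that assumes the ranges listed in the table. The function name check_query_args is hypothetical and is not part of ANN-Benchmarks or Milvus. The upper bound of reorder (the number of records in the dataset) is not checked because it depends on the dataset in use.

```python
# Illustrative only: check_query_args is a hypothetical helper that encodes
# the query-parameter ranges listed in Table 1.
def check_query_args(query_args, n_leaves, k=10):
    """Return a list of range violations for one query_args entry."""
    nprobe, reorder, adp_threshold, adp_refined, num_thread, batch_size = query_args
    errors = []
    if not 1 <= nprobe <= n_leaves:
        errors.append(f"nprobe={nprobe} is outside [1, {n_leaves}]")
    if reorder < k:
        errors.append(f"reorder={reorder} is smaller than k={k}")
    if not 0 <= adp_threshold <= 0.8:
        errors.append(f"adp_threshold={adp_threshold} is outside [0, 0.8]")
    if not 0 <= adp_refined <= nprobe:
        errors.append(f"adp_refined={adp_refined} is outside [0, {nprobe}]")
    if num_thread < 1:
        errors.append(f"num_thread={num_thread} must be at least 1")
    if batch_size < 1:
        errors.append(f"batch_size={batch_size} must be at least 1")
    return errors


# A typical entry from the configuration reference column, and an entry
# whose nprobe exceeds n_leaves and whose reorder is below k.
valid = check_query_args([200, 900, 0, 0, 1, 1], n_leaves=2000)
invalid = check_query_args([2500, 5, 0, 0, 1, 1], n_leaves=2000)
print(valid)
print(invalid)
```

Returning a list of messages rather than raising on the first violation lets every entry in a run group be reviewed in one pass.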