Initialize

API Definition

  1. Status ScannInterface::Initialize(ConstSpan<float> dataset, DatapointIndex n_points, const std::string& config, int training_threads);
  2. Status ScannInterface::Initialize(ConstSpan<float> dataset, DatapointIndex n_points, const std::string& config, int training_threads, GmmUtils::KMeansParams kmOpt);
  3. Status ScannInterface::Initialize(ConstSpan<float> dataset, DatapointIndex n_points, const std::string& config, int training_threads, GmmUtils::KMeansParams kmOpt, float filter_thr, int filter_type);

Function

  1. Index construction (consistent with the open-source algorithm).
  2. Index construction. An overload of the Initialize function that sets extra parameters for tuning K-means clustering in both IVF and PQ (KScaNN-specific API).
  3. Index construction. Another overload of the Initialize function that sets extra parameters for K-means clustering and component filtering in both IVF and PQ (KScaNN-specific API).

Parameters

| Parameter | Data Type | Description | Value Range |
| --- | --- | --- | --- |
| dataset | ConstSpan<float> | Base library vectors. | Cannot be null. |
| n_points | DatapointIndex | Number of vectors in the base library. | Must match the number of vectors in dataset. |
| config | const std::string& | Configuration file required for creating the index, containing all configuration parameters. | - |
| training_threads | int | Number of threads used during index construction. | ≥ 1 |
| kmOpt | GmmUtils::KMeansParams | K-means tuning parameters. | - |
| filter_thr | float | Filter threshold. | [0, 1]. The default value is 0. |
| filter_type | int | Filter type. | 0 (default): filtering based on the number of zero elements in the vector. 1: filtering based on the deviation of elements from the mean value. |
| config_pbtxt | const std::string& | Configuration file required for loading the index. | - |
| scann_assets_pbtxt | const std::string& | Index file list. | - |
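The value ranges in the parameter table can be checked before calling Initialize. The following is a minimal sketch, not part of KScaNN: CheckInitializeArgs is a hypothetical helper, std::vector<float> stands in for ConstSpan<float>, "cannot be null" is interpreted as "not empty", and the assumption that dataset stores n_points * dim floats contiguously is ours rather than something this page states.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a KScaNN API). Returns an empty string when the
// arguments satisfy the documented ranges, otherwise a description of the
// first violation found.
// Assumption: dataset is a flat buffer holding n_points * dim floats.
std::string CheckInitializeArgs(const std::vector<float>& dataset,
                                std::size_t n_points, std::size_t dim,
                                int training_threads, float filter_thr,
                                int filter_type) {
  if (dataset.empty()) return "dataset must not be empty";
  if (dim == 0 || dataset.size() != n_points * dim)
    return "n_points does not match dataset";
  if (training_threads < 1) return "training_threads must be >= 1";
  if (filter_thr < 0.0f || filter_thr > 1.0f)
    return "filter_thr must be in [0, 1]";
  if (filter_type != 0 && filter_type != 1)
    return "filter_type must be 0 or 1";
  return "";
}
```

A caller would invoke Initialize only when the returned string is empty, and otherwise report the message.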

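kmOpt is a KMeansParams structure. Its two members, ivf and pq, are KMeansTunableExtraParams structures that hold the K-means settings for IVF and PQ respectively. Table 1 describes the parameters.

struct KMeansTunableExtraParams {
        int32_t iter;    // Number of K-means iterations.
        int32_t sample;  // Sample size.
        int32_t init;    // Initialization type of the K-means cluster centers.
};

struct KMeansParams {
        KMeansTunableExtraParams ivf;  // K-means settings for IVF.
        KMeansTunableExtraParams pq;   // K-means settings for PQ.
};

Table 1 Parameters of the KMeansParams structure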

| Parameter | Data Type | Description | Value Range |
| --- | --- | --- | --- |
| iter | int32_t | Number of K-means algorithm iterations. | [0, MAXINT]. The default value is 0. |
| sample | int32_t | Sample size. | - |
| init | int32_t | Initialization type of the K-means cluster centers. | {0, 1, 2, 3}. The default value is 0. 0: do not enable K-means optimization in PQ. 1: initialization based on the average distance. 2: initialization based on the K-means++ algorithm. 3: random initialization. |
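As an illustration of how the fields in Table 1 might be filled in, the sketch below mirrors the two structures locally so that it compiles without the KScaNN headers. MakeExampleKMeansParams is a hypothetical helper, and the iteration count is an arbitrary illustrative value rather than a recommendation. sample is left at 0 because its value range is not documented here.

```cpp
#include <cstdint>

// Local mirror of the structures shown above, so this sketch compiles on its
// own. Real code would use GmmUtils::KMeansParams from the KScaNN headers.
struct KMeansTunableExtraParams {
  std::int32_t iter;
  std::int32_t sample;
  std::int32_t init;
};

struct KMeansParams {
  KMeansTunableExtraParams ivf;
  KMeansTunableExtraParams pq;
};

// Hypothetical helper: tunes K-means for IVF and leaves PQ at its defaults.
KMeansParams MakeExampleKMeansParams() {
  KMeansParams params{};  // value-initialization sets every field to 0
  params.ivf.iter = 10;   // illustrative iteration count
  params.ivf.init = 2;    // 2: initialization based on K-means++
  params.pq.init = 0;     // 0: documented default
  return params;
}
```

The resulting object would be passed as the kmOpt argument of the second or third Initialize overload.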

config is generated by create_config.py based on the parameters described in Table 2.

Table 2 Parameters

| Parameter | Data Type | Description | Value Range |
| --- | --- | --- | --- |
| n_leaves | int | Total number of subspaces in the IVF partition. | ≥ 1 |
| nb | int32_t | Number of vectors in the base library. | Must match the number of vectors in dataset. |
| metricType | std::string | Distance type of the vector. | dot_product or squared_l2. |
| dims_per_block | int | Number of dimensions combined by PQ. | [1, dim], where dim is the dimension of the base library vectors. |
| avq_threshold | float | Asymmetric bucket parameter. This parameter takes effect only for the L2 (squared_l2) dataset. | [0, 1] |
| dim | int32_t | Dimension of the vectors in the base library. | Must match the dimension of the vectors in dataset. |
| topK | int | Number of returned results. | ≥ 1 |
| soar_lambda | float | Controls orthogonality. This parameter takes effect only for the IP (dot_product) dataset. | > 0. Set the value to -1 to disable the function. |
| overretrieve_factor | float | Used together with soar_lambda to specify the over-retrieval factor. This parameter takes effect only for the IP (dot_product) dataset. | [1, 2]. Set the value to -1 to disable the function. |
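The ranges in Table 2 can be checked before create_config.py is run. The following sketch assumes a hypothetical ConfigParams container and IsValidConfig function, neither of which is part of KScaNN. The checks nb ≥ 1 and dim ≥ 1 are our assumptions, since Table 2 only requires those values to match dataset.

```cpp
#include <string>

// Hypothetical container for the Table 2 parameters (not a KScaNN type).
struct ConfigParams {
  int n_leaves;
  int nb;
  std::string metricType;
  int dims_per_block;
  float avq_threshold;
  int dim;
  int topK;
  float soar_lambda;
  float overretrieve_factor;
};

// Returns true when every field lies in the range listed in Table 2.
// Assumption: nb and dim must be at least 1 (not stated in the table).
bool IsValidConfig(const ConfigParams& p) {
  const bool metric_ok =
      p.metricType == "dot_product" || p.metricType == "squared_l2";
  // -1 disables the feature; otherwise the documented range applies.
  const bool soar_ok = p.soar_lambda > 0.0f || p.soar_lambda == -1.0f;
  const bool over_ok =
      (p.overretrieve_factor >= 1.0f && p.overretrieve_factor <= 2.0f) ||
      p.overretrieve_factor == -1.0f;
  return p.n_leaves >= 1 && p.nb >= 1 && p.dim >= 1 && metric_ok &&
         p.dims_per_block >= 1 && p.dims_per_block <= p.dim &&
         p.avq_threshold >= 0.0f && p.avq_threshold <= 1.0f &&
         p.topK >= 1 && soar_ok && over_ok;
}
```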

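config is generated by running create_config.py with the Table 2 parameters as positional arguments. The command string can be assembled in C++ as follows:

// Arguments are passed in this exact order, separated by single spaces.
std::string command = "python create_config.py " + std::to_string(n_leaves) + " "
                         + std::to_string(nb) + " "
                         + metricType + " "
                         + std::to_string(dims_per_block) + " "
                         + std::to_string(avq_threshold) + " "
                         + std::to_string(dim) + " "
                         + std::to_string(topK) + " "
                         + std::to_string(soar_lambda) + " "
                         + std::to_string(overretrieve_factor);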

Return Value

| Data Type | Description |
| --- | --- |
| Status | Execution status of the method. You can determine whether the method is successfully executed by calling status.ok(). |
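A caller would typically branch on status.ok() right after Initialize returns. The sketch below uses a stand-in Status class so that it compiles on its own. Only the ok() member comes from this page; the constructor and the BuildSucceeded helper are hypothetical.

```cpp
#include <iostream>

// Stand-in for the library's Status type, for illustration only.
// Only ok() is taken from this page; the constructor is hypothetical.
class Status {
 public:
  explicit Status(bool ok) : ok_(ok) {}
  bool ok() const { return ok_; }

 private:
  bool ok_;
};

// Hypothetical helper: reports a failed index construction and returns
// whether the returned status indicates success.
bool BuildSucceeded(const Status& status) {
  if (!status.ok()) {
    std::cerr << "Initialize failed\n";
    return false;
  }
  return true;
}
```

In real code, the Status returned by ScannInterface::Initialize would be checked in the same way before the index is searched.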