Running and Verification
After ScaNN is compiled and installed, obtain the test data set and run a test script to verify that ScaNN works properly.
- Go to the planned path for verifying ScaNN.
cd /path/to/scann_test
- Download the test data set.
wget http://ann-benchmarks.com/glove-100-angular.hdf5 --no-check-certificate
- Create and edit the ScaNN test script scann_test.py.
- Create a scann_test.py file.
vi scann_test.py
- Press i to enter the insert mode and add the following content to the scann_test.py file:

import numpy as np
import h5py
import time
import scann


def compute_recall(neighbors, true_neighbors):
    # Fraction of ground-truth neighbors that appear in the returned neighbors.
    total = 0
    for gt_row, row in zip(true_neighbors, neighbors):
        total += np.intersect1d(gt_row, row).shape[0]
    return total / true_neighbors.size


def main():
    print("Load dataset: glove-100-angular.hdf5")
    glove_h5py = h5py.File("glove-100-angular.hdf5", "r")
    print("Dataset keys:", list(glove_h5py.keys()))
    dataset = glove_h5py['train']
    queries = glove_h5py['test']
    print("Train size: ", dataset.shape)
    print("Queries size:", queries.shape)

    print("\nCreate ScaNN searcher")
    start = time.time()
    # Scale every training vector to unit length before indexing.
    normalized_dataset = dataset / np.linalg.norm(dataset, axis=1)[:, np.newaxis]
    # Return 10 neighbors per query, scored by dot product.
    searcher = scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product").tree(
        num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(
        2, anisotropic_quantization_threshold=0.2).reorder(100).build()
    end = time.time()
    print("Time (s):", end - start)

    print("\n1.Batched-query: queries")
    start = time.time()
    neighbors, distances = searcher.search_batched(queries)
    end = time.time()
    # Compare against the first 10 ground-truth neighbors of each query.
    print("Recall:", compute_recall(neighbors, glove_h5py['neighbors'][:, :10]))
    print("Time (s):", end - start)

    print("\n2.Single-query: queries[0]")
    start = time.time()
    neighbors, distances = searcher.search(queries[0], final_num_neighbors=5)
    end = time.time()
    print("neighbors:", neighbors)
    print("distances:", distances)
    print("Time (ms):", 1000*(end - start))


if __name__ == "__main__":
    main()

- Press Esc, type :wq, and press Enter to save the file and exit.
- Run the test.
python3 scann_test.py
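Before building the searcher, the test script scales every training vector to unit length and then searches with the "dot_product" measure. With unit-length data vectors, ranking by dot product gives the same order as ranking by cosine (angular) similarity, which is presumably why the script normalizes the angular data set. The following is a minimal pure-Python sketch of that equivalence. The toy vectors and the helper names normalize, dot, and cosine are invented for illustration and are not part of ScaNN.

```python
import math


def normalize(vec):
    # Scale a vector to unit length (hypothetical helper, not a ScaNN API).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity computed on the raw, unnormalized vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy data, invented for this example.
dataset = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
query = [1.0, 2.0]

normalized = [normalize(v) for v in dataset]

# Rank data points by dot product against the normalized vectors...
by_dot = sorted(range(len(dataset)), key=lambda i: -dot(query, normalized[i]))
# ...and by cosine similarity against the original vectors.
by_cos = sorted(range(len(dataset)), key=lambda i: -cosine(query, dataset[i]))

print("ranking by dot product on normalized data:", by_dot)
print("ranking by cosine on raw data:            ", by_cos)
```

The query is left unnormalized in this sketch, as in the test script. Scaling the query multiplies every score by the same positive factor, so the ranking does not change.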
According to the command output, the test program loads the glove-100-angular data set (100 dimensions, about 1.18 million training records, and 10,000 query records), creates a ScaNN searcher, and queries data in two modes:
- Batched mode: All queries in the query set are searched in one batch, and 10 neighbors are returned for each query. In this example run, the recall rate is 0.89965.
- Single mode: The query whose index is 0 in the query set (queries[0]) is searched. The five nearest neighbors and their distances are returned.
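The recall rate reported in batched mode is the share of ground-truth neighbors that the searcher actually returned, accumulated over all queries. The following pure-Python sketch mirrors compute_recall from the test script on invented toy IDs. It uses set intersection in place of np.intersect1d, which assumes that no row contains duplicate IDs. The function name toy_recall is hypothetical.

```python
def toy_recall(neighbors, true_neighbors):
    # Count how many ground-truth IDs show up in each returned row,
    # then divide by the total number of ground-truth IDs.
    hits = 0
    total = 0
    for gt_row, row in zip(true_neighbors, neighbors):
        hits += len(set(gt_row) & set(row))
        total += len(gt_row)
    return hits / total


# Toy ground truth and search results for two queries (invented values).
true_neighbors = [[1, 2, 3], [4, 5, 6]]
found_neighbors = [[1, 2, 9], [6, 5, 4]]

recall = toy_recall(found_neighbors, true_neighbors)
print("Recall:", recall)
```

In this toy case, the first query finds two of its three true neighbors and the second finds all three, regardless of order, so five of six ground-truth IDs are recovered.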
If the test program reports no error, the recall rate in batched mode is similar to that in the preceding figure, and the data queried in single mode is the same as that in the preceding figure, ScaNN is working properly.
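The recall comparison can also be done with a small helper instead of by eye. The sketch below takes the reference value 0.89965 from the example run described in this section. The tolerance of 0.01 and the helper name recall_is_similar are assumptions made for illustration; this document does not define how close "similar" must be.

```python
REFERENCE_RECALL = 0.89965  # recall from the example run in this section
TOLERANCE = 0.01            # assumption: acceptable deviation, chosen for illustration


def recall_is_similar(measured, reference=REFERENCE_RECALL, tolerance=TOLERANCE):
    # True when the measured recall lies within the tolerance of the reference.
    return abs(measured - reference) <= tolerance


# Example values, invented to show both outcomes.
print(recall_is_similar(0.8990))
print(recall_is_similar(0.7500))
```

A small deviation from the reference recall is plausible because building the index involves training on sampled data, so choose a tolerance that suits your environment.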