Billion-Scale Vector Search: Team Sisu and BuddyPQ
This blog goes into detail about the work we did as Team Sisu for the Billion-Scale Approximate Nearest Neighbor Challenge: https://big-ann-benchmarks.com/index.html
Long story short: we improved recall by 12% on 10M datasets over the baseline FAISS implementation.
Update: we have presented part of this work at the NLP meetup hosted by deepset.
Team:
Dmitry Kan, Alexander Semenov, Aarne Talman, Leo Joffe— Silo.AI
Max Irwin — OpenSource Connections
Alex Klibisz — https://alexklibisz.com/
Areas of research
We focused on the following research directions:
- Statistical analysis for vector recomposition in FAISS: BuddyPQ
- Sharding — new algorithms
- Language-agnostic benchmark evaluation
- Scaling Lucene-based algorithms
In this blog post we will cover the first item, leaving the rest to their own blog posts.
We are thankful to the Microsoft Azure team for kindly providing a grant for our research, which enabled us to train our models and perform large-scale indexing and search on the Azure computing platform.
Statistical analysis for vector recomposition in FAISS: BuddyPQ
We introduce an algorithm entry ‘buddy-t1’, which is an implementation of BuddyPQ — a preprocessing/enrichment step for datasets used with Product Quantization for Approximate Nearest Neighbor search.
BuddyPQ seeks to improve on FAISS IVF-PQ with a simple yet effective technique: grouping similar dimensions together ("buddy dimensions") prior to subvector assignment in product quantization. We show significant improvement for three datasets in the 10M size category, and with further work may see improvement on at least one 1B-sized dataset. The technique is general and can be applied to any Product Quantization search, not only FAISS.
Background
Product Quantization works by first splitting a D-dimensional vector into m subvectors, and then finding k centroids for each subvector. For example, the BIGANN dataset has D=128; if each subvector covers 8 dimensions, then the number of subvectors is m=16. Once the vectors are split, Product Quantization uses various techniques to find the centroids that represent each subvector.
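To make the parameters concrete, here is a minimal sketch of an IVF-PQ index in Faiss with this split; the nlist value and the random training sample are illustrative, not the competition configuration.

```python
import numpy as np
import faiss

d = 128        # BIGANN dimensionality
m = 16         # number of subvectors, each covering d / m = 8 dimensions
nbits = 8      # 8 bits per code -> k = 256 centroids per sub-quantizer
nlist = 1024   # number of IVF cells (illustrative, not the competition setting)

# IVF-PQ: a coarse quantizer partitions the space into nlist cells, and
# product quantization compresses the vectors assigned to each cell.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xt = np.random.rand(50_000, d).astype("float32")  # stand-in training sample
index.train(xt)
index.add(xt)
```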
Hypothesis
BuddyPQ hypothesizes that the centroids found for a subvector during product quantization will more effectively represent that subvector space if the dimensions grouped into it have a) lower overall variance, b) similar distributions, and/or c) multicollinearity. For some optimized arrangement of these subvectors, the index should yield higher recall and be faster to query than an unoptimized arrangement of subvector dimensions.
For example, imagine a dataset with dimensions 0,1,2,3,4,5,6,7 and a subvector size of 4. Traditionally in product quantization, this will result in two subvectors for each point, with dimensions 0,1,2,3 and 4,5,6,7 respectively. However, BuddyPQ postulates that if dimensions 0,3,5,6 are correlated with each other in some way, then the subvectors should instead be 0,3,5,6 and 1,2,4,7. Importantly, this rearrangement must be performed for each vector at index time and for each query vector at query time.
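As a toy illustration of the reordering in numpy (the correlation pattern here is assumed, matching the example above):

```python
import numpy as np

points = np.random.rand(5, 8).astype("float32")   # toy dataset, D=8

# Default PQ split with subvector size 4: dimensions (0,1,2,3) and (4,5,6,7).
# BuddyPQ instead reorders the columns so correlated dimensions share a
# subvector, e.g. (0,3,5,6) and (1,2,4,7).
buddy_order = [0, 3, 5, 6, 1, 2, 4, 7]
reordered = points[:, buddy_order]

# The same reordering must be applied to every query vector at search time.
```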
Null Hypothesis
Prior to finding an optimized subvector arrangement, we must first show that any rearrangement yields different results at all. We therefore randomized the subvector choice for one dataset (SSNPP-10M) to test this, showing the following:
We can see in Table 1 that random subvector assignment does in fact change recall, and can therefore be optimized. The random assignment chosen for the above test is stored in a randos table keyed by dimensionality.
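The exact permutations used in the run are not reproduced here; a minimal sketch of how such a random arrangement can be generated (the seed is illustrative) looks like this:

```python
import numpy as np

# Sketch only: the actual permutations used in the experiment are not
# reproduced here. A fixed seed keeps the index-time and query-time
# arrangements identical.
rng = np.random.default_rng(42)
randos = {
    "256": rng.permutation(256).tolist(),   # SSNPP is D=256
}
```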
As SSNPP is D=256, its random subvector arrangement is in randos['256']. Note that an improvement was found for at least one query configuration (row 2 of Table 1 above). Naively, a genetic algorithm could be implemented to fit the dataset, with random assignments evolved to keep ever more ideal subvectors. However, this would take a long time even for small datasets, and was not attempted. Instead, we investigated statistical rearrangement as our optimization, as covered in the next section.
Methods
Rearranging the dimensions of vectors prior to indexing or querying is trivial with numpy, so most of the work in this area is aimed at identifying which dimensions should be grouped together as subvector buddies. The approaches investigated are as follows:
- Distribution similarity, using Kolmogorov-Smirnov (KS) tests
- Multicollinearity, using Variance Inflation Factor (VIF) tests
For each method we obtain similarity values for all dimension pairs, resulting in a DxD matrix per method. For BIGANN, for example, there are 128 dimensions, so a 128x128 matrix holds a test result for every pair of dimensions for distribution similarity, and another equally sized matrix holds the results for multicollinearity.
The investigative code for finding the buddy optimizations illustrated in this section can be found on GitHub here: https://github.com/DmitryKey/big-ann/tree/main/src/algorithms/sharding/clustering
Kolmogorov-Smirnov (KS)
For Distribution similarity, we use Kolmogorov-Smirnov (KS): “In statistics, the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution, or to compare two samples. It is named after Andrey Kolmogorov and Nikolai Smirnov.”
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
Using KS tests, we measure the similarity between the distributions of two dimensions, giving a statistic between 0.0 and 1.0, where 0.0 is most similar and 1.0 is least similar.
The KS-test distribution similarity is computed for all dimension pairs of a given dataset (stored in data_file) with a short python snippet.
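A minimal sketch of that snippet is shown below; it assumes scipy's ks_2samp and that data_file points to a numpy-readable N x D array (the competition datasets ship in their own binary formats, so the loader may need adapting, and the exact code in the repository may differ).

```python
import numpy as np
from scipy.stats import ks_2samp

data_file = "bigann_sample.npy"          # illustrative path
data = np.load(data_file)[:100_000]      # use the first 100,000 points
n_dims = data.shape[1]

# D x D matrix of KS statistics: 0.0 = near-identical distributions,
# values near 1.0 = very different distributions.
ks_matrix = np.zeros((n_dims, n_dims))
for i in range(n_dims):
    for j in range(i + 1, n_dims):
        statistic, _pvalue = ks_2samp(data[:, i], data[:, j])
        ks_matrix[i, j] = ks_matrix[j, i] = statistic
```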
We can visualize the resulting KS matrix for each dataset using a heatmap, where warmer areas indicate less similar, and darker areas more similar, distributions between the dimension pair (x, y).
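Continuing from the KS sketch above, a matplotlib rendering of such a heatmap might look like this (the colormap choice is an assumption):

```python
import matplotlib.pyplot as plt

# With the "hot" colormap, warmer (brighter) cells mark dimension pairs whose
# distributions differ more; darker cells mark more similar pairs.
plt.imshow(ks_matrix, cmap="hot")
plt.colorbar(label="KS statistic")
plt.xlabel("dimension")
plt.ylabel("dimension")
plt.show()
```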
We processed this for every dataset in the competition, resulting in the following KS-test heatmaps for the first 100,000 points of the three datasets BIGANN, DEEP, and MSSPACEV, respectively:
It is clear from the above figures that some dimensions show strong distribution overlap, while others do not. We can therefore use this matrix to find subvectors that overall have highly-similar distributions, which may make them more amenable to centroid discovery and performance.
Multicollinearity (VIF)
For Multicollinearity, we use Variance Inflation Factor (VIF): “In statistics, the variance inflation factor is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis”
https://en.wikipedia.org/wiki/Variance_inflation_factor
As we are not removing dimensions (BuddyPQ does not explore dimensionality reduction), we instead use VIF to group collinear dimensions together as buddies prior to subvector assignment.
For multicollinearity using VIF, the pairwise matrix is computed with a similar python snippet.
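A minimal sketch of such a snippet follows, assuming a pairwise VIF-style score of 1 / (1 - R²), where R² is the squared correlation between two dimensions; the exact statistic computed by network_multicollinearity.py in the repository may differ.

```python
import numpy as np

# Same stand-in loading as in the KS sketch above.
data = np.load("bigann_sample.npy")[:100_000]

# Pairwise VIF-style score: 1 / (1 - R^2) for each pair of dimensions.
corr = np.corrcoef(data, rowvar=False)   # D x D correlation matrix
r_squared = corr ** 2
np.fill_diagonal(r_squared, 0.0)         # ignore self-correlation on the diagonal
vif_matrix = 1.0 / (1.0 - r_squared)
```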
We can visualize this VIF matrix using a heatmap. Shown in this figure is the VIF heatmap for the first 100,000 points of BIGANN:
We can see with significant confidence from the figure that there are collinear relationships between some dimensions.
However, some datasets do not exhibit VIF multicollinearity, for example in SSNPP:
We can see that for SSNPP the highest VIF value is less than ~0.0125, which is very low compared to BIGANN's ~2.25.
Here are the heatmaps for the other two datasets, DEEP and MSSPACEV, each at 100,000 points, shown respectively:
Buddy choices per Dataset
Using the two techniques above, we find buddies for each dataset by naively sorting relationship edge weights in a network. Given a dataset similarity matrix A (from either KS, VIF, or a pooling of the two), we use the values as edge weights between two dimensions represented as nodes in a network. As the original network is complete (fully connected), we prune it to contain only the edges at or above a certain weight percentile (in our examples we used the 75th percentile, resulting in a connected network with density equal to 0.113).
After assembling the graph, we traverse it breadth-first to a distance of S (the subvector size) and print the buddies found for the dataset using the following algorithm:
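A minimal sketch of this traversal (assuming a precomputed D x D similarity matrix in which larger values mean stronger relationships, and interpreting the step as collecting up to S dimensions per BFS group; the exact scripts in the repository may differ):

```python
from collections import deque
import numpy as np

def find_buddies(sim_matrix, subvector_size, percentile=75):
    # Keep only the strongest relationships: edges at or above the chosen percentile.
    d = sim_matrix.shape[0]
    threshold = np.percentile(sim_matrix, percentile)
    adjacency = [np.nonzero(sim_matrix[i] >= threshold)[0] for i in range(d)]

    ordering, visited = [], set()
    for start in range(d):
        if start in visited:
            continue
        # Breadth-first walk from 'start', collecting up to subvector_size buddies.
        group, queue = [], deque([start])
        while queue and len(group) < subvector_size:
            node = queue.popleft()
            if node in visited:
                continue
            visited.add(node)
            group.append(node)
            queue.extend(int(n) for n in adjacency[node] if n not in visited)
        ordering.extend(group)
    return ordering
```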
This results in the following table:
multicollinearity_buddies = {
‘bigann’: [54, 52, 48, 23, 78, 82, 84, 80, 113, 42, 83, 85, 81, 43, 51, 53, 49, 77, 75, 79, 40, 41, 47, 73, 112, 114, 111, 118, 87, 72, 16, 22, 9, 18, 122, 120, 91, 24, 30, 17, 62, 60, 56, 31, 34, 36, 32, 67, 1, 90, 92, 88, 121, 15, 0, 13, 70, 68, 64, 103, 37, 102, 96, 69, 8, 10, 14, 104, 110, 106, 44, 46, 55, 74, 76, 61, 59, 63, 35, 33, 2, 66, 38, 89, 95, 65, 71, 58, 94, 4, 6, 98, 100, 26, 28, 11, 107, 109, 124, 126, 19, 21, 115, 117, 119, 57, 39, 3, 5, 7, 12, 20, 25, 27, 29, 45, 50, 86, 93, 97, 99, 101, 105, 108, 116, 123, 125, 127],
‘deep’: [21, 0, 22, 20, 67, 40, 88, 43, 89, 13, 32, 79, 50, 2, 31, 39, 53, 1, 24, 15, 3, 26, 63, 70, 42, 93, 57, 85, 35, 61, 29, 46, 10, 76, 49, 45, 19, 41, 18, 58, 27, 4, 47, 12, 74, 14, 9, 44, 56, 60, 65, 8, 16, 86, 87, 94, 77, 36, 59, 71, 7, 80, 33, 37, 92, 82, 55, 5, 54, 38, 66, 51, 52, 30, 69, 83, 11, 91, 62, 34, 25, 48, 6, 73, 64, 68, 84, 78, 23, 90, 17, 81, 72, 75, 28, 95],
‘msspacev’: [85, 0, 86, 84, 83, 82, 25, 39, 90, 26, 18, 20, 54, 98, 15, 95, 65, 23, 70, 51, 73, 76, 92, 21, 64, 40, 67, 87, 19, 66, 75, 96, 6, 16, 91, 78, 81, 79, 34, 55, 71, 37, 24, 5, 52, 72, 74, 60, 77, 99, 56, 97, 9, 53, 50, 12, 93, 2, 3, 1, 58, 57, 88, 49, 68, 35, 61, 7, 43, 80, 89, 17, 38, 13, 8, 22, 69, 28, 48, 47, 29, 62, 30, 32, 33, 31, 42, 36, 10, 4, 94, 63, 46, 45, 44, 14, 11, 41, 27, 59]
}
The buddies can be calculated quickly (in less than 5 minutes) for any given dataset, using the methods above. Specifically, we obtained the above buddies table by running the following three commands from the github link provided earlier:
```
python network_multicollinearity.py config_bigann_small
python network_blended.py config_deep_small
python network_blended.py config_msspacev_small
```
Implementation
We introduce the algorithm 'buddy-t1' for our Billion-Scale ANN Search Challenge entry. buddy-t1 was implemented by making a copy of the faiss-t1 algorithm and modifying it to include BuddyPQ subvector assignment. The implementation makes use of a pre-calculated arrangement of vector dimensions using the techniques described in the Methods section above. This results in one small function:
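The original snippet is not reproduced here; the following is a minimal sketch of what buddy_up looks like, reconstructed from the description below (the exact implementation in the repository may differ).

```python
import numpy as np

def buddy_up(points, dataset_prefix):
    # Look up the precomputed dimension arrangement for this dataset
    # (e.g. 'bigann', 'deep', 'msspacev') in the buddies table above.
    friends = multicollinearity_buddies[dataset_prefix]
    # Rebroadcast the columns into the buddy order; Faiss requires a
    # C-contiguous array, so the rearranged view is copied in memory.
    return np.ascontiguousarray(points[:, friends])
```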
In the buddy_up function above, the friends variable is obtained by looking up the predefined arrangement in the multicollinearity_buddies table found in the previous section.
The numpy array of points is then rebroadcast to the arrangement in friends, and must be recopied in memory to be C-contiguous for use with Faiss. This re-copy process introduces overhead for both indexing and querying (as queries must be rearranged to the same subvector configuration used during indexing). However, the results still showed increased queries per second for range queries over unmodified faiss-t1.
Note that the variable dataset_prefix uses only the name of the dataset, and not the size (for example 'bigann', not 'bigann-10M').
Results
We are able to show that buddy-t1 introduces a significant improvement over baseline faiss-t1 for three datasets in the 10-million-point category. We were unable to show improvement for any 1B-sized dataset; we do not yet know why and are currently investigating. Even so, our results are impressive for smaller datasets, with some query configurations improving recall by over 12%!
The following bar charts show recall comparisons for the datasets BIGANN-10M, DEEP-10M, and MSSPACEV-10M, respectively. buddy-t1 is shown in blue and unmodified faiss-t1 in orange. Though we were only able to choose 10 query parameters for each dataset for the competition, we showed significant improvement for more than 10 combinations for each chosen dataset. Higher bars are better.
Plotted as QPS vs. recall, buddy-t1 clearly outperforms faiss-t1 for our three 10M-sized datasets:
Conclusion and further work
While we were unable to improve upon a 1-billion-point dataset with our method, we see significant value in using this general technique as a vector preprocessing step for Product Quantization on datasets in the 10M or 100M size range. Also, the buddy selection is currently only semi-optimized, given that we ran out of time in the competition. With further work and iteration, a more ideal buddy selection for subvectors can be found for each dataset, with the potential for even better recall.
If you would like to run the code yourself, it is maintained here: https://github.com/DmitryKey/big-ann