This program implements several methods for solving the cosine-similarity-based K-Nearest Neighbor Graph (k-NNG) construction problem, including Greedy Filtering [3], Maxscore [2], BMM [2], kIdxJoin [1], kL2AP [1], L2Knng [1], and L2KnngApprox [1]. Details of the methods can be found in [1].
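For reference, the k-NNG problem asks, for each object in a collection, for the k other objects with the highest cosine similarity to it. The following is a minimal brute-force sketch of that definition in Python (illustrative only; the names are mine and the code is not part of this program), which the methods above solve far more efficiently:

    import numpy as np

    def cosine_knng(X, k):
        # Normalize rows to unit L2 norm so dot products equal cosine similarities.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        Xn = X / np.where(norms == 0, 1.0, norms)
        S = Xn @ Xn.T                 # pairwise cosine similarities
        np.fill_diagonal(S, -np.inf)  # exclude self-similarity
        # For each row, keep the indices of its k most similar other rows.
        return np.argsort(-S, axis=1)[:, :k]

    # Example: the k-NNG of 5 random 8-dimensional vectors, with k = 2.
    X = np.random.rand(5, 8)
    print(cosine_knng(X, 2))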
Contact me via email if you need additional information or find any bugs: david [period1] anastasiu [atSign] sjsu [period2] edu.

Please cite the following paper if you make use of this program or any of its components in your research.

David C. Anastasiu and George Karypis. L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning. In 24th ACM International Conference on Information and Knowledge Management, CIKM '15, 2015.

@inproceedings{anastasiu2015b,
    author = {Anastasiu, David C. and Karypis, George},
    title = {L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning},
    booktitle = {24th ACM International Conference on Information and Knowledge Management},
    series = {CIKM '15},
    year = {2015},
    location = {Melbourne, Australia},
    numpages = {10},
    doi = {http://dx.doi.org/10.1145/2806416.2806534}
}

References:

[1] David C. Anastasiu and George Karypis. L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning. In 24th ACM International Conference on Information and Knowledge Management, CIKM '15, 2015.
[2] Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 113-122, New York, NY, USA, 2013. ACM.
[3] Youngki Park, Sungchan Park, Sang-goo Lee, and Woosung Jung. Greedy filtering: A scalable algorithm for k-nearest neighbor graph construction. In Database Systems for Advanced Applications, volume 8421 of Lecture Notes in Computer Science, pages 327-341. Springer-Verlag, 2014.
[4] Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 577-586, New York, NY, USA, 2011. ACM.

Erratum:

The experiment results we published in [1] for the NN-Descent baseline [4] were accidentally executed in parallel, with 8 threads, instead of serially. Since our methods and all other baselines are serial programs, we have re-executed the affected experiments using the same version of the NN-Descent library (v. 1.2) and the same computing environment as in [1]. We detail the updated results below.

Candidate pool size parameter analysis (Section 6.1.1 in [1])

In this experiment, we compared the recall and execution time of our approximate method, L2KnngApprox, against the baselines NN-Descent and Greedy Filtering, as the candidate pool size parameter μ increases from k to 10k, in increments of k, where k is the number of desired neighbors in each neighborhood. We show experiment results for two datasets, RCV1-400k and WW200-250k, and two k values, 50 and 100. Figure 1 below is the equivalent of Figure 1 from [1]. NN-Descent took considerably more time than the other methods to complete the graph construction, making the results of the other methods hard to decipher. Figure 1b therefore shows the same results, excluding the NN-Descent method.

Figure 1: Recall and execution time of approximate methods given increasing candidate pool sizes.
Figure 1b: Same as Figure 1, excluding NN-Descent results.

Figures 1 and 1b plot recall versus execution time for our experiment results. For all methods, results for μ = k are marked with a "-" label, and those for μ = 10k with a "+" label. The best results are those points in the lower-right corner of each quadrant in the figure, achieving high recall in a short amount of time.

The conclusions in [1] remain valid given the updated experiment results:

Methods generally exhibit higher recall and higher execution time for larger μ values. L2KnngApprox_0 takes much less time to execute than Greedy Filtering and, given a large enough μ, can achieve similar or higher recall. Both L2KnngApprox and Greedy Filtering require larger μ values than NN-Descent to achieve high recall. Yet NN-Descent's recall does not improve much as μ increases. L2KnngApprox_3 is able to outperform both competitors, with regard to both time and recall, for a large enough μ.

L2KnngApprox efficiency (Section 6.1.2 in [1])

The research focus in [1] was building the exact k-nearest neighbor graph (in the case of L2Knng) or a nearly exact one (in the case of L2KnngApprox), i.e., one with high average recall of at least 0.95. In this experiment, we compare the execution times needed to achieve at least 0.95 average recall by each of the approximate methods, for k ranging from 10 to 100, over four datasets. Each method was executed under a wide range of input parameters, and we report, for each method, the best time for each k value at which recall was at least 0.95. We also include the times for our exact variant, L2Knng, as a point of comparison.
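For reference, average recall here is the mean, over all objects, of the fraction of an object's k true nearest neighbors that are present in its approximate neighborhood. A minimal sketch of this computation (illustrative only; the names are mine and the code is not part of this program):

    def average_recall(approx, exact):
        # approx, exact: one set of neighbor ids per object, each of size k.
        # Per-object recall: fraction of true neighbors found; report the mean.
        return sum(len(a & e) / len(e) for a, e in zip(approx, exact)) / len(exact)

    # Example with 2 objects and k = 3: recalls are 2/3 and 3/3, mean 0.8333.
    approx = [{1, 2, 5}, {0, 3, 4}]
    exact  = [{1, 2, 3}, {0, 3, 4}]
    print(average_recall(approx, exact))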

Figure 2: Approximate k-NNG construction efficiency.

As in [1], we were not able to achieve high enough recall for NN-Descent for the WW200 dataset. As such, NN-Descent results are not included in the upper left quadrant of the figure. In each quadrant of the figure, smaller values represent better results. Note that the total time values (y-axis) are log-scaled.

The conclusions in [1] remain valid with regard to the comparison with the Greedy Filtering method. Additionally, our method, L2KnngApprox, outperforms NN-Descent in all cases.