Efficiently Search Your Multidimensional Data With KDTrees

Testing SciPy and Scikit-learn implementations of the KDTree algorithm

A while back, I wrote a series of posts about a tool I designed to sort my photos by date/time and location using the available metadata. In that project, I used a PostGIS database, partly to get more experience working with it. But some of the tasks could have been done more simply with k-dimensional trees, or “k-d trees” for short.

This article will look at k-d trees and compare two commonly used Python implementations.

At its core, the k-dimensional tree is a binary tree in which, at each node, the algorithm generates a hyperplane that splits the data into two parts.

When given data to partition, the algorithm normally continues until every node is pure, i.e., contains only a single data point, or until a stopping condition is reached. A commonly used stopping condition is a maximum size for the leaf nodes.

The initial partitioning of the data into a tree structure takes O(n log n) operations on average. On the other hand, once the k-d tree is built, searching for a data point takes only O(log n) operations on average (degrading towards O(n) in the worst case).

Therefore, when using a k-dimensional tree, we pay an upfront cost to build the partitioning object, but after that, queries are extremely fast.

The two most commonly used implementations of k-d trees in Python are in SciPy and Scikit-learn libraries. To begin, we’ll import the necessary libraries. Note that I must alias the k-d tree classes from different libraries to avoid namespace conflicts.
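
Here is a minimal sketch of those imports. The scTree alias matches the snippet used later in the post; skTree is my own naming for the Scikit-learn class:

import numpy as np
import pandas as pd
from scipy.spatial import KDTree as scTree      # SciPy implementation
from sklearn.neighbors import KDTree as skTree  # Scikit-learn implementation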

In this example, I’ll build the tree using data on the locations of ~42,000 cities around the world, which is included in the GitHub repository linked below.
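
Loading the data could look like the sketch below; the file name worldcities.csv is my assumption about how the CSV is named in the repo:

# Load the cities dataset (assumed file name) and inspect it
df = pd.read_csv('worldcities.csv')
print(df.shape)
print(df.columns)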

And this is the output:

(42905, 11)

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
'admin_name', 'capital', 'population', 'id'],
dtype='object')

To create a tree, we need to instantiate a class and provide it with an array of numerical data of shape [n_samples, n_attributes], which we get by extracting the values of the “lat” and “lng” columns from the dataframe.

# Extract the coordinates as an (n_samples, 2) NumPy array
coord = df.loc[:, ['lat', 'lng']].values

And now, using this data, we can create a tree instance using the SciPy version of the algorithm, as shown below:

# leafsize=1 keeps splitting until every leaf holds a single point
scp_tree = scTree(coord, leafsize=1)

Building the tree with this data should take less than a second. Once the process is finished, we have an instance of a k-dimensional tree where city locations are mapped to the tree's nodes.

We can use this instance to look up as many values as we like using the query method. The syntax differs slightly between the Scikit-learn and SciPy implementations. For the SciPy version, we need to provide an array of lookup values of shape [n_samples, n_attributes], making sure that the second dimension matches that of the data used to build the tree.

We can optionally provide other parameters:

  • k — the number of nearest neighbours to return; defaults to 1
  • eps — approximation factor: the k-th returned neighbour is guaranteed to be no further than (1 + eps) times the distance to the true k-th nearest neighbour
  • p — Minkowski p-norm; defaults to 2, which corresponds to the regular Euclidean distance
  • distance_upper_bound — the maximum distance at which neighbours are returned
  • workers — the number of parallel processes to use

In this case, I’ll use the default values for all of the above. Now, we can create an array of the points we want to look up and run a query:
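
Here is a sketch of such a query; the lookup coordinates are a hypothetical point I picked in northern Virginia:

# Hypothetical lookup point as a (1, 2) array of [lat, lng]
lookup = np.array([[38.5, -78.0]])
dist, ind = scp_tree.query(lookup)
print(dist, ind)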

The query method returns a tuple of two arrays: the first holds the distance(s) to the neighbour(s), and the second holds the index of each neighbour in the original array used to build the tree.

We can now query the source dataframe for the returned index value to find the closest town. Here’s how to do that:
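
A sketch of that lookup, reusing the ind array from the query above:

# ind[0] is the row position of the nearest city in the source dataframe
print(df.iloc[ind[0]])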

In this case, the closest town is Culpeper, VA.

So which of the two most common Python implementations of k-d trees should you use? There are slight differences in the interface and available parameters, so you can choose the one that better suits your needs.

We can also compare the speed of both options. We’ll first test the time required to build the tree:
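
A rough timing sketch using time.perf_counter; the original notebook may well use timeit instead, and the skTree alias and leaf_size value are my assumptions:

import time

# Time the SciPy tree build
start = time.perf_counter()
scp_tree = scTree(coord, leafsize=1)
scipy_build = time.perf_counter() - start

# Time the Scikit-learn tree build (note: leaf_size, not leafsize)
start = time.perf_counter()
skl_tree = skTree(coord, leaf_size=1)
sklearn_build = time.perf_counter() - start

print(f'Scipy: {scipy_build:.4f} seconds, Sklearn: {sklearn_build:.4f} seconds')
print(f'Sklearn is {sklearn_build / scipy_build:.2f} times slower')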

Here, the Scikit-learn implementation comes out about 1.6 times slower:

Scipy: 0.0114 seconds, Sklearn: 0.0181 seconds
Sklearn is 1.59 times slower

Similarly, we can compare the query performance:
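
Measured the same way (again only a sketch; a single query is fast enough that averaging many repetitions with timeit would be more reliable):

# Time a single nearest-neighbour lookup in each tree
start = time.perf_counter()
scp_tree.query(lookup)
scipy_query = time.perf_counter() - start

start = time.perf_counter()
skl_tree.query(lookup, k=1)
sklearn_query = time.perf_counter() - start

print(f'Query: Scipy: {scipy_query:.5f}, Sklearn: {sklearn_query:.5f}')
print(f'Sklearn is {sklearn_query / scipy_query:.2f} times slower')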

Once again, SciPy leaves Scikit-learn in the dust, and the gap even widens:

Query: Scipy: 0.00002, Sklearn: 0.00004
Sklearn is 1.98 times slower

The difference grows even bigger as the size of the tree increases. Therefore, if you want the best performance for your application, the SciPy implementation is the way to go.

K-dimensional trees are an extremely useful data structure, used in several database engines and a variety of applications. In the example above, I used them to search through geospatial data, but they work with any numerical data.

I hope you found the post useful. You can find the notebook in the GitHub repo below:
