Numpy: Searching indices

Feb 26, 2021

I was having a problem.

The Problem

I want to generate nice-looking diagrams of several time series. These time series can have gaps, and the gaps should not be connected, but the lines without gaps should be connected, obviously. Fortunately, plotly makes it very easy to create such a line:

fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="lines",
        name=name,
        connectgaps=False
    )
)

See, easy! Problem is, this doesn't work quite as I had hoped: Since x and y are series with corresponding coordinates, there are no gaps. When ploting time from, say, 150 to 155, I might have data points for x_p = [151, 152, 153, 155] with y-values y_p = [12, 14, 15, 16]. See, there are two gaps: one from the left-hand side and one between 153 and 155. The left-hand-side gap isn't drawn, and that's fine. The gap between 153 and 155 is drawn, since the line-drawer connects all points.

For plotly, the data would have to look like this:

x = [150, 151, 152, 153, 154, 155]
y = [None, 12, 14, 15, None, 16]

So this is my problem: given a grid and a series of points with values on that grid, how do I create an array like y above, that has a value for each point in the grid if such a value exists.

Solution 1

My good friend Plantprogrammer suggested I use numpy.where. It took us a while to figure out how to do that, but here's what we came up with:

y = np.full(x.shape, np.nan)

indices = np.where(np.in1d(x,x_p))   # <---- this is the important part

y[indices] = y_p

Well, actually, reading the documentation again and a bit more careful this time, the documentation recommends we use nonzero instead of whereWhy does this example work?
In a previous version of the documentation on where, there's this clause: "numpy.where(condition[, x, y]) Return elements, either from x or y, depending on condition. [...] If only condition is given, return the tuple condition.nonzero(), the indices where condition is True."
That means, if we pass into where only a condition, instead of selecting between two arrays, it'll autofill an index array and an array of NaNs. Which is exactly what we need.
I don't know why I had that version open. .

y = np.full(x.shape, np.nan)

indices = np.asarray(np.in1d(x,x_p)).nonzero()

y[indices] = y_p

Why does this work? Well, nonzero. documentation for nonzero: https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html does this: "Return the indices of the elements that are non-zero.". Which is perfect for our case: we're looking for the indices that fulfill in1d(x, x_p), ie. is this value in the grid.

So that works.

Solution 2

Instead of doing a where/nonzero, the fine folks at StackOverflow suggested another numpy verb: searchsorted.documentation for searchsorted: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html Here's what that does: "Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved." Not super-clear what that means, but for each value in your array x_p it gives you the index of the element that is just larger than the element. However, the input array must be sorted.

Well, that's perfect for our case, isn't it? We have a sorted grid, we have points exactly on the grid, so doing a searchsorted will give us the exact index positions of the values on the grid.

The code looks like this:

y = np.full(x.shape, np.nan)

indices = np.searchsorted(x, x_p)

y[indices] = y_p

And that works, too.

More problem

A person that has one clock always knows the time. A person that has two clocks never knows the time.

And now we have two solutions, but which one is better? Well, only one way to find out: we try it.

I generated some test data with a decent size and ran %timeit on each of the versions.

delta = end_date - begin_date
seconds = delta.seconds + delta.days * 86400

# one-minute grid of datetimes, in case you didn't immediately see that!
x = np.datetime64(begin_date) + np.arange(0, seconds, 60)

indices = ["lots", "of", "indices"]
x_p = x[indices]

print(x.shape, x_p.shape)

print("searchsorted")
%timeit np.searchsorted(x, x_p)
print("where")
%timeit np.where(np.in1d(x,x_p))
print("nonzero")
%timeit np.asarray(np.in1d(x,x_p)).nonzero()

And here's the output:

(10140,) (3716,)
searchsorted
56.2 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
where
410 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
nonzero
401 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And by the way, %timeit is one of those fantastic little tools that you want to use all the time once you know what it does. The macro will take a snippet of code and run it multiple times, trying to get a good timing on an average execution. Faster code will be executed more often, slower code less often, maybe only once. Read more in the documentation on timeit the Python standard library module and the documentation in %timeit the ipython macro

That's a pretty clear winner: searchsorted is about seven times faster. And of course it is, both where and nonzero have to search through x for each element in x_p while searchsorted only has to iterate through x once.

We could have assumed that before, but now we know.

This post was co-authored by Plantprogrammer