# Numpy: Searching indices

I was having a problem.

# The Problem

I want to generate nice-looking diagrams of several time series. These time series can have gaps, and the gaps should not be connected, but the lines without gaps *should* be connected, obviously. Fortunately, plotly makes it very easy to create such a line:

```
fig.add_trace(
go.Scatter(
x=x,
y=y,
mode="lines",
name=name,
connectgaps=False
)
)
```

See, easy! Problem is, this doesn’t work quite as I had hoped: Since `x`

and `y`

are series with corresponding coordinates, there are no gaps. When ploting time from, say, 150 to 155, I might have data points for `x_p = [151, 152, 153, 155]`

with y-values `y_p = [12, 14, 15, 16]`

. See, there are two gaps: one from the left-hand side and one between 153 and 155. The left-hand-side gap isn’t drawn, and that’s fine. The gap between 153 and 155 *is* drawn, since the line-drawer connects all points.

For plotly, the data would have to look like this:

```
x = [150, 151, 152, 153, 154, 155]
y = [None, 12, 14, 15, None, 16]
```

So this is my problem: given a grid and a series of points with values on that grid, how do I create an array like `y`

above, that has a value for each point in the grid if such a value exists.

# Solution 1

My good friend Plantprogrammer suggested I use `numpy.where`

. It took us a while to figure out how to do that, but here’s what we came up with:

```
y = np.full(x.shape, np.nan)
indices = np.where(np.in1d(x,x_p)) # <---- this is the important part
y[indices] = y_p
```

Well, actually, reading the documentation again and a bit more careful this time, the documentation recommends we use `nonzero`

instead of `where`

Why does this example work?

In a previous version of the documentation on `where`

, there’s this clause: “`numpy.where(condition[, x, y])`

Return elements, either from x or y, depending on condition. […] If only condition is given, return the tuple condition.nonzero(), the indices where condition is True.”

That means, if we pass into `where`

only a condition, instead of selecting between two arrays, it’ll autofill an index array and an array of NaNs. Which is exactly what we need.

I don’t know why I had that version open.
.

```
y = np.full(x.shape, np.nan)
indices = np.asarray(np.in1d(x,x_p)).nonzero()
y[indices] = y_p
```

Why does this work? Well, `nonzero`

.
documentation for `nonzero`

: https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html does this: “Return the indices of the elements that are non-zero.”. Which is perfect for our case: we’re looking for the indices that fulfill `in1d(x, x_p)`

, ie. is this value in the grid.

So that works.

# Solution 2

Instead of doing a `where`

/`nonzero`

, the fine folks at StackOverflow suggested another numpy verb: `searchsorted`

.documentation for `searchsorted`

: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html Here’s what that does: “Find the indices into a sorted array `a`

such that, if the corresponding elements in `v`

were inserted before the indices, the order of `a`

would be preserved.” Not super-clear what that means, but for each value in your array `x_p`

it gives you the index of the element that is *just* larger than the element. However, the input array must be sorted.

Well, that’s perfect for our case, isn’t it? We have a sorted grid, we have points exactly on the grid, so doing a `searchsorted`

will give us the exact index positions of the values on the grid.

The code looks like this:

```
y = np.full(x.shape, np.nan)
indices = np.searchsorted(x, x_p)
y[indices] = y_p
```

And that works, too.

# More problem

A person that has one clock always knows the time. A person that has two clocks never knows the time.

And now we have two solutions, but which one is better? Well, only one way to find out: we try it.

I generated some test data with a decent size and ran `%timeit`

on each of the versions.

```
delta = end_date - begin_date
seconds = delta.seconds + delta.days * 86400
# one-minute grid of datetimes, in case you didn't immediately see that!
x = np.datetime64(begin_date) + np.arange(0, seconds, 60)
indices = ["lots", "of", "indices"]
x_p = x[indices]
print(x.shape, x_p.shape)
print("searchsorted")
%timeit np.searchsorted(x, x_p)
print("where")
%timeit np.where(np.in1d(x,x_p))
print("nonzero")
%timeit np.asarray(np.in1d(x,x_p)).nonzero()
```

And here’s the output:

```
(10140,) (3716,)
searchsorted
56.2 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
where
410 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
nonzero
401 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

And by the way, `%timeit`

is one of those fantastic little tools that you want to use all the time once you know what it does. The macro will take a snippet of code and run it multiple times, trying to get a good timing on an average execution. Faster code will be executed more often, slower code less often, maybe only once. Read more in the documentation on `timeit`

the Python standard library module and the documentation in `%timeit`

the ipython macro

That’s a pretty clear winner: `searchsorted`

is about seven times faster. And of course it is, both `where`

and `nonzero`

have to search through `x`

for each element in `x_p`

while `searchsorted`

only has to iterate through `x`

once.

We could have assumed that before, but now we know.

This post was co-authored by Plantprogrammer