I was having a problem.

# The Problem

I want to generate nice-looking diagrams of several time series. These time series can have gaps, and the gaps should not be connected, but the lines without gaps should be connected, obviously. Fortunately, plotly makes it very easy to create such a line:

``````
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="lines",
        name=name,
        connectgaps=False,
    )
)
``````

See, easy! Problem is, this doesn't work quite as I had hoped: since `x` and `y` are series with corresponding coordinates, there are no gaps in them to begin with. When plotting time from, say, 150 to 155, I might have data points for `x_p = [151, 152, 153, 155]` with y-values `y_p = [12, 14, 15, 16]`. There are two gaps here: one at the left-hand side (150 is missing) and one between 153 and 155 (154 is missing). The left-hand-side gap isn't drawn, and that's fine. The gap between 153 and 155 *is* drawn, though, since the line-drawer simply connects all the points it is given.

For plotly, the data would have to look like this:

``````
x = [150, 151, 152, 153, 154, 155]
y = [None, 12, 14, 15, None, 16]
``````

So this is my problem: given a grid and a series of points with values on that grid, how do I create an array like `y` above, one that has a value for each point in the grid where such a value exists, and `None` (or NaN) everywhere else?

# Solution 1

My good friend Plantprogrammer suggested I use `numpy.where`. It took us a while to figure out how to do that, but here's what we came up with:

``````
y = np.full(x.shape, np.nan)

indices = np.where(np.in1d(x, x_p))   # <---- this is the important part

y[indices] = y_p
``````

Well, actually, reading the documentation again, and a bit more carefully this time, it recommends we use `nonzero` instead of `where`. Why does this version work at all? A previous version of the documentation on `where` (which is, for some reason, the one I had open) contains this clause: "`numpy.where(condition[, x, y])` Return elements, either from x or y, depending on condition. [...] If only condition is given, return the tuple condition.nonzero(), the indices where condition is True." That means: if we pass only a condition into `where`, instead of selecting between two arrays it simply returns the indices where the condition is true. Which is exactly what we need.
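To see that fallback in action, here's a tiny sketch (the mask values are made up for illustration):

```python
import numpy as np

# a boolean mask, like the one in1d(x, x_p) produces
mask = np.array([False, True, True, False, True])

# with only a condition, where() returns the same tuple as nonzero()
print(np.where(mask))    # (array([1, 2, 4]),)
print(mask.nonzero())    # (array([1, 2, 4]),)
```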

``````
y = np.full(x.shape, np.nan)

indices = np.asarray(np.in1d(x, x_p)).nonzero()

y[indices] = y_p
``````

Why does this work? Well, the documentation for `nonzero` (https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html) says: "Return the indices of the elements that are non-zero." Which is perfect for our case: we're looking for the indices at which `in1d(x, x_p)` is true, i.e. the positions in the grid that actually have a data point.

So that works.
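Plugged into the example numbers from the problem description above, the whole thing behaves like this:

```python
import numpy as np

x = np.array([150, 151, 152, 153, 154, 155])  # the full grid
x_p = np.array([151, 152, 153, 155])          # grid points we have values for
y_p = np.array([12, 14, 15, 16])              # the corresponding values

y = np.full(x.shape, np.nan)
# newer numpy versions prefer np.isin over np.in1d
indices = np.asarray(np.in1d(x, x_p)).nonzero()
y[indices] = y_p

print(y)  # [nan 12. 14. 15. nan 16.]
```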

# Solution 2

Instead of doing a `where`/`nonzero`, the fine folks at StackOverflow suggested another numpy verb: `searchsorted` (https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html). Here's what it does: "Find the indices into a sorted array `a` such that, if the corresponding elements in `v` were inserted before the indices, the order of `a` would be preserved." Not super clear what that means, but in short: for each value in `x_p` it gives you the index at which that value would have to be inserted into `x` to keep `x` sorted. The input array must already be sorted, though.

Well, that's perfect for our case, isn't it? We have a sorted grid, we have points exactly on the grid, so doing a `searchsorted` will give us the exact index positions of the values on the grid.

The code looks like this:

``````
y = np.full(x.shape, np.nan)

indices = np.searchsorted(x, x_p)

y[indices] = y_p
``````

And that works, too.
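On the same example numbers, `searchsorted` produces exactly the indices of our points on the grid. Note that this relies on every value of `x_p` actually lying on the grid: for an off-grid value, `searchsorted` would silently return an insertion index rather than complain.

```python
import numpy as np

x = np.array([150, 151, 152, 153, 154, 155])  # the full, sorted grid
x_p = np.array([151, 152, 153, 155])          # grid points we have values for
y_p = np.array([12, 14, 15, 16])              # the corresponding values

indices = np.searchsorted(x, x_p)
print(indices)  # [1 2 3 5]

y = np.full(x.shape, np.nan)
y[indices] = y_p
print(y)        # [nan 12. 14. 15. nan 16.]
```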

# More problem

A person that has one clock always knows the time. A person that has two clocks never knows the time.

And now we have two solutions, but which one is better? Well, only one way to find out: we try it.

I generated some test data with a decent size and ran `%timeit` on each of the versions.

``````
delta = end_date - begin_date
seconds = delta.seconds + delta.days * 86400

# one-minute grid of datetimes, in case you didn't immediately see that!
x = np.datetime64(begin_date) + np.arange(0, seconds, 60).astype("timedelta64[s]")

indices = ["lots", "of", "indices"]
x_p = x[indices]

print(x.shape, x_p.shape)

print("searchsorted")
%timeit np.searchsorted(x, x_p)
print("where")
%timeit np.where(np.in1d(x, x_p))
print("nonzero")
%timeit np.asarray(np.in1d(x, x_p)).nonzero()
``````

And here's the output:

``````
(10140,) (3716,)
searchsorted
56.2 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
where
410 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
nonzero
401 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
``````

And by the way, `%timeit` is one of those fantastic little tools you'll want to use all the time once you know what it does. The magic command takes a snippet of code and runs it multiple times, trying to get a good timing for an average execution. Faster code is executed more often, slower code less often, maybe only once. Read more in the documentation for `timeit`, the Python standard library module, and for `%timeit`, the IPython magic command.
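Outside of IPython, the standard-library `timeit` module gives you the same kind of measurement by hand. Here's a rough sketch; the array sizes are made up, not the ones from the benchmark above:

```python
import timeit

import numpy as np

x = np.arange(10_000)   # a sorted "grid"
x_p = x[::3]            # a subset of points on it

# run the snippet 1000 times and get the total wall time in seconds
total = timeit.timeit(lambda: np.searchsorted(x, x_p), number=1000)
print(f"{total / 1000 * 1e6:.1f} µs per loop")
```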

That's a pretty clear winner: `searchsorted` is about seven times faster. And of course it is: `where` and `nonzero` both have to check every element of `x` for membership in `x_p`, while `searchsorted` can exploit the sortedness of `x` and do a quick binary search for each element of `x_p`.

We could have assumed that before, but now we know.

This post was co-authored by Plantprogrammer.