Linear max gap

December 1, 2022
Yang Yan

A friend Sanjeev introduced the following problem over Thanksgiving weekend:

Given a list of unsorted integers $A_1,\ldots,A_N$, find in $O(N)$ the maximum difference between consecutive integers of the sorted list.

To invalidate a naive bucket or radix sort, assume each $A_i$ is up to $4^N$, such that a radix sort runs in $O(N^2/\ln N)$ and the bucket sort in $O(4^N)$. However, with some suspension of disbelief, assume that other basic arithmetic and comparisons are constant time.
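To spell out the radix-sort cost, assuming a standard base-$N$ LSD sort: keys up to $4^N$ have $\log_N 4^N = 2N/\log_2 N$ digits, so the sort makes $O(N/\ln N)$ passes of $O(N)$ work each, for $O(N^2/\ln N)$ in total.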
Anyway, I detail my approach below.

Iterative ideation

I began the problem thinking through iterative solutions, and this came to mind:

For each prefix $A_1,\ldots,A_i$, track the max gap $G$ for the integers of that prefix, alongside $A_{min}$ and $A_{max}$. For $A_{i+1}$, if it falls before $A_{min}$ or after $A_{max}$, update $A_{min}$ or $A_{max}$, and potentially update $G$ if either $A_{min}-A_{i+1}$ or $A_{i+1}-A_{max}$ is greater. Otherwise, the max gap $G$ cannot increase.

Of course, it's easy to spot the error here: should $A_{min}\leq A_{i+1}\leq A_{max}$, $G$ cannot increase, but it can decrease. Unfortunately, we don't have the information to determine whether or not this happens.
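To make the failure concrete, here's a minimal sketch of the broken prefix rule (the function name and structure are my own):

```python
def max_gap_prefix(A):
    """Flawed O(N) attempt: track only (min, max, G) per prefix."""
    lo = hi = A[0]
    G = 0
    for a in A[1:]:
        if a < lo:
            # a is the new minimum; the gap to the old minimum is real
            G = max(G, lo - a)
            lo = a
        elif a > hi:
            # a is the new maximum; the gap to the old maximum is real
            G = max(G, a - hi)
            hi = a
        # else: a falls inside [lo, hi]; the true max gap may shrink,
        # but we no longer know the other gaps of the prefix.
    return G
```

For example, `max_gap_prefix([0, 10, 5])` returns 10, while the true answer for the sorted list $[0, 5, 10]$ is 5.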

Intuitively, we need information about all the other gaps of the prefix to determine whether $G$ decreases with $A_{i+1}$. Looking back, most iterative solutions are some form of DP, and to stay within $O(N)$ over $N$ iterations, we can keep at most $O(1)$ information per state. So DP approaches are probably impossible.

Square root decomposition

This is a fairly common CP technique, seen most often in Mo's algorithm-like scenarios. I'm not too confident about this approach, though, because the final runtime usually involves a $\sqrt{N}$ factor.

The core intuition behind Mo's algorithm is that each of the $\sqrt{N}$ buckets covers a range of $\sqrt{M}$ rather than $M$, and whatever updates need to happen are linear in the range. So processing a bucket ends up being $O(\sqrt{NM})$, with all bucket processing being $O(N\sqrt{M})$, which usually dominates the cost of resetting buckets $O(\sqrt{N})$ times.
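In symbols: $\sqrt{N}$ buckets at $O(\sqrt{NM})$ each gives $\sqrt{N}\cdot\sqrt{NM}=\sqrt{N^2M}=N\sqrt{M}$, hence $O(N\sqrt{M})$ overall.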

The problem here is that processing a number hardly depends on the range of the number; it depends on the count of numbers already processed in that bucket. I fail to see a solution to this.

Back to iterative

I make the mistake of jumping back into DP at this point. We get the initial gap $G=A_{\text{second min}}-A_{min}$ for free. We can then partition the range $M=A_{max}-A_{min}$ of the numbers into buckets of size $G$. Numbers within the same bucket don't need to be considered, and provided a reasonable distribution of $A_i$, we expect around $O(N)$ buckets. This means we can hash the numbers into buckets in $O(N)$ time. The assumption on the distribution is a red herring we may be able to resolve later with randomization.
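As a sketch of the hashing step (this helper is hypothetical, and assumes $G > 0$):

```python
def bucketize(A, G):
    """Hash each number to a width-G bucket keyed by (a - min) // G."""
    lo = min(A)
    buckets = {}
    for a in A:
        buckets.setdefault((a - lo) // G, []).append(a)
    return buckets
```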

I also make the mistake of jumping back into iterative procedures. I’m not sure where I was going with the following proposal—there’s a lot wrong with it.

Given the current max gap $G$ and a current left pivot $A_{lp}$, assume that the max gap is either $G$ or greater, or occurs to the right of $A_{lp}$. To begin, swap $A_{min}$ to the beginning of the array and set it as the left pivot, with the initial gap $G=A_{\text{second min}}-A_{min}$.

For each new number $A_{i+1}$: if $A_{i+1}\geq A_{lp}+G$, hash $A_{i+1}$ to its bucket. Otherwise, $A_{i+1}$ is close to $A_{lp}$, and the max gap cannot occur between $A_{lp}$ and $A_{i+1}$ by our assumptions, so update $A_{lp}$ to $A_{i+1}$. Updates may need to be cascaded: after one update to $A_{lp}$, consult the series of buckets up to and including $A_{lp}+G$ and update $A_{lp}$ to the largest number in those buckets.

I think the update cascade procedure may be correct: we cascade through each $A_i$ at most $O(1)$ times, and only move $A_{lp}$ if we're sure that numbers before the new $A_{lp}$ cannot increase the max gap. However, among other things, we run into a problem after we've cascaded. It is now possible that $A_{i+1}$ occurs before $A_{lp}$, invalidating our $G$ and decreasing the max gap, which is the same core problem we ran into before.

Intuitively, this makes a lot of sense, and it is my error for not noticing it earlier. Again, we attempt to keep a constant amount of state per iteration, and again we are brought down by the possibility that we need to access an $O(N)$ history of max gaps in a given iteration.

Buckets

At this point I'm taking notice of the nice property of buckets, especially when we have $O(N)$ of them and can hash into them for free. In considering the expected initial gap for the previous approach, I also notice that the max gap is very rarely equal to $M/N$, and is usually much larger (the $N-1$ gaps of the sorted list sum to $M$, so by pigeonhole the max gap is at least $M/(N-1)$). We can settle this by simply partitioning $M$ into $N$ buckets: if each bucket has exactly one number, we are done, as we've essentially bucket sorted. Otherwise, the max gap is greater than $M/N$.

In fact, we cannot have the max gap occur within any single bucket, since any gap inside a bucket is smaller than the bucket width $M/N$. So we don't need all the numbers within every bucket, only two numbers for the min/max of the bucket. As the max gap cannot occur within a bucket, we need only check the max of each bucket against the min of the next non-empty bucket. So, we are done:

Partition the range $M=A_{max}-A_{min}$ into $N$ buckets, and hash each number into its bucket in $O(N)$ total. Track the min/max $B_i,C_i$ of each bucket and simply take the max of all $B_{i+1}-C_i$ (skipping empty buckets, i.e. comparing each bucket's min against the previous non-empty bucket's max) as the max gap.
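Putting it together, here's a minimal sketch in Python (the details, like using $N$ buckets plus one overflow slot, are one choice among many):

```python
def max_gap(A):
    """Max difference between consecutive elements of sorted(A), in O(N)."""
    n = len(A)
    if n < 2:
        return 0
    lo, hi = min(A), max(A)
    if lo == hi:
        return 0
    # n buckets of width (hi - lo) / n; the max gap is at least
    # (hi - lo) / (n - 1) > width, so it never falls inside one bucket.
    mins = [None] * (n + 1)  # one extra slot: hi itself lands at index n
    maxs = [None] * (n + 1)
    for a in A:
        b = (a - lo) * n // (hi - lo)  # O(1) hash to a bucket index
        mins[b] = a if mins[b] is None else min(mins[b], a)
        maxs[b] = a if maxs[b] is None else max(maxs[b], a)
    best, prev_max = 0, lo
    for b in range(n + 1):
        if mins[b] is None:  # empty bucket: the running gap spans it
            continue
        best = max(best, mins[b] - prev_max)
        prev_max = maxs[b]
    return best
```

For example, `max_gap([3, 6, 9, 1])` returns 3, matching the gaps of the sorted list $[1, 3, 6, 9]$.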

Reflection

The final solution is surprisingly simple and I am ashamed it took me so long to get it. I took a long twenty-minute detour the second time I considered iterative solutions, though it did bring me marginally closer to the bucket solution. In hindsight, it should have been immediately obvious, based on the state information argument, that any naive iterative solution is bound to fail.

Square root decomposition should have gotten me closer than it did. Any bucketing approach (both my solution and Mo's algorithm) relies on the fact that buckets reduce the range of the problem. Though range doesn't factor into the runtime here, range limits which numbers are relevant, and immediately decreases the state space of each bucket to a constant.

In fact, I think bucketing solutions are likely quite flexible (I see them around a lot, e.g. in Bloom filters and van Emde Boas trees, if I recall correctly). The core intuition is that either the runtime or the information state of a subproblem is tied to its range, and buckets reduce that range, essentially for free most of the time.


This problem took me around 40 minutes of focused airplane time to solve, which is a bit longer than I'd have hoped, considering the simplicity of the solution. I've been reflecting on this a bit: it's not quite pride, nor honor, nor interview preparation, nor an interest in research, that seems to push me back to math and algorithms every so often. A mentor Richard once said to me:

Suppose they put me in a room, and each time I solved the problem on screen, a green check mark would light up (this part is very important), and they would give me a bit of food. I would be pretty happy.

Well, Richard is a PhD (or postdoc (or professor?)) now. And I still can't say I've found very many things that make me happier than those green check marks. I think, at some level, I'd much rather work in the space of well-defined technical problems than in spaces which seem to smell of nepotism, privilege, appearance, and chance. But these are big words, and these spaces are big spaces whose boundaries are at least quite differentiable. I think I just want to feel in control of my own destiny, and this problem made me feel that way.


I’ve been playing around with Github Copilot recently on recommendation of a friend Tony, and found it to be surprisingly compentent, but ultimately useless, while writing this post:


Close, Copilot.


Sanjeev also mentions a variant of this problem, but for the min gap, which may require a randomized solution. I’ll tackle this later.