Problem 691: Long Substring with Many Repetitions
View on Project EulerProject Euler Problem 691 Solution
EulerSolve provides an optimized solution for Project Euler Problem 691, Long Substring with Many Repetitions, with C++, Python, Java, and a step-by-step mathematical explanation.
Problem Summary For \(0 \le i \lt n\), define $a_i=\operatorname{popcount}(i)\bmod 2,\qquad b_i=\left\lfloor\frac{i+1}{\varphi}\right\rfloor-\left\lfloor\frac{i}{\varphi}\right\rfloor,\qquad c_i=a_i\oplus b_i,$ where \(\varphi=\frac{1+\sqrt{5}}{2}\). The binary word of interest is $C_n=c_0c_1\dots c_{n-1}.$ For each \(k\ge 1\), let \(L_n(k)\) be the maximum length of a substring of \(C_n\) that occurs at least \(k\) times, allowing overlaps. The task is to evaluate $\sum_{\substack{k\ge 1\\L_n(k)>0}} L_n(k)$ for \(n=5{,}000{,}000\). Since \(C_n\) has millions of positions and \(O(n^2)\) distinct substring intervals, the solution must compress the repeated-substring structure instead of enumerating substrings directly. Mathematical Approach Once the word \(C_n\) has been generated, the problem becomes a pure substring-frequency question for one fixed binary string. The decisive tool is a suffix automaton, because it stores all distinct substrings in linear space together with enough structure to recover how many times each substring appears. Step 1: Build the Binary Word The first ingredient, \(a_i\), is the parity of the number of \(1\)-bits in the binary expansion of \(i\), so it is the Thue-Morse bit at index \(i\). The second ingredient, \(b_i\), is a Beatty difference....
Detailed mathematical approach
Problem Summary
For \(0 \le i \lt n\), define
$a_i=\operatorname{popcount}(i)\bmod 2,\qquad b_i=\left\lfloor\frac{i+1}{\varphi}\right\rfloor-\left\lfloor\frac{i}{\varphi}\right\rfloor,\qquad c_i=a_i\oplus b_i,$
where \(\varphi=\frac{1+\sqrt{5}}{2}\). The binary word of interest is
$C_n=c_0c_1\dots c_{n-1}.$
For each \(k\ge 1\), let \(L_n(k)\) be the maximum length of a substring of \(C_n\) that occurs at least \(k\) times, allowing overlaps. The task is to evaluate
$\sum_{\substack{k\ge 1\\L_n(k)>0}} L_n(k)$
for \(n=5{,}000{,}000\). Since \(C_n\) has millions of positions and \(O(n^2)\) distinct substring intervals, the solution must compress the repeated-substring structure instead of enumerating substrings directly.
Mathematical Approach
Once the word \(C_n\) has been generated, the problem becomes a pure substring-frequency question for one fixed binary string. The decisive tool is a suffix automaton, because it stores all distinct substrings in linear space together with enough structure to recover how many times each substring appears.
Step 1: Build the Binary Word
The first ingredient, \(a_i\), is the parity of the number of \(1\)-bits in the binary expansion of \(i\), so it is the Thue-Morse bit at index \(i\). The second ingredient, \(b_i\), is a Beatty difference. Since \(\frac{1}{\varphi}=\varphi-1\in(0,1)\), the sequence
$\left\lfloor\frac{i}{\varphi}\right\rfloor$
can increase by at most \(1\) when \(i\) is incremented, hence
$b_i\in\{0,1\}.$
Therefore \(c_i=a_i\oplus b_i\) is again binary, and the entire input to the substring problem is a binary word \(C_n\). Generating the word costs only \(O(n)\) time, one position at a time.
Step 2: Compress All Substrings with a Suffix Automaton
A suffix automaton for \(C_n\) groups substrings by their set of end positions. Each state \(v\) represents an equivalence class of substrings that all end in exactly the same positions in \(C_n\). Let \(M(v)\) be the maximum length represented by state \(v\). If the suffix link of \(v\) points to \(u\), then the represented lengths form the interval
$M(u)+1,\ M(u)+2,\ \dots,\ M(v).$
So every distinct substring belongs to exactly one state, and within that state the longest candidate has length \(M(v)\). For a string of length \(n\), a suffix automaton has at most \(2n-1\) states, which is why the whole repeated-substring structure can be stored in linear size.
Step 3: Recover the Number of Occurrences of Each State
While the automaton is built online from left to right, each newly created non-clone state corresponds to one new suffix ending at the current position, so it starts with one end position. Clone states start from zero because they only split structure; they do not introduce a new occurrence by themselves.
After the full word has been inserted, the states are processed in decreasing order of \(M(v)\). When a state \(v\) contributes its total to its suffix-link parent, all longer substrings have already been counted, so the propagated value becomes the size of the end-position set for that entire class. Denoting this final count by \(R(v)\), every substring represented by \(v\) occurs exactly \(R(v)\) times in \(C_n\).
This count automatically includes overlapping repetitions, because different end positions are counted separately even if the corresponding substring occurrences overlap in the word.
Step 4: Convert State Data into \(L_n(k)\)
Fix a state \(v\). Every substring represented by \(v\) occurs exactly \(R(v)\) times, but those substrings have several possible lengths inside the interval described above. For the quantity \(L_n(k)\), only the longest one matters, because among all substrings with the same repetition count we want the maximum possible length. Therefore state \(v\) contributes the candidate length \(M(v)\) to the bucket for the exact repetition count \(R(v)\).
If \(B(t)\) denotes the largest substring length seen among states with exact occurrence count \(t\), we update
$B(R(v))=\max\bigl(B(R(v)),M(v)\bigr).$
But the problem asks for “at least \(k\) times”, not “exactly \(k\) times”. So we transform exact counts into lower bounds by a suffix maximum:
$L_n(k)=\max_{t\ge k} B(t).$
A single pass from right to left over the array of counts computes all values \(L_n(1),L_n(2),\dots,L_n(n)\).
Step 5: Sum the Positive Values
Once the entire array \(L_n(k)\) is known, the required answer is simply
$S_n=\sum_{\substack{k\ge 1\\L_n(k)>0}} L_n(k).$
The sequence \(L_n(k)\) is non-increasing in \(k\): requiring more repetitions can never make the best repeated substring longer. Hence after the first zero appears, all later terms are also zero, so the final accumulation is straightforward.
Worked Example: \(n=10\)
For \(i=0,1,\dots,9\), the two source sequences are
$a_i=(0,1,1,0,1,0,0,1,1,0),$
$b_i=(0,1,0,1,1,0,1,0,1,1).$
Taking bitwise XOR gives
$C_{10}=0011001101.$
Now inspect repeated substrings:
\(L_{10}(1)=10\) because the whole word appears once. The substring \(00110\) appears twice, at positions \(1\) through \(5\) and \(5\) through \(9\) in one-based indexing, so \(L_{10}(2)=5\). The substring \(01\) appears three times, and no substring of length \(3\) appears three times, so \(L_{10}(3)=2\). Finally, single symbols appear five times each, giving \(L_{10}(4)=L_{10}(5)=1\), and for \(k\ge 6\) the value is \(0\).
Therefore
$L_{10}(k)=(10,5,2,1,1,0,\dots),$
and the sum of positive terms is
$10+5+2+1+1=19.$
This agrees with the small checkpoints used by the implementation, in particular \(L_{10}(2)=5\) and \(L_{10}(3)=2\).
How the Code Works
The C++, Python, and Java implementations generate the bits of \(C_n\) on the fly, so no separate quadratic substring table is ever materialized. Each new bit extends a suffix automaton over the binary alphabet \(\{0,1\}\); because the alphabet has size two, every state needs only two outgoing transitions, one suffix link, one maximum-length field, and one occurrence counter.
After the automaton is complete, the implementation counting-sorts states by their represented maximum length, then walks that order from longest to shortest so occurrence totals can be pushed through suffix links. Next it records, for each exact occurrence count, the largest maximum length attained by any state with that count. A backward suffix-maximum pass converts those exact-count buckets into the desired values \(L_n(k)\). The last step sums the positive entries and stops once they become zero.
Complexity Analysis
Let \(n\) be the length of the word. Building the binary sequence and extending the suffix automaton are both linear, and the automaton contains at most \(2n-1\) states. Sorting states by length with counting sort is \(O(n)\), propagating occurrence counts is \(O(n)\), and the final exact-count and suffix-maximum passes are also \(O(n)\). Thus the total running time is \(O(n)\), and the memory usage is \(O(n)\).
Footnotes and References
- Problem page: https://projecteuler.net/problem=691
- Suffix automaton overview: cp-algorithms — Suffix Automaton
- Beatty sequence: Wikipedia — Beatty sequence
- Thue-Morse sequence: Wikipedia — Thue-Morse sequence