Problem 361: Subsequence of Thue-Morse Sequence
View on Project EulerProject Euler Problem 361 Solution
EulerSolve provides an optimized solution for Project Euler Problem 361, Subsequence of Thue-Morse Sequence, with C++, Python, Java, and a step-by-step mathematical explanation.
Problem Summary Let \(t=0110100110010110\ldots\) be the infinite Thue-Morse word, the fixed point of the morphism \(\mu:0\mapsto01,\;1\mapsto10\). For each length \(\ell\), let \(T_\ell\) be the set of distinct factors of \(t\) of length \(\ell\). Sort \(T_\ell\) lexicographically, keep only the words whose first bit is \(1\), and then concatenate these filtered lists for \(\ell=1,2,3,\ldots\). Reading every chosen word as a binary integer produces the sequence \(A(1),A(2),\ldots\). The goal is to compute $\sum_{k=1}^{18} A(10^k) \pmod{10^9}.$ Mathematical Approach 1. Count the Number of Factors of Each Length Write $p(\ell)=|T_\ell|.$ The solution files use the standard Thue-Morse factor-complexity recurrence with the explicit base values $p(1)=2,\qquad p(2)=4,\qquad p(3)=6,$ and for \(n\ge 2\), $p(2n)=p(n)+p(n+1),\qquad p(2n+1)=2p(n+1).$ Now define the cumulative count $s(\ell)=\sum_{j=1}^{\ell} p(j).$ The code memoizes the derived recurrences $s(2n)=3s(n)+s(n+1)-8,\qquad s(2n+1)=4s(n)+3p(n+1)-8.$ Why does this help with the Project Euler sequence? Because the Thue-Morse word is closed under bitwise complement: if \(w\in T_\ell\), then \(\overline{w}\in T_\ell\). No nonempty binary word equals its own complement, so the factors of length \(\ell\) are paired into one word starting with \(0\) and one starting with \(1\)....
Detailed mathematical approach
Problem Summary
Let \(t=0110100110010110\ldots\) be the infinite Thue-Morse word, the fixed point of the morphism \(\mu:0\mapsto01,\;1\mapsto10\). For each length \(\ell\), let \(T_\ell\) be the set of distinct factors of \(t\) of length \(\ell\). Sort \(T_\ell\) lexicographically, keep only the words whose first bit is \(1\), and then concatenate these filtered lists for \(\ell=1,2,3,\ldots\). Reading every chosen word as a binary integer produces the sequence \(A(1),A(2),\ldots\). The goal is to compute
$\sum_{k=1}^{18} A(10^k) \pmod{10^9}.$
Mathematical Approach
1. Count the Number of Factors of Each Length
Write
$p(\ell)=|T_\ell|.$
The solution files use the standard Thue-Morse factor-complexity recurrence with the explicit base values
$p(1)=2,\qquad p(2)=4,\qquad p(3)=6,$
and for \(n\ge 2\),
$p(2n)=p(n)+p(n+1),\qquad p(2n+1)=2p(n+1).$
Now define the cumulative count
$s(\ell)=\sum_{j=1}^{\ell} p(j).$
The code memoizes the derived recurrences
$s(2n)=3s(n)+s(n+1)-8,\qquad s(2n+1)=4s(n)+3p(n+1)-8.$
Why does this help with the Project Euler sequence? Because the Thue-Morse word is closed under bitwise complement: if \(w\in T_\ell\), then \(\overline{w}\in T_\ell\). No nonempty binary word equals its own complement, so the factors of length \(\ell\) are paired into one word starting with \(0\) and one starting with \(1\). Hence exactly half of them start with \(1\), and the number of valid Euler terms up to length \(\ell\) is
$\operatorname{count\_upto}(\ell)=\frac{s(\ell)}{2}.$
2. Convert a Global Index into a Length and a Local Rank
Given \(N\), the program first finds the smallest length \(\ell\) such that
$\frac{s(\ell)}{2}\ge N.$
This is done by doubling an upper bound and then applying binary search. Once \(\ell\) is known, the rank of the desired word inside the filtered list for this fixed length is
$k=N-\operatorname{count\_upto}(\ell-1).$
Inside the full lexicographic list of all words in \(T_\ell\), the leading-\(0\) half comes first and the leading-\(1\) half comes second. Therefore the wanted factor is the
$\frac{p(\ell)}{2}+k$
th factor of length \(\ell\) in complete lexicographic order. This is the reason for the line
$\texttt{offset}=p(\ell)/2+k$
in all three implementations.
3. Reconstruct the Lexicographic Factor Order Recursively
The difficult part is selecting the \(\texttt{offset}\)-th factor of length \(\ell\) without listing all factors. The implementations stop recursion at lengths \(1,2,3\), where the factor sets are stored explicitly:
$T_1=\{0,1\},\qquad T_2=\{00,01,10,11\},\qquad T_3=\{001,010,011,100,101,110\}.$
For \(\ell\ge 4\), the solution uses the unique reading frame induced by the length-2 morphism \(\mu\). Every factor either starts on a block boundary of \(\mu\) or one symbol later, and that leads to three lexicographic classes called \(O1\), \(E\), and \(O0\) in the source code.
4. Even Length Factors
Let \(\ell=2n\). If a factor starts on an even boundary, it is exactly \(\mu(u)\) for a unique \(u\in T_n\). This gives the middle class
$E(u)=\mu(u),\qquad u\in T_n,$
with size \(p(n)\).
If the factor starts one position later, it begins inside one \(\mu\)-block and ends inside another. For an ancestor \(a=a_1a_2\cdots a_{n+1}\in T_{n+1}\), the odd-start construction used by the code is
$\operatorname{OE}(a)=(1-a_1)\,\mu(a_2a_3\cdots a_n)\,a_{n+1}.$
If \(a_1=1\), the result starts with \(0\); if \(a_1=0\), the result starts with \(1\). Thus the complete lexicographic order for length \(2n\) splits into three consecutive blocks:
$O1,\qquad E,\qquad O0,$
with sizes
$\frac{p(n+1)}{2},\qquad p(n),\qquad \frac{p(n+1)}{2}.$
A concrete example is \(\ell=4\), where the ten factors are
$0010,0011,0100 \mid 0101,0110,1001,1010 \mid 1011,1100,1101.$
The first block is \(O1\), the middle block is \(E\), and the last block is \(O0\).
5. Odd Length Factors
Let \(\ell=2n+1\). There are again two alignment types, but now an odd-start factor crosses only the left block boundary, while an even-start factor crosses only the right one. For \(a=a_1a_2\cdots a_{n+1}\in T_{n+1}\), the code uses
$\operatorname{OO}(a)=(1-a_1)\,\mu(a_2a_3\cdots a_{n+1}),$
$\operatorname{EO}(a)=\mu(a_1a_2\cdots a_n)\,a_{n+1}.$
Exactly as in the even case, the lexicographic order is
$O1,\qquad E,\qquad O0,$
but now the block sizes are
$\frac{p(n+1)}{2},\qquad p(n+1),\qquad \frac{p(n+1)}{2}.$
This is why the recursive selector only needs the values \(p(n)\) and \(p(n+1)\) to decide which branch contains the required factor.
6. Evaluate the Selected Word Modulo \(10^9\) Without Expanding It
The chosen factor can be very long, so the program never builds its ordinary binary integer directly. Instead it defines
$M=10^9,\qquad B_i=2^{2^i}\bmod M,$
and lets \(V_i(w)\) be the value of a binary word \(w\) when read in base \(B_i\). In particular, \(V_0(w)\) is exactly the ordinary binary value modulo \(M\).
Concatenation satisfies the standard rule
$V_i(uv)=V_i(u)\,B_i^{|v|}+V_i(v).$
For the morphism, one block \(\mu(c)\) has value
$V_i(\mu(c))=1+(B_i-1)c,\qquad c\in\{0,1\}.$
So for \(w=c_1c_2\cdots c_m\),
$V_i(\mu(w))=\sum_{j=1}^{m} B_i^{m-j}\bigl(1+(B_i-1)c_j\bigr).$
Separating the constant part from the digit-dependent part gives
$V_i(\mu(w))=G_{i+1}(m)+(B_i-1)V_{i+1}(w),$
where
$G_j(m)=\sum_{r=0}^{m-1} B_j^r.$
This identity is exactly what the cached MU node evaluation computes. The helper
pow_sum stores both \(B_i^m\) and \(G_i(m)\), so each evaluation step remains small and reusable.
How the Code Works
The C++, Python, and Java versions all follow the same structure. They memoize p_count and
s_count, use find_length_for_index to locate the correct length, and then call
select_node to build only a symbolic expression DAG for the requested factor. WordBuilder
has four node types: empty, leaf, concatenation, and morphism application. The helper methods
prefix, suffix, and middle implement the ancestor-to-descendant maps behind
the \(O1/E/O0\) split. Finally get_eval computes the cached values \(V_i\) for each node, and
compute_A_mod returns \(V_0\). The C++ file also contains sanity checks proving that the construction
reproduces \(A(100)=3251\) and \(A(1000)=80852364498\).
Complexity Analysis
Because every recurrence halves its argument, computing \(p(\ell)\) or \(s(\ell)\) touches only \(O(\log \ell)\)
memoized states. The length search for \(A(N)\) costs \(O(\log N)\) calls to count_upto. After that,
select_node descends through \(O(\log \ell)\) recursive levels, and the node evaluator works with a
fixed base table of size 64. So each query is handled in polylogarithmic time with \(O(\log \ell)\) symbolic
memory, which is dramatically smaller than generating all factors or building the full binary integer.
References
- Problem page: https://projecteuler.net/problem=361
- Thue-Morse sequence: Wikipedia — Thue-Morse sequence
- Subword complexity: Wikipedia — Subword complexity
- J.-P. Allouche and J. Shallit, Automatic Sequences, Cambridge University Press, 2003.