Problem 300: Protein Folding
View on Project EulerProject Euler Problem 300 Solution
EulerSolve provides an optimized solution for Project Euler Problem 300, Protein Folding, with C++, Python, Java, and a step-by-step mathematical explanation.
Problem Summary For every binary HP sequence of length \(n\), we place the chain on the square lattice as a self-avoiding walk and ask for the maximum possible number of H-H contacts. Project Euler asks for the average of that maximum over all \(2^n\) sequences. The code solves the case $n=15.$ Mathematical Approach 1. Folding shapes are self-avoiding walks A fold is a self-avoiding walk $p_0,p_1,\dots,p_{n-1}\in\mathbb Z^2$ with unit steps and no repeated lattice site. Because translation and rotation do not change contact counts, the implementation fixes $p_0=(0,0),\qquad p_1=(1,0),$ and then performs a DFS over the remaining \(n-2\) steps. This removes the large symmetry factor before enumeration even starts. 2. A fold matters only through its contact map For a finished walk, the only geometric information relevant to scoring is which non-consecutive monomer indices end up adjacent on the lattice. For \(i<j\) with \(j\ge i+2\), define $C_{ij}=1 \iff |p_i-p_j|_1=1.$ This produces a contact map, which the code stores as a bitset over all pairs \((i,j)\) with \(j\ge i+2\). Different lattice walks can produce the same contact map, and then they have exactly the same score for every HP sequence. So the geometry phase ends by deduplicating maps with hashing....
Detailed mathematical approach
Problem Summary
For every binary HP sequence of length \(n\), we place the chain on the square lattice as a self-avoiding walk and ask for the maximum possible number of H-H contacts. Project Euler asks for the average of that maximum over all \(2^n\) sequences. The code solves the case
$n=15.$
Mathematical Approach
1. Folding shapes are self-avoiding walks
A fold is a self-avoiding walk
$p_0,p_1,\dots,p_{n-1}\in\mathbb Z^2$
with unit steps and no repeated lattice site. Because translation and rotation do not change contact counts, the implementation fixes
$p_0=(0,0),\qquad p_1=(1,0),$
and then performs a DFS over the remaining \(n-2\) steps. This removes the large symmetry factor before enumeration even starts.
2. A fold matters only through its contact map
For a finished walk, the only geometric information relevant to scoring is which non-consecutive monomer indices end up adjacent on the lattice. For \(i<j\) with \(j\ge i+2\), define
$C_{ij}=1 \iff |p_i-p_j|_1=1.$
This produces a contact map, which the code stores as a bitset over all pairs \((i,j)\) with \(j\ge i+2\). Different lattice walks can produce the same contact map, and then they have exactly the same score for every HP sequence. So the geometry phase ends by deduplicating maps with hashing.
For example, the solver finds only
$41$
distinct contact maps for \(n=8\), and
$12495$
distinct contact maps for \(n=15\).
3. Why consecutive H-H pairs are added separately
If positions \(i\) and \(i+1\) are both H, then they are always adjacent in the chain, no matter how the fold bends. Therefore that contribution is independent of the contact map. For an HP mask \(h\in\{0,1\}^n\), the number of consecutive H-H bonds is
$\operatorname{popcount}(h\ \&\ (h\gg 1)).$
This explains the extra term in the code:
$\text{total}(h,C)=\text{nonconsecutive\_contacts}(h,C)+\operatorname{popcount}(h\ \&\ (h\gg 1)).$
So the optimization over folds only needs to maximize the non-consecutive part.
4. Scoring one contact map by subset DP
Fix one contact map \(C\). The code converts it to adjacency bitmasks \(N_C(i)\), where bit \(j\) is set if \((i,j)\) is a non-consecutive contact in that fold. For an H-mask \(m\), define \(\text{score}_C(m)\) as the number of contact edges whose two endpoints are both H.
Instead of checking every contact edge from scratch for every mask, the implementation uses the recurrence
$\text{score}_C(m)=\text{score}_C(m\setminus\{i\})+\bigl|N_C(i)\cap(m\setminus\{i\})\bigr|,$
where \(i\) is the least significant set bit of \(m\). This works because when \(i\) is inserted last, exactly the contacts from \(i\) to already-present H positions are new. Every contact is counted once and only once.
5. Best fold for each HP sequence
Let \(B(m)\) be the best non-consecutive contact count over all unique maps:
$B(m)=\max_C \text{score}_C(m).$
The code maintains an array best[mask] and updates it while scanning all deduplicated contact maps. After that, the full optimal score for sequence mask \(m\) is
$B(m)+\operatorname{popcount}(m\ \&\ (m\gg 1)).$
6. Exact averaging over all \(2^n\) sequences
The final numerator is
$\sum_{m=0}^{2^n-1}\left(B(m)+\operatorname{popcount}(m\ \&\ (m\gg 1))\right).$
Since every HP sequence is equally likely, the desired expectation is
$\frac{1}{2^n}\sum_{m=0}^{2^n-1}\left(B(m)+\operatorname{popcount}(m\ \&\ (m\gg 1))\right).$
The denominator is a power of two, so the result is converted to an exact terminating binary-rational decimal by long division.
7. Checkpoints
The published checkpoint is
$n=8 \quad \Longrightarrow \quad \frac{850}{256}=3.3203125.$
The full solver for \(n=15\) returns
$8.0540771484375.$
How the Code Works
dfs(...) enumerates all self-avoiding walks after fixing the first edge. add_current_contact_map() converts one finished walk to a packed bit key and inserts it into a set. Then each unique map is turned into adjacency masks, and a subset DP computes \(\text{score}_C(m)\) for all \(m\). The array best stores the maximum over all maps, after which the sequence-independent consecutive-HH term is added and averaged.
Complexity Analysis
If \(M\) is the number of unique contact maps, the scoring phase costs
$O(M\cdot 2^n)$
with very small constants because every update is just a few bit operations. Memory is
$O(M+2^n).$
The expensive combinatorial step is the self-avoiding walk enumeration, but for \(n=15\) it is still manageable once symmetry is fixed and maps are deduplicated.
Further Reading
- Problem page: https://projecteuler.net/problem=300
- Self-avoiding walk: https://en.wikipedia.org/wiki/Self-avoiding_walk
- HP lattice model: https://en.wikipedia.org/wiki/HP_model