Problem 521: Smallest Prime Factor
View on Project EulerProject Euler Problem 521 Solution
EulerSolve provides an optimized solution for Project Euler Problem 521, Smallest Prime Factor, with C++, Python, Java, and a step-by-step mathematical explanation.
Problem Summary For \(n\ge 2\), let \(s(n)\) be the smallest prime factor of \(n\). We must evaluate $S(N)=\sum_{n=2}^{N} s(n), \qquad N=10^{12},$ and return the result modulo \(10^9\). A direct sieve up to \(N\) would require linear memory and essentially linear time, so the implementation instead tracks how composite numbers disappear when primes are processed in increasing order. Mathematical Approach The key idea is that every composite number is removed exactly once, namely when its smallest prime factor is reached, while primes are never removed. The algorithm turns that observation into coupled recurrences for filtered counts and filtered sums. Step 1: Split the answer into prime and composite parts If \(n\) is prime, then \(s(n)=n\). If \(n\) is composite, then \(s(n)\le \sqrt n\le \sqrt N\). Therefore $S(N)=\sum_{\substack{q\le N\\ q\text{ prime}}} q+\sum_{p\le \sqrt N} p\,M_p(N),$ where \(M_p(N)\) denotes the number of composite integers \(n\le N\) whose smallest prime factor is exactly \(p\). So the entire task becomes: count those composites efficiently, and also recover the sum of all primes up to \(N\)....
Detailed mathematical approach
Problem Summary
For \(n\ge 2\), let \(s(n)\) be the smallest prime factor of \(n\). We must evaluate
$S(N)=\sum_{n=2}^{N} s(n), \qquad N=10^{12},$
and return the result modulo \(10^9\). A direct sieve up to \(N\) would require linear memory and essentially linear time, so the implementation instead tracks how composite numbers disappear when primes are processed in increasing order.
Mathematical Approach
The key idea is that every composite number is removed exactly once, namely when its smallest prime factor is reached, while primes are never removed. The algorithm turns that observation into coupled recurrences for filtered counts and filtered sums.
Step 1: Split the answer into prime and composite parts
If \(n\) is prime, then \(s(n)=n\). If \(n\) is composite, then \(s(n)\le \sqrt n\le \sqrt N\). Therefore
$S(N)=\sum_{\substack{q\le N\\ q\text{ prime}}} q+\sum_{p\le \sqrt N} p\,M_p(N),$
where \(M_p(N)\) denotes the number of composite integers \(n\le N\) whose smallest prime factor is exactly \(p\). So the entire task becomes: count those composites efficiently, and also recover the sum of all primes up to \(N\).
Step 2: Define the filtered set that survives before prime \(p\)
For a prime threshold \(p\), define the surviving set
$\mathcal{R}_p(x)=\left\{m\in\{2,\dots,x\}: s(m)\ge p\right\}\cup\left\{q\le x:q\text{ prime},\ q<p\right\}.$
This set contains two kinds of numbers: primes smaller than \(p\), which are kept forever, and numbers whose prime factors are all at least \(p\), which have not yet been removed. Now define
$C_p(x)=\#\mathcal{R}_p(x), \qquad T_p(x)=\sum_{m\in\mathcal{R}_p(x)} m.$
At the start no filtering has happened, so for \(p=2\) we simply have
$C_2(x)=x-1,\qquad T_2(x)=\sum_{m=2}^{x} m=\frac{x(x+1)}{2}-1.$
A useful consequence is that \(C_p(p)-C_p(p-1)=1\) exactly when \(p\) is prime, because every composite number has already disappeared before its own turn.
Step 3: Count composites whose smallest prime factor is \(p\)
Take a prime \(p\). Any composite \(n\le x\) with \(s(n)=p\) can be written uniquely as
$n=p\,m,$
where \(m\ge p\) and every prime factor of \(m\) is at least \(p\). Equivalently, \(m\) lies in \(\mathcal{R}_p(\lfloor x/p\rfloor)\), but the elements below \(p\) must be excluded because they are precisely the smaller primes. Hence
$M_p(x)=C_p\!\left(\left\lfloor\frac{x}{p}\right\rfloor\right)-C_p(p-1).$
The weighted contribution of all such composites is therefore
$p\,M_p(N)=p\left(C_p\!\left(\left\lfloor\frac{N}{p}\right\rfloor\right)-C_p(p-1)\right).$
There is no extra \(+p\) term here because the prime \(p\) itself remains in the filtered sum and is added later together with the other primes.
Step 4: Update the filtered counts and filtered sums
After the composites with smallest prime factor \(p\) have been identified, they must be removed from later stages. The count recurrence is
$C_{p^+}(x)=C_p(x)-\left(C_p\!\left(\left\lfloor\frac{x}{p}\right\rfloor\right)-C_p(p-1)\right),$
where \(p^+\) denotes the state immediately after processing \(p\). The corresponding sum recurrence removes the actual composite values \(p\,m\):
$T_{p^+}(x)=T_p(x)-p\left(T_p\!\left(\left\lfloor\frac{x}{p}\right\rfloor\right)-T_p(p-1)\right).$
Once every prime \(p\le \sqrt N\) has been processed, each composite number has been removed exactly once, so the surviving filtered sum is just the sum of all primes up to \(N\):
$T_{\mathrm{final}}(N)=\sum_{\substack{q\le N\\ q\text{ prime}}} q.$
Step 5: Compress all queried arguments to \(O(\sqrt N)\) values
Let
$v=\left\lfloor\sqrt N\right\rfloor.$
Every argument needed by the recurrence is either at most \(v\), or has the form \(\left\lfloor N/i\right\rfloor\) for some \(1\le i\le v\). This is the standard floor-quotient compression: the number of distinct values is only \(O(\sqrt N)\), not \(O(N)\). So the implementation stores two synchronized views,
$C_p(x),\ T_p(x)\quad \text{for }1\le x\le v,$
and
$C_p\!\left(\left\lfloor\frac{N}{i}\right\rfloor\right),\ T_p\!\left(\left\lfloor\frac{N}{i}\right\rfloor\right)\quad \text{for }1\le i\le v.$
That is why the memory usage stays near \(O(\sqrt N)\).
Worked Example: \(N=10\)
The smallest prime factors are
$s(2),\dots,s(10)=2,3,2,5,2,7,2,3,2,$
so the correct total is \(28\).
Initially, \(T_2(10)=2+3+\cdots+10=54\).
For \(p=2\),
$M_2(10)=C_2(5)-C_2(1)=4-0=4,$
corresponding to \(4,6,8,10\). Their weighted contribution is \(2\cdot 4=8\), and their actual sum \(4+6+8+10=28\) is removed from the filtered sum, leaving \(26\).
For \(p=3\), the current surviving set up to \(10\) is \(\{2,3,5,7,9\}\). Then
$M_3(10)=C_3(3)-C_3(2)=2-1=1,$
which corresponds to \(9\). Its weighted contribution is \(3\), and removing \(9\) leaves \(17\), exactly the sum of the primes \(2+3+5+7\).
Therefore
$S(10)=17+8+3=28.$
How the Code Works
The C++, Python, and Java implementations all begin with \(v=\lfloor\sqrt N\rfloor\) and build four tables: filtered counts and filtered sums for small arguments \(x\le v\), plus the same quantities for large arguments of the form \(\lfloor N/i\rfloor\). The initial values are the unfiltered counts \(x-1\) and the unfiltered sums \(\frac{x(x+1)}{2}-1\).
They then scan \(p=2,3,\dots,v\). A position is treated as prime exactly when the filtered count changes between \(p-1\) and \(p\). For each prime, the implementation first adds
$p\left(C_p\!\left(\left\lfloor\frac{N}{p}\right\rfloor\right)-C_p(p-1)\right)$
to the answer accumulator, accounting for every composite whose smallest prime factor is \(p\). It then applies the recurrence for \(C\) and \(T\) to every stored argument, selecting the small-table or large-table source according to whether the quotient lies below \(\sqrt N\).
The C++ and Python implementations keep exact integer sums during the whole process and reduce modulo \(10^9\) only at the end. The Java implementation keeps the sum tables modulo \(10^9\) during the updates while retaining exact counts; this is valid because only linear combinations of the sums are used, and the final answer itself is required modulo \(10^9\). After the prime loop finishes, the remaining filtered sum is the sum of all primes up to \(N\), so adding it to the accumulator yields \(S(N)\).
Complexity Analysis
Let \(v=\lfloor\sqrt N\rfloor\). The tables use \(O(v)=O(\sqrt N)\) memory. For each prime \(p\le v\), the implementation updates one range of length about \(\min\!\left(v,\left\lfloor N/p^2\right\rfloor\right)\) in the large-domain tables and one range of length about \(\max(0,v-p^2+1)\) in the small-domain tables. Summed over all primes, the work is roughly
$O\!\left(\sum_{p\le v}\left(\min\!\left(v,\frac{N}{p^2}\right)+\max(0,v-p^2)\right)\right),$
which is about \(O\!\left(N^{3/4}/\log N\right)\) for this split-domain sieve. The important point is practical rather than asymptotic finesse: both time and memory are far below any linear-in-\(N\) method, which is why \(N=10^{12}\) is manageable.
Footnotes and References
- Problem page: https://projecteuler.net/problem=521
- Prime factor: Wikipedia — Prime factor
- Prime-counting function: Wikipedia — Prime-counting function
- Meissel-Lehmer algorithm: Wikipedia — Meissel-Lehmer algorithm
- Floor and ceiling functions: Wikipedia — Floor and ceiling functions