Problem 89: Roman Numerals
View on Project EulerProject Euler Problem 89 Solution
EulerSolve provides an optimized solution for Project Euler Problem 89, Roman Numerals, with C++, Python, Java, and a step-by-step mathematical explanation.
Problem Summary The input file contains 1000 Roman numerals. Every numeral is valid, but some are not written in shortest form. The task is to rewrite each line in canonical minimal Roman notation and add up how many characters are saved. If \(s\) is one numeral, let \(\operatorname{val}(s)\) be its integer value and let \(R(n)\) be the minimal Roman representation of the integer \(n\). The required total is therefore $\sum_s \bigl(|s|-|R(\operatorname{val}(s))|\bigr).$ So the entire problem reduces to two precise operations on a single Roman string: evaluate it correctly, then reconstruct the same number in minimal form. Mathematical Approach The Roman alphabet is \(\{M,D,C,L,X,V,I\}\) with values \(1000,500,100,50,10,5,1\). Minimal Roman numerals also use the six subtractive pairs \(IV, IX, XL, XC, CD,\) and \(CM\). The implementations solve the problem entirely through these local value rules and the decimal block structure of Roman notation. Signed Contribution of Each Symbol Write a Roman string as \(s=s_1s_2\cdots s_m\), and let \(v(s_i)\) denote the value of the \(i\)-th symbol. In a valid numeral, a symbol contributes negatively exactly when it is the smaller member of a subtractive pair; otherwise it contributes positively....
Detailed mathematical approach
Problem Summary
The input file contains 1000 Roman numerals. Every numeral is valid, but some are not written in shortest form. The task is to rewrite each line in canonical minimal Roman notation and add up how many characters are saved.
If \(s\) is one numeral, let \(\operatorname{val}(s)\) be its integer value and let \(R(n)\) be the minimal Roman representation of the integer \(n\). The required total is therefore
$\sum_s \bigl(|s|-|R(\operatorname{val}(s))|\bigr).$
So the entire problem reduces to two precise operations on a single Roman string: evaluate it correctly, then reconstruct the same number in minimal form.
Mathematical Approach
The Roman alphabet is \(\{M,D,C,L,X,V,I\}\) with values \(1000,500,100,50,10,5,1\). Minimal Roman numerals also use the six subtractive pairs \(IV, IX, XL, XC, CD,\) and \(CM\). The implementations solve the problem entirely through these local value rules and the decimal block structure of Roman notation.
Signed Contribution of Each Symbol
Write a Roman string as \(s=s_1s_2\cdots s_m\), and let \(v(s_i)\) denote the value of the \(i\)-th symbol. In a valid numeral, a symbol contributes negatively exactly when it is the smaller member of a subtractive pair; otherwise it contributes positively. That gives the evaluation rule
$\operatorname{val}(s)=\sum_{i=1}^{m}\sigma_i\,v(s_i),\qquad \sigma_i= \begin{cases} -1,& i<m \text{ and } v(s_i)<v(s_{i+1}),\\ 1,& \text{otherwise}. \end{cases}$
This is why the parser only needs one-character lookahead. A rise in value means subtraction, while every non-rise means addition. For example,
$MCMXCIX=1000-100+1000-10+100-1+10=1999.$
The Minimal Token Table
To rebuild a number in shortest form, the implementations use the descending token list
$\mathcal{T}=\bigl[(1000,M),(900,CM),(500,D),(400,CD),(100,C),(90,XC),(50,L),(40,XL),(10,X),(9,IX),(5,V),(4,IV),(1,I)\bigr].$
Starting from an integer \(n\), repeatedly take the largest token whose value does not exceed the current remainder, append its Roman symbol, and subtract its value. Because the table already contains the subtractive forms, the output is produced directly in canonical notation.
Why the Greedy Reconstruction Is Minimal
The key fact is that Roman numerals split naturally by decimal place. If
$n=1000a+100b+10c+d,$
then the minimal Roman numeral is obtained by concatenating a thousands block, a hundreds block, a tens block, and a ones block.
For a place whose symbols are \((u,f,t)\), meaning one, five, and ten times that place value, the minimal digit blocks satisfy
$\bigl(\phi_{u,f,t}(0),\phi_{u,f,t}(1),\dots,\phi_{u,f,t}(9)\bigr)=\bigl(\varepsilon,u,uu,uuu,uf,f,fu,fuu,fuuu,ut\bigr).$
Thus the ones place uses \((I,V,X)\), the tens place uses \((X,L,C)\), and the hundreds place uses \((C,D,M)\). The thousands block is simply \(a\) copies of \(M\).
No shorter valid block exists for any digit. The digit 4 is best written as \(IV\), not \(IIII\); the digit 9 is best written as \(IX\), not \(VIIII\); the digit 8 is best written as \(VIII\), not a longer additive alternative. Since the decimal places do not interfere with one another, minimizing each block separately minimizes the whole numeral.
The descending table \(\mathcal{T}\) is exactly a procedural version of this place-value decomposition. After all thousands are removed, the remainder is below 1000, so the next token choices determine the hundreds digit, then the tens digit, then the ones digit.
Worked Example
Consider the non-minimal numeral \(MDCCCCLXXXXVIIII\). Its value is
$1000+(500+4\cdot100)+(50+4\cdot10)+(5+4\cdot1)=1999.$
Now decompose \(1999\) by decimal place:
$1999=1\cdot1000+9\cdot100+9\cdot10+9.$
The minimal blocks are therefore
$M,\qquad CM,\qquad XC,\qquad IX,$
so the canonical numeral is \(MCMXCIX\). The original string has length \(16\), the minimal string has length \(7\), and this one line saves
$16-7=9$
characters. The checkpoint-style simplifications \(VIIII\to IX\) and \(LXXXX\to XC\) are smaller instances of the same rule.
The Quantity Being Summed
Once \(R(n)\) is defined by the minimal reconstruction rule, the final answer is just
$\text{saved}=\sum_s \bigl(|s|-|R(\operatorname{val}(s))|\bigr).$
Each line is independent of every other line. There is no hidden global combinatorics here; the entire mathematics lives inside correct Roman evaluation and minimal Roman re-encoding.
How the Code Works
Interpreting One Numeral
The C++, Python, and Java implementations first map each Roman character to its integer value. They scan the string from left to right and use one-step lookahead: if the next symbol is larger, the current value is subtracted; otherwise it is added. This computes \(\operatorname{val}(s)\) in a single pass.
Building the Canonical Numeral
After obtaining the integer value, the implementation walks through the fixed table \(\mathcal{T}\) from 1000 down to 1. For each entry, it appends the corresponding Roman token as many times as possible before moving to the next smaller token. Because \(CM, CD, XC, XL, IX,\) and \(IV\) already appear in the table, the produced string is the minimal Roman numeral itself, not merely an equivalent one.
Accumulating the Total Saving
For each input line, the implementation compares the length of the original numeral with the length of the reconstructed minimal numeral and adds the difference to a running total. The three language versions differ only in syntax; the algorithmic structure is the same.
Complexity Analysis
If the input contains numerals \(s_1,\dots,s_N\) with total length \(L_{\text{tot}}=\sum |s_i|\), then parsing costs \(O(L_{\text{tot}})\), because every input character is examined once. Reconstruction is also linear in the length of the produced minimal numerals, and the token table has fixed size 13.
For this problem the numerals are short, so the constants are tiny. Asymptotically the whole computation is linear in the total amount of input text, and the extra memory usage is \(O(1)\) beyond the current line and the minimal string being built.
Footnotes and References
- Problem page: Project Euler 89
- Roman numerals: Wikipedia - Roman numerals
- Numeral systems: Wikipedia - Numeral system
- Greedy algorithms: Wikipedia - Greedy algorithm