Can LLMs Really Learn Arithmetic?

Inspired by the success of DeepSeek R1’s RL training, I set out to revisit reinforcement learning (RL) techniques. I have been deeply fascinated by RL ever since AlphaGo defeated human champions. Although OpenAI has applied RL, particularly RLHF, in training ChatGPT, I never fully understood how RL could be effectively integrated into LLM training.

A few years ago, I attempted RL training on a Raspberry Pi-based DonkeyCar, with limited success. That experience gave me a glimpse of RL’s potential, as well as its limitations. After reading DeepSeek’s papers, however, I was impressed by their novel GRPO (Group Relative Policy Optimization), which eliminates the need for a separate value model and greatly simplifies RL-based optimization. I then explored GRPO implementations from huggingface/trl and repositories like TinyZero. While the core algorithm is relatively simple, the surrounding scaffolding code can be quite tedious. As a scientist once said, "To truly understand the world, you must build it yourself."

Replicating GRPO with nanoGPT-rs

Motivated by this principle, I decided to replicate GRPO myself, building it upon nanoGPT-rs, a Rust-based replica of nanoGPT that I created two years ago. By implementing LLMs from scratch, I aim to gain a deeper understanding of transformer architectures.
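The heart of the algorithm is small enough to show inline. Below is a minimal Rust sketch of GRPO’s group-relative advantage (my own illustration, not code from trl, TinyZero, or my repository): instead of a learned value baseline, each sampled completion’s reward is normalized against the other completions in the same group.

```rust
/// Group-relative advantages, the core of GRPO: sample a group of
/// completions for one prompt, score each with a reward function, and
/// normalize the rewards within the group. The group mean acts as the
/// baseline, so no learned value model is required.
fn group_relative_advantages(rewards: &[f32]) -> Vec<f32> {
    let n = rewards.len() as f32;
    let mean = rewards.iter().sum::<f32>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt().max(1e-6); // guard against zero std when all rewards agree
    rewards.iter().map(|r| (r - mean) / std).collect()
}

fn main() {
    // Four sampled completions of the same prompt; only the second is correct.
    let rewards = [0.0, 1.0, 0.0, 0.0];
    println!("{:?}", group_relative_advantages(&rewards));
    // The correct completion receives a large positive advantage (~1.73),
    // the incorrect ones a small negative one (~-0.58).
}
```

These advantages then weight the policy-gradient update, so completions that beat their group average are reinforced and the rest are suppressed.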

I appreciate dfdx, a Rust-based machine learning framework similar to PyTorch, for its ability to check tensor shapes at compile time. This enforces correctness and improves comprehension of neural network components. Unfortunately, dfdx development has slowed, but its approach to type safety remains valuable.
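The toy example below captures the idea with a hypothetical Tensor type of my own (dfdx’s real API looks different): because dimensions live in the type, a shape mismatch becomes a compile error rather than a runtime panic.

```rust
/// A toy illustration of type-level tensor shapes (a hypothetical type,
/// not dfdx's actual API): dimensions are const generics, so mismatched
/// shapes fail to compile.
struct Tensor<const R: usize, const C: usize> {
    data: [[f32; C]; R],
}

/// Matrix multiply: (R x K) * (K x C) -> (R x C). The shared inner
/// dimension K is enforced by the compiler, not checked at runtime.
fn matmul<const R: usize, const K: usize, const C: usize>(
    a: &Tensor<R, K>,
    b: &Tensor<K, C>,
) -> Tensor<R, C> {
    let mut out = Tensor { data: [[0.0; C]; R] };
    for i in 0..R {
        for j in 0..C {
            for k in 0..K {
                out.data[i][j] += a.data[i][k] * b.data[k][j];
            }
        }
    }
    out
}

fn main() {
    let a = Tensor::<2, 3> { data: [[1.0; 3]; 2] };
    let b = Tensor::<3, 4> { data: [[1.0; 4]; 3] };
    let c = matmul(&a, &b); // shapes line up: (2x3) * (3x4) -> (2x4)
    println!("{}", c.data[0][0]); // 3.0
    // matmul(&b, &b) would not compile: (3x4) * (3x4) has no shared inner dim.
}
```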

The Challenge of Arithmetic in LLMs

My goal is to train an LLM to perform simple addition. At first glance, this might seem trivial, but even the most advanced models, like GPT-4, occasionally struggle with arithmetic. The key question is: Do LLMs actually learn arithmetic rules, or do they merely memorize patterns from training data?

To simplify the task, I train the model exclusively on arithmetic expressions, avoiding other textual data. My dataset consists of:

  • Two 1-digit numbers: e.g., 2+1=3
  • Two 2-digit numbers: e.g., 11+12=23
  • Three 1-digit numbers with an intermediate step: e.g., 1+1+1=2+1=3

The model is trained with prompts like:

2+1=  
11+12=  
1+1+1=  
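For illustration, here is a minimal sketch of how such samples can be generated. This is my own reconstruction of the data-generation step, not the repository’s actual code, and it assumes the rand crate.

```rust
use rand::Rng; // rand crate, 0.8-style API

/// Generate one sample of each of the three formats described above.
fn samples(rng: &mut impl Rng) -> Vec<String> {
    // Two 1-digit numbers, e.g. "2+1=3".
    let (a, b) = (rng.gen_range(0..10), rng.gen_range(0..10));
    let one_digit = format!("{a}+{b}={}", a + b);

    // Two 2-digit numbers, e.g. "11+12=23".
    let (c, d) = (rng.gen_range(10..100), rng.gen_range(10..100));
    let two_digit = format!("{c}+{d}={}", c + d);

    // Three 1-digit numbers with the intermediate step spelled out,
    // e.g. "1+1+1=2+1=3".
    let (x, y, z) = (rng.gen_range(0..10), rng.gen_range(0..10), rng.gen_range(0..10));
    let three_terms = format!("{x}+{y}+{z}={}+{z}={}", x + y, x + y + z);

    vec![one_digit, two_digit, three_terms]
}

fn main() {
    let mut rng = rand::thread_rng();
    for s in samples(&mut rng) {
        println!("{s}");
    }
}
```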

Findings: Approximation vs. True Learning

I found that a model pre-trained on the arithmetic data can reach high accuracy within a few epochs, given a sufficiently large training set. After a certain point, however, performance plateaus. RL methods can push accuracy further, but even at near-100% accuracy on seen samples, the model often fails to generalize to unseen ones.

This suggests that the model does not truly learn arithmetic as humans do. Instead, it applies some form of approximation, which is fundamentally different from human logical deduction.


What I Have Learned

1. LLMs Won’t Magically Learn Arithmetic Rules

LLMs do not inherently learn arithmetic rules. While they approximate addition in some way, their method is not necessarily aligned with the rules we use. As a result, their answers are not guaranteed to be 100% correct or predictable.

2. Overfitting Is a Common Pitfall

For simple arithmetic tasks, LLMs tend to overfit the training data. This means that a smaller model is often sufficient: it is less able to memorize specific results and is instead forced to generalize patterns. RL training can help mitigate overfitting by encouraging the model to learn underlying rules rather than just memorizing data.

3. RL Training Requires Extensive Trial and Error

RL training relies on randomly stumbling upon the correct answer and refining the model through repeated trials. This makes it highly inefficient compared to pre-training, which can achieve a reasonable level of accuracy in fewer epochs. However, given sufficient training time, RL can ultimately lead to higher accuracy and better generalization than pure pre-training.
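In my setup, the reward is exactly this kind of sparse signal: the model gets credit only when it happens to produce the right sum. Below is a simplified sketch of such a rule-based correctness reward, consistent with the dataset formats above; the parsing details are my own simplification, not the exact code from my runs.

```rust
/// Sparse correctness reward for an RL rollout: 1.0 if the completion's
/// final answer equals the true sum, 0.0 otherwise. The prompt is e.g.
/// "2+1=" and `completion` is whatever the model generated after it.
fn correctness_reward(prompt: &str, completion: &str) -> f32 {
    // True answer: sum the operands in the prompt ("2+1=" -> 3).
    let expected: i64 = prompt
        .trim_end_matches('=')
        .split('+')
        .filter_map(|t| t.trim().parse::<i64>().ok())
        .sum();

    // The model may emit intermediate steps ("2+1=3"), so take the text
    // after the last '=' as its final answer.
    let answer = completion.rsplit('=').next().unwrap_or("").trim();
    match answer.parse::<i64>() {
        Ok(n) if n == expected => 1.0,
        _ => 0.0,
    }
}

fn main() {
    assert_eq!(correctness_reward("2+1=", "3"), 1.0);
    assert_eq!(correctness_reward("1+1+1=", "2+1=3"), 1.0);
    assert_eq!(correctness_reward("11+12=", "24"), 0.0);
    println!("reward checks passed");
}
```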

4. The Best Approach: Pre-Training Followed by RL

A more effective strategy is to first pre-train the model and then fine-tune it with RL.

  • Pre-training helps the model learn fundamental rules, such as recognizing that outputs should be numbers. This reduces the search space during RL training and speeds up convergence.
  • However, excessive pre-training can lead to overfitting, constraining the model’s ability to explore new solutions during RL. If the model learns incorrect arithmetic rules early on, they are difficult to unlearn later: gradient descent makes only local parameter updates, so the model can stay trapped in a local optimum. A control-flow sketch of this two-stage schedule follows the list.
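The sketch below shows only the schedule; pretrain_epoch and grpo_epoch are hypothetical stand-ins for the real training steps, and the accuracy cutoff is an assumed knob, not a tuned value.

```rust
// Hypothetical stand-ins for the real training steps; each would run one
// epoch and return accuracy on a held-out set.
fn pretrain_epoch() -> f32 { 0.0 }
fn grpo_epoch() -> f32 { 0.0 }

fn main() {
    // Stage 1: supervised pre-training, stopped early so the policy
    // still has room to explore under RL.
    const PRETRAIN_ACC_CAP: f32 = 0.8; // assumed cutoff, not tuned
    for _ in 0..50 {
        if pretrain_epoch() >= PRETRAIN_ACC_CAP {
            break;
        }
    }

    // Stage 2: GRPO fine-tuning picks up from the partially trained policy.
    for _ in 0..200 {
        let _acc = grpo_epoch();
    }
}
```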

5. Data Diversity Is Critical for Generalization

A diverse training dataset prevents overfitting, and the larger the model, the more diverse the data it requires. If a model becomes too confident too early, it stops exploring and fails to generalize.

The ideal learning curve should have a sudden rise in accuracy after a long period of exploration. If accuracy increases linearly and consistently, it indicates that the model is overfitting rather than generalizing. This might explain why only very large models trained on massive datasets achieve true generalization.

The same principle applies to humans: if a person is exposed only to a narrow set of knowledge, they may become overly confident in their ideas too soon and struggle to adapt to new perspectives.

6. Can an LLM Build Doubt?

For a model to truly discover new rules, it needs to experience doubt and skepticism—a necessary precursor to paradigm shifts. However, current RL methods, such as PPO, emphasize gradual parameter updates, discouraging drastic changes.

Historically, scientists like Copernicus, Galileo, and Newton made groundbreaking discoveries by rejecting overly complex models that fit the existing observations but lacked true explanatory power. Before the heliocentric model, astronomers kept adding epicycles to explain planetary motion within a geocentric framework. LLMs, in their current form, behave similarly: instead of discarding flawed approximations, they layer on increasingly complex ones.

Can an LLM independently challenge its own learned framework and arrive at a drastically simpler, more correct solution? I don’t think so. Unlike humans, LLMs lack the ability to discard previous assumptions and make radical leaps in understanding.

7. Can an LLM Learn Math Purely from Mathematical Data?

Mathematics may seem self-contained, but human learning relies on external knowledge and experience. Even basic arithmetic, such as 9 + 2 = 11, requires understanding concepts like:

  • What numbers represent
  • The meaning of addition
  • The structure of the decimal system
  • The commutative property (e.g., 9 + 2 is the same as 2 + 9)

Children often learn arithmetic using physical interactions, such as counting on fingers. Without such real-world context, how does an LLM develop number sense? It must devise an entirely alien method for solving math problems—one that may never fully align with human reasoning. This raises doubts about whether an LLM can ever truly internalize arithmetic the way humans do.


Comparing My Approach to Previous Research

Two years ago, another study attempted to teach arithmetic using nanoGPT, but it relied purely on pre-training. Its conclusion was that data formatting is crucial: plain-text representations (like my dataset) were inefficient, especially for learning the carry operation in addition. The authors found that explicit reasoning steps (similar to how teachers explain math) significantly improved learning.

However, my goal is different. Instead of explicitly guiding the model like a teacher, I wanted to see if LLMs can independently discover arithmetic rules purely through pattern observation and trial-and-error learning.

So far, the answer is no.

Conclusion: LLMs Will Always Be an Approximation

Based on my findings, I don’t believe LLMs—whether small or large—can ever truly learn arithmetic in the way humans do, at least with their current architectures. They approximate results in ways we don’t fully understand, making them inherently unreliable for arithmetic.

If we need guaranteed correctness, the best solution isn’t an LLM—it’s a calculator, explicitly programmed to follow well-defined arithmetic rules.

Code