Lightweight yet powerful reasoning ensemble for grade-school math problems.
Claude 3.5 Haiku + dynamic multi-path generation, consensus voting,
mathematical step verification, and answer synthesis, achieving 94.6% accuracy (946/1,000)
on the full GSM8K test set, matching or exceeding frontier-scale models at a fraction of the cost.
Can a small, cheap model, given smart inference-time compute scaling, match or beat the accuracy of much larger, more expensive models?
Short answer: yes.
This project demonstrates that Claude 3.5 Haiku (`claude-3-5-haiku-20241022`), a lightweight and low-cost model, achieves 94.6% accuracy (946/1,000) on the full GSM8K test set when augmented with:
- Multiple parallel reasoning paths using different mathematical strategies
- A dynamic verification engine that re-computes and validates intermediate steps
- Confidence-weighted consensus voting across paths
- Regeneration of low-confidence or erroneous paths before final synthesis
This result is significant because it shows that inference-time reasoning techniques, not just raw model scale, are a viable path to state-of-the-art mathematical reasoning. The work was originally intended for academic publication, but following the deprecation of the Claude 3.5 Haiku model, it is being shared openly with the community instead.
Full terminal output from the 1,000-question evaluation run, including all reasoning paths and per-question scores, is available in `raw_results.txt` at the root of this repository.
- Model: Claude 3.5 Haiku (`claude-3-5-haiku-20241022`), small, fast, low-cost
- 2–3 parallel reasoning paths per question, each using a different mathematical strategy
- Dynamic mathematical verification engine with 20+ pattern types (competition-grade)
- Step-level confidence scoring with automatic regeneration on low-confidence paths
- Confidence-weighted consensus voting + semantic clustering for final answer synthesis
- Full GSM8K test set evaluation: 946 / 1,000 questions correct
- Detailed per-question logs, cost, timing, and validation statistics in `Results/`
| Model | Accuracy | Evaluation Condition | Source |
|---|---|---|---|
| This System (Claude 3.5 Haiku + Multi-Path) | 94.6% (946/1000) | Multi-path CoT + verification | This work |
| Claude 3 Opus | 95.0% | 0-shot CoT | Anthropic Model Card (2024) |
| GPT-4 | 92.0% | 0-shot CoT | OpenAI Technical Report (2023) [1] |
| Claude 3.5 Haiku (baseline) | ~88–90%* | 0-shot CoT, no ensemble | Estimated from Anthropic docs |
| GPT-3.5 Turbo | 57.1% | 5-shot | OpenAI Technical Report (2023) [1] |
*Baseline single-path Haiku performance on GSM8K is not officially published by Anthropic. The figure above is an estimate based on the model tier relative to Claude 3 Haiku (88.9%, Anthropic model card). Results may differ.
Note on comparisons: All cited figures reflect the specific evaluation conditions listed. Direct comparisons should be treated with caution: prompting strategy, number of shots, and answer extraction methods vary across evaluations and can materially affect results. Our system uses multi-path inference-time compute, which is not directly comparable to single-pass 0-shot CoT baselines.
| Model / System | Accuracy |
|---|---|
| This System (Haiku 3.5 + Multi-Path) ⭐ | 94.6% |
| Claude 3 Opus [Anthropic 2024] | 95.0% |
| GPT-4 0-shot CoT [OpenAI 2023] | 92.0% |
| Claude 3 Haiku (0-shot) [est.] | 88.9% |
| GPT-3.5 Turbo 5-shot [OpenAI 2023] | 57.1% |
To reproduce this chart locally with matplotlib, run:
```bash
cd Scripts
python plot_results.py
```

| System | Est. Cost per 1,000 questions | Accuracy |
|---|---|---|
| Claude 3 Opus | ~$15–25 | 95.0% |
| GPT-4 | ~$18–30 | 92.0% |
| This System (Haiku 3.5 + Multi-Path) | ~$2–4 | 94.6% |
| Claude 3.5 Haiku (single-path) | ~$0.60–1.00 | ~88–90%* |
Cost estimates for third-party models are based on publicly listed API pricing as of early 2025 and are approximate. Our system cost reflects actual API spend observed during the evaluation run.
⭐ = Best value zone: high accuracy, low cost
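As a back-of-the-envelope comparison, the table above can be collapsed to cost per correct answer. The sketch below uses the midpoints of the estimated cost ranges, which is an assumption for illustration, not measured spend:

```python
# Midpoints of the estimated cost ranges above (USD per 1,000 questions),
# paired with each system's GSM8K accuracy.
systems = {
    "Claude 3 Opus":          (20.0, 0.950),
    "GPT-4":                  (24.0, 0.920),
    "Haiku 3.5 + Multi-Path": (3.0,  0.946),
}

for name, (cost_per_1k, accuracy) in systems.items():
    # Dollars spent per correctly answered question.
    per_correct = cost_per_1k / (accuracy * 1000)
    print(f"{name}: ${per_correct:.4f} per correct answer")
```

Even at midpoint estimates, the multi-path Haiku system works out several times cheaper per correct answer than either large model.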
| Metric | Value |
|---|---|
| Questions evaluated | 1,000 |
| Correct answers | 946 |
| Accuracy | 94.6% |
| Average paths per question | ~2.4 |
| Average verifications per question | ~3.1 |
| Regeneration rate (low confidence) | ~8% of paths |
| Estimated total cost (USD) | ~$2–4 |
| Average time per question | ~12–18 s |
Full terminal output including all reasoning chains is in `raw_results.txt`.
- **Inference-time compute scaling works.** Generating 2–3 independent reasoning paths and taking a confidence-weighted vote adds meaningful accuracy over a single-pass baseline, without any fine-tuning or model modification.
- **Verification + regeneration adds lift.** Step-level mathematical verification with automatic path repair contributes additional accuracy on top of multi-path voting alone, consistent with findings from Cobbe et al. (2021) [2] on process-level supervision.
- **Remaining errors are largely extraction/format issues.** The majority of failures occur not because the model reasoned incorrectly, but because the final numerical answer was incorrectly extracted or formatted. This is a well-documented GSM8K evaluation artefact [3].
- **Small models are underestimated.** Claude 3.5 Haiku, a small, low-cost model, approaches the accuracy of Claude 3 Opus and exceeds GPT-4 (0-shot CoT) on this benchmark when inference-time strategies are applied.
- **Consensus is not majority vote.** This system uses confidence-weighted clustering rather than raw majority voting, which handles cases where multiple paths converge on the same incorrect answer more gracefully than a simple vote.
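The difference is easy to see in a toy example. The sketch below is not the repo's implementation, just a minimal illustration of confidence-weighted clustering, with numeric normalisation standing in for the semantic clustering step:

```python
from collections import defaultdict

def normalise(answer: str):
    """Map equivalent surface forms ('24', '24.0', '$24') to one key;
    a simplified stand-in for semantic clustering."""
    try:
        return round(float(answer.replace("$", "").replace(",", "")), 6)
    except ValueError:
        return answer

def weighted_consensus(paths):
    """paths: list of (final_answer, confidence) pairs. Returns the
    answer whose cluster has the highest total confidence."""
    clusters = defaultdict(float)
    for answer, confidence in paths:
        clusters[normalise(answer)] += confidence
    return max(clusters, key=clusters.get)

# Two shaky paths agree on 18; one confident path says 24.
paths = [("18", 0.30), ("18.0", 0.35), ("$24", 0.90)]
print(weighted_consensus(paths))  # a raw majority vote would have picked 18
```

Because the lone high-confidence path outweighs the two low-confidence agreeing paths (0.90 vs 0.65), the weighted vote selects 24.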
```mermaid
flowchart TD
    Q[Question] --> C[Question Classifier\ntype + complexity + difficulty]
    C -->|mathematical| MP[Multi-Path Generator\n2-3 parallel strategies]
    MP --> P1[Path 1: Algebraic CoT]
    MP --> P2[Path 2: Numerical Step-by-Step]
    MP --> P3[Path 3: Self-Consistency Style]
    subgraph "Verification & Refinement"
        P1 & P2 & P3 --> V[Math Verifier\nextract + re-compute 20+ pattern types]
        V -->|low confidence or error| R[Regenerate / Repair Path]
        R --> P1 & P2 & P3
    end
    P1 & P2 & P3 --> S[Synthesis\n• Confidence-weighted vote\n• Semantic clustering\n• Best-path selection]
    S --> FA[Final Answer + Calibrated Confidence]
```
**Question Classifier** (`src/classifier.py`): Categorises the problem by type (arithmetic, rate/ratio, percentage, multi-step) and complexity level. Determines how many paths to generate and which strategies to prioritise.

**Multi-Path Generator** (`src/multi_path.py`): Generates 2–3 independent reasoning chains in parallel using different mathematical strategies: algebraic chain-of-thought, explicit numerical step-by-step decomposition, and a self-consistency style that explicitly checks its own working.

**Mathematical Verification Engine** (`src/verifier.py`): Extracts intermediate values and final answers from each path, re-computes them independently using 20+ pattern-matching rules, and assigns a confidence score. Paths that fail verification or fall below a confidence threshold are flagged for regeneration.
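A simplified, hypothetical version of the extraction step might look like the sketch below. The real `src/verifier.py` uses 20+ pattern types, so treat this as an illustration of the idea, not the repo's code:

```python
import re

def extract_final_answer(text: str):
    """Pull a final numeric answer from a reasoning chain, handling
    '$', thousands separators, and 'answer is X' phrasing. A minimal
    stand-in for the verifier's fuller pattern set."""
    # Prefer an explicit "answer is X" / "answer: X" statement.
    match = re.search(r"answer\s*(?:is|:)\s*\$?(-?\d[\d,]*(?:\.\d+)?)", text, re.I)
    if match:
        raw = match.group(1)
    else:
        # Otherwise fall back to the last number in the text.
        numbers = re.findall(r"-?\$?(\d[\d,]*(?:\.\d+)?)", text)
        if not numbers:
            return None
        raw = numbers[-1]
    return float(raw.replace(",", ""))
```

Fragile extraction of exactly this kind is what the takeaways above identify as the dominant remaining failure mode on GSM8K.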
**Synthesis Layer** (`src/synthesis.py`): Aggregates surviving paths using confidence-weighted voting and semantic clustering. Where paths disagree, it selects the cluster with the highest aggregate confidence rather than taking a naive majority vote.
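Putting the four components together, the control flow can be sketched as below. The function names and signatures here are illustrative assumptions, not the actual interfaces in `src/`:

```python
def solve(question, generate, verify, synthesize,
          n_paths=3, threshold=0.6, max_repairs=1):
    """Generate parallel paths, verify each, regenerate any that fall
    below the confidence threshold, then synthesize a final answer."""
    paths = [generate(question, strategy=s) for s in range(n_paths)]
    for _ in range(max_repairs):
        confidences = [verify(p) for p in paths]
        weak = [i for i, c in enumerate(confidences) if c < threshold]
        if not weak:
            break
        for i in weak:  # repair only the failing paths
            paths[i] = generate(question, strategy=i)
    return synthesize([(p, verify(p)) for p in paths])
```

In the real system, `generate` would call the model with a strategy-specific prompt and `synthesize` would perform the confidence-weighted clustering vote.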
- Python 3.9+
- An Anthropic API key: obtain one at console.anthropic.com (you must sign up and add billing to get a key)
- Docker + NVIDIA Container Toolkit (for GPU mode only)
- The GSM8K dataset (auto-downloaded via `datasets` on first run)
⚠️ API Key Required: This project calls the Anthropic API. You will need your own API key and will incur usage costs. See Anthropic's pricing page for current rates. Running 1,000 questions costs approximately $2–4 using Claude 3.5 Haiku.
```bash
git clone https://github.com/jpordoy/-Dynamic-Multi-Chain-Multi-Path-Reasoning-with-Consensus.git
cd -Dynamic-Multi-Chain-Multi-Path-Reasoning-with-Consensus
pip install -r requirements.txt
```

Set your API key:

```bash
cp .env.example .env
# Open .env and replace the placeholder with your actual key:
# ANTHROPIC_API_KEY=sk-ant-api03-...
```

Run:
```bash
# Quick test: 100 questions
python evaluate.py --num_questions 100

# Full evaluation: 1,000 questions
python evaluate.py --num_questions 1000

# With detailed per-question output
python evaluate.py --num_questions 100 --verbose
```

To run in Docker instead:

```bash
cp .env.example .env
# Edit .env: ANTHROPIC_API_KEY=sk-ant-api03-...
docker compose build

# Run 100 questions
docker compose run gsm8k

# Run full 1,000 questions
docker compose run gsm8k python evaluate.py --num_questions 1000
```

Results are saved to `./Results/` on your local machine.
Prefer to run everything in a single file? The notebook in `notebooks/` contains the full pipeline with inline outputs:

```bash
pip install jupyter
jupyter notebook notebooks/gsm8k_full.ipynb
```

Or open directly in VS Code, Google Colab, or JupyterLab. No separate module imports needed: everything runs top-to-bottom in one file.
Results are written to `Results/`:
| File | Contents |
|---|---|
| `gsm8k_results_FINAL.json` | Full results: accuracy, cost, timing, per-question detail |
| `gsm8k_results_25q.json` etc. | Intermediate saves every 25 questions |
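To post-process a results file programmatically, something like the sketch below works. Note that the `questions` and `correct` field names here are assumptions about the JSON schema, so check the actual file first:

```python
import json

def accuracy(results: dict) -> float:
    """Fraction of per-question records marked correct.
    Field names ('questions', 'correct') are assumed, not verified."""
    per_question = results["questions"]
    return sum(q["correct"] for q in per_question) / len(per_question)

# Hypothetical usage against the real file:
# with open("Results/gsm8k_results_FINAL.json") as f:
#     print(f"{accuracy(json.load(f)):.1%}")

demo = {"questions": [{"correct": True}, {"correct": True}, {"correct": False}]}
print(f"{accuracy(demo):.1%}")
```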
```
-Dynamic-Multi-Chain-Multi-Path-Reasoning-with-Consensus/
│
├── evaluate.py            # ← Main script: run this
├── plot_results.py        # Matplotlib comparison chart
├── requirements.txt
├── .env.example           # Copy to .env and add your API key
├── Dockerfile
├── docker-compose.yml
├── raw_results.txt        # Full terminal output from 1,000-question run
│
├── src/
│   ├── __init__.py
│   ├── classifier.py      # Question type + complexity classifier
│   ├── multi_path.py      # Parallel path generation + system orchestrator
│   ├── verifier.py        # Mathematical verification engine (20+ patterns)
│   ├── synthesis.py       # Confidence-weighted consensus + specificity scoring
│   └── utils.py           # Shared enums, dataclasses, answer extraction helpers
│
├── notebooks/
│   └── gsm8k_full.ipynb   # Full pipeline in one notebook
│
└── Results/
    └── gsm8k_results_FINAL.json
```
| Argument | Default | Description |
|---|---|---|
| `--num_questions` | `100` | Number of GSM8K test questions to evaluate |
| `--verbose` | `False` | Print full per-question reasoning chains |
To change model or token settings, edit the constants at the top of `evaluate.py`:

```python
HAIKU_MODEL = "claude-3-5-haiku-20241022"  # Model string
MAX_TOKENS = 800                           # Max tokens per API call
```

Dependencies (`requirements.txt`):

```
anthropic>=0.25.0
datasets>=2.18.0
python-dotenv>=1.0.0
tqdm>=4.66.0
```
[1] OpenAI. GPT-4 Technical Report. arXiv:2303.08774 (2023). https://arxiv.org/abs/2303.08774
[2] Cobbe, K., et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 (2021). https://arxiv.org/abs/2110.14168
[3] Cobbe, K., et al. GSM8K Dataset. OpenAI / Surge AI (2021). https://github.com/openai/grade-school-math
[4] Anthropic. Claude 3 Model Card. (March 2024). https://www.anthropic.com/claude-3-model-card
[5] Wang, X., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 (2022). https://arxiv.org/abs/2203.11171
This project was built on Claude 3.5 Haiku (`claude-3-5-haiku-20241022`), deprecated by Anthropic in early 2025. Results reflect performance at the time of evaluation (January 2026). The architecture is model-agnostic: you can swap in any instruction-following model by changing `HAIKU_MODEL` in `evaluate.py`, though results will differ.
I'm actively interested in collaboration, discussion, and extensions of this work. If any of the following applies to you, please get in touch:
- You're researching inference-time compute scaling or process reward models
- You want to extend this approach to other benchmarks (MATH, MGSM, ARC, etc.)
- You're interested in reproductions or ablation studies
- You've found a bug or have ideas for improving the verification or synthesis layer
- You want to discuss the broader implications of small-model + ensemble approaches
How to reach me:
- Open an issue on this repo (I check regularly)
- Start a discussion for longer conversations
- GitHub: @jpordoy
If you build on this work or find it useful in your research, a citation, star, or acknowledgement is appreciated; it helps others find the project.
Contributions are welcome. If you extend this to other models, datasets, or reasoning strategies, please open a PR.
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Commit your changes: `git commit -m 'Add my feature'`
- Push: `git push origin feature/my-feature`
- Open a Pull Request
Distributed under the MIT License. See LICENSE for details.
- OpenAI / Surge AI for the GSM8K dataset
- Anthropic for the Claude model family and API
- The research community working on chain-of-thought prompting, self-consistency, and inference-time scaling, particularly Cobbe et al. (2021) and Wang et al. (2022)