🔥 A list of tools, frameworks, and resources for building AI web agents
Updated Feb 27, 2026 - Python
An extensible benchmark for evaluating large language models on planning
A comprehensive set of LLM benchmark scores and provider prices (deprecated; see the README for details)
[NeurIPS 2025] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
Official code for "What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks" (NeurIPS 2023)
How good are LLMs at chemistry?
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language Model for Mainframe Modernization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, and then detect which item truly fits that theme among a collection of misleading candidates.
This repository contains comprehensive pricing and configuration data for LLMs. It powers cost attribution for 200+ enterprises running 400B+ tokens through Portkey AI Gateway every day.
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
Develop reliable AI apps
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Constrained Decoding of Diffusion LLMs with Context-Free Grammars.
Training and Benchmarking LLMs for Code Preference.
Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Restoring safety in fine-tuned language models through task arithmetic