🔥 A list of tools, frameworks, and resources for building AI web agents
Updated Feb 27, 2026 - Python
An extensible benchmark for evaluating large language models on planning
A comprehensive set of LLM benchmark scores and provider prices (deprecated; see the README for details)
[NeurIPS 2025] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
Official code for "What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks" (NeurIPS 2023)
How good are LLMs at chemistry?
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language Model for Mainframe Modernization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, and then detect which item truly fits that theme among a collection of misleading candidates.
This repository contains comprehensive pricing and configuration data for LLMs. It powers cost attribution for 200+ enterprises running 400B+ tokens through Portkey AI Gateway every day.
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
Develop reliable AI apps
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
Constrained Decoding of Diffusion LLMs with Context-Free Grammars.
Training and Benchmarking LLMs for Code Preference.
Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Restoring safety in fine-tuned language models through task arithmetic