Deliver safe & effective language models
Python SDK for running evaluations on LLM-generated responses
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
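As a hedged illustration of how multi-aspect, GPT-based judging of this kind typically works, the sketch below asks a model to rate each aspect separately and collects the ratings. The aspect list, prompt wording, and model name are assumptions for this example, not this tool's actual design.

```python
# Hedged sketch of multi-aspect LLM-as-judge scoring. The aspect list, prompt
# wording, and model name are illustrative assumptions, not this tool's design.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASPECTS = ["relevance", "factuality", "fluency"]

def judge(question: str, answer: str, aspect: str) -> str:
    """Ask the model to rate one aspect of an answer on a 1-5 scale."""
    prompt = (
        f"Rate the {aspect} of the answer below on a 1-5 scale. "
        f"Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

scores = {a: judge("What is the capital of France?", "Paris.", a) for a in ASPECTS}
print(scores)  # e.g. {'relevance': '5', 'factuality': '5', 'fluency': '4'}
```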
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
An easy-to-use Python package for running quick, basic QA evaluations. It bundles standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.
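For reference, the two most common of these QA metrics are easy to state in code. The sketch below shows standard exact match and token-level F1; the function names are illustrative, not necessarily this package's actual API.

```python
# Minimal sketch of two standard QA metrics: exact match and token-level F1.
# Function names are illustrative, not necessarily this package's actual API.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))         # 1.0
print(token_f1("in Paris", "Paris France"))  # 0.5 (only "paris" is shared)
```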
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
[ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
scalexi is a versatile open-source Python library, optimized for Python 3.11+, that facilitates low-code development and fine-tuning of diverse Large Language Models (LLMs).
Template for an AI application that extracts job information from a job description using OpenAI functions and LangChain
A measure of estimated confidence that outputs generated by Transformer-based language models are not hallucinated.
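One common signal for confidence estimates of this kind is the model's own token probabilities. The sketch below shows that general idea, assuming per-token log-probabilities are available; it is not necessarily this repository's method.

```python
# Hedged sketch of one common confidence signal: the geometric mean of token
# probabilities (exp of the mean log-probability). This illustrates the general
# idea only; it is an assumption, not necessarily this repository's method.
import math

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Map average per-token log-probability to (0, 1]; higher = more confident."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Per-token log-probs as exposed by some model APIs (example values are made up).
print(round(mean_logprob_confidence([-0.05, -0.20, -0.10]), 2))  # 0.89
```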
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
LLM-based chatbot that uses RAG to guide people around the KIT campus via natural language
A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.
Tools for systematic large language model evaluations
Indic evals for quantised models (AWQ / GPTQ / EXL2)
Production-ready LLM evaluation & guardrails toolkit (provider-agnostic). Generate explainable metrics and ALLOW/WARN/BLOCK recommendations.
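As a hedged sketch of how ALLOW/WARN/BLOCK routing can work, the snippet below maps a risk score in [0, 1] to a recommendation via two thresholds. The threshold values and function name are assumptions for illustration, not this toolkit's API.

```python
# Minimal sketch of threshold-based ALLOW/WARN/BLOCK routing over a risk score.
# The thresholds and function name are illustrative, not this toolkit's API.

def recommend(risk_score: float, warn_at: float = 0.4, block_at: float = 0.8) -> str:
    """Map a risk score in [0, 1] to an action recommendation."""
    if risk_score >= block_at:
        return "BLOCK"
    if risk_score >= warn_at:
        return "WARN"
    return "ALLOW"

print(recommend(0.15))  # ALLOW
print(recommend(0.55))  # WARN
print(recommend(0.92))  # BLOCK
```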