General Knowledge Evaluator (MultiChallenge benchmark)
We are hiring expert evaluators to support a high-impact reasoning evaluation workflow in partnership with a leading AI research lab. The work centers on the MultiChallenge benchmark, which is designed to test large language models (LLMs) on multi-turn conversational reasoning, a capability where even top models fall short today.
This benchmark does not focus on domain-specific expertise, but rather on reasoning consistency, instruction retention, and contextual inference across loosely structured conversations on general topics.
Be sure to use the referral link below to reach the recruiting platform, where the next steps are to register, upload your resume, and complete an AI-led interview.
https://work.mercor.com/jobs/list_AAABmHvRFlyxHvS-YFdJCbzl?referralCode=9df2a9e1-2f06-11ef-ae42-42010a400fc4&utm_source=referral&utm_medium=share&utm_campaign=job_referral
What is MultiChallenge?
MultiChallenge is a newly released benchmark targeting reasoning failures that occur in multi-turn interactions between humans and LLMs. It evaluates four categories of failure modes:
• Instruction Retention – Does the model persistently follow instructions across turns?
• Inference Memory – Can it infer or recall relevant user details from earlier conversation history?
• Reliable Versioned Editing – Can it revise content through multi-step iteration without forgetting or hallucinating?
• Self-Coherence – Does the model avoid contradicting its earlier claims, particularly under user pressure?
The benchmark is designed to surface realistic, high-difficulty conversational reasoning challenges. Despite scoring highly on other multi-turn benchmarks, current frontier models achieve less than 50% accuracy on MultiChallenge.
Who We're Hiring
We are seeking evaluators with strong backgrounds in Logic, Philosophy, or related disciplines — particularly those trained to track argument structure, detect reasoning errors, and evaluate coherence across extended discourse.
This workflow is ideal for individuals with academic experience in:
• Logic
• Analytic Philosophy
• Epistemology
• Formal Semantics
• Cognitive Science
• Linguistics (with a reasoning focus)
Key Responsibilities
• Evaluate the reasoning quality of LLM outputs across 8–10 turn conversations.
• Identify errors in instruction-following, factual coherence, inference, and revision handling.
• Complete evaluations using a structured rubric and short written justifications (see the illustrative sketch after this list).
• Work asynchronously using provided tools and examples.
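For candidates wondering what a rubric-based judgment might look like in practice, here is a minimal, purely illustrative Python sketch (Python is optional for this role). The class, field, and category names are hypothetical and are not the actual evaluation tooling; they simply mirror the four MultiChallenge failure modes and the "verdict plus short written justification" workflow described above.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the MultiChallenge failure modes above.
CATEGORIES = [
    "instruction_retention",
    "inference_memory",
    "reliable_versioned_editing",
    "self_coherence",
]

@dataclass
class EvaluationRecord:
    """One evaluator judgment on a single multi-turn conversation (illustrative only)."""
    conversation_id: str   # identifier of the conversation being reviewed
    category: str          # which failure mode the conversation probes
    passed: bool           # did the model's final response satisfy the rubric?
    justification: str     # short written rationale, as described in the responsibilities

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"Unknown category: {self.category}")

# Example: flagging a self-coherence failure where the model reversed an earlier claim.
record = EvaluationRecord(
    conversation_id="conv-0042",
    category="self_coherence",
    passed=False,
    justification="In turn 9 the model contradicts the definition it committed to in turn 3.",
)
print(record)
```

In practice the evaluation is done through provided tools rather than code; the sketch is only meant to convey the shape of a judgment: one conversation, one failure-mode category, a pass/fail verdict, and a brief written justification.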
You’re a Strong Fit If You Have:
• A PhD (or are currently a PhD candidate) in Logic, Philosophy, or a closely related field.
• Experience analyzing or writing complex arguments.
• Excellent written communication and generalist reasoning ability.
• Comfort working independently and asynchronously.
• Familiarity with Python or LLM evaluation tools (helpful but not required).
Role Details
• Part-time (10–20 hours/week) with flexible scheduling.
• 100% remote and asynchronous — work from anywhere.
• Contractor position via Mercor, paid hourly.
• Competitive rates: $20–$35/hour depending on expertise.
• Weekly payments processed securely through Stripe Connect.
Job Types: Contract, Temporary
Pay: $20.00 - $35.00 per hour
Expected hours: 10 – 20 per week
Work Location: Remote
Apply to this job