General Knowledge Evaluator (MultiChallenge benchmark)
We are hiring expert evaluators to support a high-impact reasoning evaluation workflow in partnership with a leading AI research lab. The work centers on the MultiChallenge benchmark, which is designed to test large language models (LLMs) on multi-turn conversational reasoning, a capability where even top models fall short today.
This benchmark does not focus on domain-specific expertise, but rather on reasoning consistency, instruction retention, and contextual inference across loosely structured conversations on general topics.
Be sure to use the referral link below to reach the recruiting platform, where the next steps are to register, upload your resume, and complete an AI-led interview.
https://work.mercor.com/jobs/list_AAABmHvRFlyxHvS-YFdJCbzl?referralCode=9df2a9e1-2f06-11ef-ae42-42010a400fc4&utm_source=referral&utm_medium=share&utm_campaign=job_referral
What is MultiChallenge?
MultiChallenge is a newly released benchmark targeting reasoning failures that occur in multi-turn interactions between humans and LLMs. It evaluates four categories of failure modes:
• Instruction Retention – Does the model persistently follow instructions across turns?
• Inference Memory – Can it infer or recall relevant user details from earlier conversation history?
• Reliable Versioned Editing – Can it revise content through multi-step iteration without forgetting or hallucinating?
• Self-Coherence – Does the model avoid contradicting its earlier claims, particularly under user pressure?
The benchmark is designed to surface realistic, high-difficulty conversational reasoning challenges. Despite scoring highly on other multi-turn benchmarks, current frontier models achieve less than 50% accuracy on MultiChallenge.
Who We're Hiring
We are seeking evaluators with strong backgrounds in Logic, Philosophy, or related disciplines — particularly those trained to track argument structure, detect reasoning errors, and evaluate coherence across extended discourse.
This workflow is ideal for individuals with academic experience in:
• Logic
• Analytic Philosophy
• Epistemology
• Formal Semantics
• Cognitive Science
• Linguistics (with a reasoning focus)
Key Responsibilities
• Evaluate the reasoning quality of LLM outputs across 8–10 turn conversations.
• Identify errors in instruction-following, factual coherence, inference, and revision handling.
• Complete evaluations using a structured rubric and short written justifications (see the illustrative sketch after this list).
• Work asynchronously using provided tools and examples.
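For candidates wondering what a rubric-based judgment might look like in practice, here is a minimal, purely illustrative Python sketch (Python is optional for this role). The class, field, and category names are hypothetical and are not the actual evaluation tooling; they simply mirror the four MultiChallenge failure modes and the "verdict plus short written justification" workflow described above.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the MultiChallenge failure modes above.
CATEGORIES = [
    "instruction_retention",
    "inference_memory",
    "reliable_versioned_editing",
    "self_coherence",
]

@dataclass
class EvaluationRecord:
    """One evaluator judgment on a single multi-turn conversation (illustrative only)."""
    conversation_id: str   # identifier of the conversation being reviewed
    category: str          # which failure mode the conversation probes
    passed: bool           # did the model's final response satisfy the rubric?
    justification: str     # short written rationale, as described in the responsibilities

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"Unknown category: {self.category}")

# Example: flagging a self-coherence failure where the model reversed an earlier claim.
record = EvaluationRecord(
    conversation_id="conv-0042",
    category="self_coherence",
    passed=False,
    justification="In turn 9 the model contradicts the definition it committed to in turn 3.",
)
print(record)
```

In practice the evaluation is done through provided tools rather than code; the sketch is only meant to convey the shape of a judgment: one conversation, one failure-mode category, a pass/fail verdict, and a brief written justification.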
You’re a Strong Fit If You Have:
• A PhD (or are currently a PhD candidate) in Logic, Philosophy, or a closely related field.
• Experience analyzing or writing complex arguments.
• Excellent written communication and generalist reasoning ability.
• Comfort working independently and asynchronously.
• Familiarity with Python or LLM evaluation tools (helpful but not required).
Role Details
• Part-time (10–20 hours/week) with flexible scheduling.
• 100% remote and asynchronous — work from anywhere.
• Contractor position via Mercor, paid hourly.
• Competitive rates: $20–$35/hour depending on expertise.
• Weekly payments processed securely through Stripe Connect.
Job Types: Contract, Temporary
Pay: $20.00 - $35.00 per hour
Expected hours: 10 – 20 per week
Work Location: Remote
Apply to this job