
Closed
Paid on delivery
I’m building a new benchmark to measure how well frontier language models cope with genuine scientific workflow. Your role is to craft one self-contained, terminal-driven research task that feels exactly like real lab or data-science work (analysing raw data, running simulations, validating a hypothesis, comparing competing methods) rather than a polished textbook exercise. The task should force multi-step reasoning and code composition so thoroughly that today’s best models fail at least 80 % of the time, while an expert human (you) can still solve it reproducibly.

What I need from you
• A complete task package zipped together:
  – [login to view URL] explaining the workflow, required inputs, expected outputs and the objective success criteria
  – A fully reproducible Docker environment with every dependency and the dataset already inside (no outside downloads)
  – [login to view URL], an “oracle” reference solution that passes the local tests three-for-three
  – A deterministic test suite, callable from the command line, that verifies objective success or failure without human judgment
  – [login to view URL] metadata so the benchmark harness can auto-discover and grade the task

Quality expectations
• Multi-step logic, genuine research flavour, and objective numerically-verifiable outputs
• No LLM-generated content anywhere in the task materials; everything must be authored by you
• The oracle must run cleanly inside the container on a fresh machine and reproduce identical results each time

Acceptance criteria
1. Running `./[login to view URL]` inside the Docker container returns all passes without flakiness.
2. Removing or altering any key step in [login to view URL] causes at least one test to fail.
3. Frontier models (GPT-4-Turbo or Gemini-3.5-Pro) fail the tests in more than 4 out of 5 blind trials.

Ideal background
You’re comfortable designing authentic research pipelines in biology, chemistry, physics, data science or machine learning, and you know your way around Python, Bash and Docker well enough to make everything turnkey. If this sounds like a stimulating challenge, let’s talk through your proposed topic and dataset so you can start building.
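
To make the "deterministic, three-for-three" requirement concrete, here is a minimal sketch of what such a check could look like. It is only an illustration under stated assumptions: the `oracle`, `digest` and `check` names, the seeded Monte Carlo estimate of pi, and the tolerance are all hypothetical stand-ins, not part of the actual task package or its harness.

```python
import hashlib
import json
import math
import random


def oracle(seed: int = 0) -> dict:
    """Hypothetical stand-in for an oracle: a seeded Monte Carlo estimate of pi."""
    rng = random.Random(seed)  # local RNG, so no global state leaks between runs
    n = 20_000
    hits = sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return {"n": n, "pi_estimate": 4.0 * hits / n}


def digest(result: dict) -> str:
    """Stable fingerprint of a result dict (sorted keys, fixed serialisation)."""
    return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()


def check(result: dict, tol: float = 0.1) -> bool:
    """Objective pass/fail: the estimate must lie within tol of the known value."""
    return math.isclose(result["pi_estimate"], math.pi, abs_tol=tol)


# Run the oracle three times with the same seed, mirroring "three-for-three".
runs = [oracle(seed=0) for _ in range(3)]
identical = len({digest(r) for r in runs}) == 1  # bit-identical outputs each run
passed = all(check(r) for r in runs)             # numeric criterion, no human judgment
print("identical:", identical, "passed:", passed)
```

Fixing the seed and hashing a canonical serialisation covers the "identical results each time" expectation, while the tolerance check is the kind of numerically verifiable criterion a real test suite would apply to the task's actual outputs.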
Project ID: 40480451
45 proposals
Remote project
Active 2 days ago
45 freelancers are bidding on average $145 USD for this job

Hello, I would love the chance to work on your project. This is a challenge I would genuinely enjoy. I can design a research-style benchmark with Python, Bash, Docker, NumPy, Pandas, SciPy, and a deterministic test framework that requires hypothesis validation, data analysis, simulation, and multi-step reasoning rather than simple code generation. One thing I'd like to understand: do you want failure to come primarily from reasoning mistakes, experimental design mistakes, or hidden dependencies between analysis stages? That decision has a huge impact on creating a task that humans solve reliably while frontier models consistently fail. Can we connect over chat to discuss the project further? Best regards, Dev Singh
$250 USD in 4 days
6.6

Hi, Your project to design a robust STEM AI benchmark that simulates genuine scientific workflow is truly thrilling! With strong expertise in Python, Bash scripting, Docker, and data science, I am confident in crafting a multi-step, research-driven terminal task that pushes frontier language models to their limits while remaining solvable by experts. My approach involves designing a reproducible pipeline with realistic raw data analysis, hypothesis validation, and model comparison steps, all neatly packaged with deterministic tests and a clean Docker environment ready to run. I would love to start by discussing candidate topics and datasets that fit your vision, ensuring the task challenges AI models as intended while maintaining scientific rigor and reproducibility. From there, I can deliver a fully self-contained task package within an agreed timeline. Looking forward to your thoughts and the opportunity to collaborate on something truly cutting-edge. Which scientific domain or dataset do you envision for the benchmark task to maximize both complexity and relevance? Best regards,
$155 USD in 13 days
4.2

Hi, I’m a multidisciplinary engineer focused on design + analysis, delivering production-ready CAD and simulation-backed solutions. I work across mechanical and architectural scopes, with clear drawings, practical detailing, and fast communication. Feel free to check my profile for similar projects; I'm happy to review your scope and suggest the best next steps. G. Zaralı
$140 USD in 7 days
4.3

Hi, This is an interesting benchmark-design task, and I understand that you are not looking for a toy coding puzzle. You need a self-contained research-style workflow that feels like real scientific or data-science work, with raw inputs, multi-step reasoning, deterministic validation, Docker packaging, an oracle solution, and objective tests. I can design a terminal-driven task around a realistic data-analysis pipeline where the participant must clean noisy data, choose the correct processing assumptions, validate a hypothesis, and produce numerically checked outputs. My focus would be making the task reproducible for an expert human while still difficult for frontier models because success depends on several connected decisions rather than one obvious script. I’m comfortable with Python, Bash, Docker, reproducible environments, test harnesses, and building objective pass/fail checks. I would include the dataset inside the container, provide a reference oracle solution, write deterministic command-line tests, and prepare the required task metadata for auto-discovery. One topic I would propose is a noisy experimental time-series or simulation-validation workflow where incorrect preprocessing, leakage, or wrong model comparison causes measurable test failures. P.S. I care about benchmark quality, so I would design the tests to catch shallow solutions instead of only checking whether a file exists or a script runs.
$150 USD in 3 days
3.7

Hi, I've built automated workflows for complex scientific tasks involving data analysis, simulations, and hypothesis testing, which closely align with your requirements. I can create a detailed, self-contained task that forces multi-step reasoning and code composition, ensuring that frontier language models struggle with at least 80% of the cases. Let's start with a small test task to ensure it meets your needs before moving forward. Best Regards, Ivica
$140 USD in 7 days
3.6

Hello! I’ve built a similar benchmark system to evaluate language models in scientific workflows, resulting in a 70% performance drop for models when faced with complex multi-step reasoning tasks. I’d love to share how I approached it and the implementation details in our chat. For your project, I’d focus on creating a task that involves raw data analysis and simulation, ensuring it demands genuine problem-solving skills while being reproducible in a Docker environment. What specific scientific domain are you considering for this benchmark? If you’re open, I can share my previous build that aligns with your goals, and we can see if it fits.
$140 USD in 7 days
0.6

Hi, I can help with a robust STEM AI benchmark that reliably grades frontier models on a reproducible scientific workflow. I’ll start by drafting a terminal-driven research task specification (objective, inputs, outputs, and numeric success criteria), then build a Dockerized environment with an embedded dataset plus a deterministic test suite. sh that passes locally three-for-three and ensure tests fail if key steps are changed. , hypothesis validation, simulation fitting, or method comparison)? And should the oracle be Python-first or Bash+Python? If you share your preferred angle, I’ll propose a concrete task plan and file structure.
$84 USD in 3 days
0.4

just saw your benchmark project — the "today’s best models fail at least 80% of the time" part is what caught my attention. i built a similar gauntlet-style test for data pipelines where i forced models to reconstruct messy sensor logs with missing timestamps and unit mismatches, and even gpt-4 kept hallucinating. for your terminal-driven science task, i’m thinking about a raw spectrometry dataset where the model must calibrate, run a monte carlo simulation to check significance, then compare two statistical methods — textbook approach fails because the noise isn’t gaussian. are you envisioning the task as one long sequential pipeline or branching paths where different reasoning strategies lead to different valid outputs?
$30 USD in 2 days
0.4

I can design a robust STEM AI benchmark tailored to accurately measure frontier language models' scientific reasoning capabilities. My experience includes developing evaluation frameworks for AI models in technical domains, ensuring rigorous and relevant assessment criteria. I would focus on creating clear, scalable metrics that capture genuine scientific understanding, backed by diverse test cases spanning key STEM areas. Do you already have specific model types or scientific disciplines in mind for this benchmark?
$140 USD in 7 days
0.0

Testing frontier models on hard STEM problems is tricky because sourcing questions that require multi-step reasoning, not pattern matching, is the hardest part. I can build the dataset pipeline in Python with Docker and have a working benchmark scaffold ready in 4 days. Available to start today. The bid reflects what is in the description. Final numbers depend on the domains and question count we settle on. Want to jump on a quick call?
$150 USD in 10 days
0.0

You need a benchmark that finally exposes the reasoning gaps in frontier models by simulating a genuine, messy scientific workflow. I will design a self-contained research task—centered on a complex data science or physics pipeline—that requires precise multi-step code composition and numerical validation, ensuring the >80% failure rate you require. I’ll handle the entire technical stack: from crafting the raw dataset and the deterministic test suite to building the reproducible Docker environment. You won't need to worry about flakiness or environment drift; the package will be truly turnkey. My plan: 1. Propose a high-complexity STEM topic and dataset for your approval. 2. Build the Dockerized environment and raw data inputs. 3. Develop the oracle solution and the command-line validation suite. 4. Verify failure rates against GPT-4/Gemini. Within 48 hours, you'll have a detailed proposal of the scientific problem and the success criteria. Everything will be authored manually, without LLM assistance, to maintain the benchmark's integrity. Let’s discuss the specific scientific domain you're prioritizing for this task.
$250 USD in 7 days
0.0

Hi, You're looking to create a challenging benchmark that truly tests how well advanced language models handle real-world scientific tasks. I appreciate the focus on developing a self-contained research task that mimics authentic lab work, demanding multi-step reasoning and code composition. My approach would involve designing a task around a compelling dataset, ensuring it requires complex analysis and simulation that would stump even the best models while remaining reproducible for human experts. With a strong background in data science and experience in building robust, reproducible environments using Python, Bash, and Docker, I can create all the necessary components. This includes the instruction manual, Docker setup, oracle solution, and a deterministic test suite, ensuring everything runs smoothly and meets your criteria. I'm excited about the opportunity to contribute to this innovative project and deliver a benchmark that challenges frontier models effectively. Best regards, Novalitz Tech
$30 USD in 3 days
0.0

Hello, I currently lack any work experience; consequently, I am searching for a platform where I can build a skills profile, develop myself into a competent employee, and secure a position at a reputable company.
$140 USD in 7 days
0.0

Hi, I've designed and executed complex research workflows in biology and data science, focusing on multi-step logic and reproducibility. My experience crafting detailed, authentic research tasks aligns well with your project needs. If you'd like, we can start with a small trial task to ensure alignment before committing to the full project. Best Regards, Rosmar
$140 USD in 7 days
0.0

Hello, this is exactly the kind of challenge I enjoy. I have experience building reproducible research pipelines, benchmark tasks, Dockerized environments, automated grading systems, and data-science workflows involving simulation, statistical validation, and machine learning evaluation. Rather than creating a synthetic puzzle, I would design a realistic scientific task with hidden failure modes, requiring data cleaning, model selection, uncertainty analysis, and reproducible reporting. The package would include a fully self-contained Docker image, deterministic tests, oracle solution, benchmark metadata, and objective pass/fail criteria. My focus is creating a task that feels like genuine research work while remaining reproducible, verifiable, and difficult for current frontier models to solve consistently. I would be excited to contribute a benchmark that meaningfully measures scientific reasoning rather than prompt-following ability, and I would love to work with you on this project.
$100 USD in 5 days
0.0

We recently helped a client develop a cutting-edge AI project that pushed the boundaries of innovation and scientific exploration. We can assist you in crafting a robust STEM AI benchmark that simulates real scientific workflows with multi-step reasoning and code composition challenges. Our task package will include detailed instructions, a reproducible Docker environment, an oracle reference solution, a test suite for objective verification, and metadata for seamless grading. I understand the need for genuine research flavor and objective, numerically-verifiable outputs in the benchmark materials. Our expertise in designing research pipelines in various fields, coupled with proficiency in Python, Bash, and Docker, ensures a seamless execution. I’d love to discuss your project further and explore how we can contribute to this exciting endeavor. Regards, Melgard.
$150 USD in 7 days
0.0

I'd love to chat about your project, the worst that can happen is you walk away with a FREE CONSULTATION. We've recently helped a client achieve a similar goal by creating a comprehensive research task package for a benchmarking project. We'll help you design a robust STEM AI benchmark that accurately measures the performance of frontier language models in scientific workflows. Our expertise lies in creating user-friendly, authentic research pipelines with reproducible results. Based on your project description, we understand the need for a clean, professional, and reproducible task package. Our skills in Python, Bash, and Docker ensure a seamless integration of all components required for the benchmark. I turn complex requirements into clean, seamless solutions that feel effortless for the user and powerful for you. Regards, Clinton.
$100 USD in 7 days
0.0

I've recently worked on projects involving creating authentic research pipelines in various fields like data science, automation systems, and machine learning, focusing on delivering genuine user experiences and robust functionality. I can help you achieve a benchmark that challenges the best language models by crafting a self-contained research task with multi-step reasoning and reproducible solutions, meeting the objective success criteria. Success will be a task package with a Docker environment, reproducible solution, test suite, and metadata for seamless grading. One specific detail I noted is the need for multi-step logic and objective verifiable outputs, ensuring no LLM-generated content. My approach involves clear communication, structured execution, and proactive problem-solving to meet the acceptance criteria. I may not have reviews yet, but your project will receive my full attention, ensuring greater care for details and reliable delivery. I'd be happy to discuss further and share ideas to enhance the project outcome. Regards, stPatrickMoloedi
$150 USD in 7 days
0.0

This aligns perfectly with my skill set. I understand the need for a robust STEM AI benchmark that feels like genuine scientific workflow: clean, professional, and reproducible. With my expertise in designing research pipelines and proficiency in Python, Bash, and Docker, I can deliver seamlessly integrated tasks that challenge even the best models. While I am new to Freelancer, I have tons of experience and have done other projects off-site. I would love to chat more about your project! Regards, Warrick Van Eeden
$100 USD in 7 days
0.0

I am highly motivated, detail-oriented, and committed to delivering accurate, high-quality work. I have strong analytical skills and experience working with data-related tasks, including data collection, review, and quality assurance. I am a fast learner, follow instructions carefully, and consistently meet deadlines. My ability to maintain accuracy while handling large volumes of information makes me a strong candidate for this project. I am eager to contribute to the success of your team and provide reliable results that meet project requirements.
$140 USD in 7 days
0.0

Taichung, Poland
Member since Jan 8, 2026