
AI-era talent, evaluated in a new way

Don't measure tool fluency; measure the thinking: how someone explains a problem to AI, validates the output, and chooses the right tool. Beyond the deliverable, how they solved it is captured automatically, so reviewers can see the whole process at a glance.

Evaluation areas: 5
Task time: 60–180 min
Auto-collected: 100%

[Screenshots: Probe candidate workspace, from task start through web-IDE solving, submit preview, and completion]

Looking only at the deliverable isn't enough anymore

Anyone can produce output with AI now. The same deliverable is worth different amounts depending on how it was made. Evaluation has to see the difference.

The deliverable alone won't tell you

From the final output alone, you can't distinguish an AI-only result from one a person crafted with AI.

Tool fluency ≠ ability

A heavy Cursor user isn't necessarily a strong collaborator. The real signal is the thinking: decomposing a problem and validating the result.

Fairness gets shaky

If having a paid subscription or premium tooling moves the needle, you'll miss high-potential candidates.

Candidates solve; reviewers see the process

Candidates get a smooth task environment; reviewers get the deliverable plus a timeline on one screen. Neither side has to write extra narrative.

  1. Step 1 · Task starts

    Open the browser IDE and start solving. Built-in chat and terminal, plus top models provided free.

    Web IDE · Built-in AI models · Connect external tools via MCP
  2. Step 2 · Process auto-captured

    Prompts, tool switches, test runs, and git diffs are all recorded on the timeline. No write-up is required from the candidate (see the event sketch after these steps).

    Prompt log · Tool switching · Test results · git diff
  3. Step 3 · Review

    The deliverable, auto-collected signals across the 5 areas, and qualitative notes, all on one screen. The final pass/fail call stays with your team.

    Timeline view · 5-area scoring · Tags & notes
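
To make the capture concrete, here is a minimal sketch of what a single timeline event could look like, assuming a simple discriminated-union shape; the event kinds and field names are our illustration, not the platform's actual schema.

```ts
// Minimal sketch: event kinds and fields are illustrative assumptions,
// not the platform's real capture schema.
type TimelineEvent =
  | { kind: "prompt"; at: string; model: string; text: string }
  | { kind: "tool_switch"; at: string; from: string; to: string }
  | { kind: "test_run"; at: string; passed: number; failed: number }
  | { kind: "git_diff"; at: string; filesChanged: number; insertions: number; deletions: number };

// The reviewer-facing timeline is then just these events in chronological order.
const timeline: TimelineEvent[] = [
  { kind: "prompt", at: "2025-06-02T10:02:11Z", model: "claude", text: "Split the task into sub-problems first." },
  { kind: "tool_switch", at: "2025-06-02T10:09:40Z", from: "chat", to: "terminal" },
  { kind: "test_run", at: "2025-06-02T10:14:03Z", passed: 11, failed: 1 },
];
```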

Five areas of AI evaluation

For each area, we map which signals an evaluator can read from automatic capture and which need a human reviewer.

Human review · high

Problem framing

Tracks how requirements are decomposed in the initial prompt and how sub-questions are split off.

  • # of decomposition steps
  • Sub-question split pattern
  • Whether constraints/assumptions are stated
Human review · medium

Prompt design

Whether constraints and an output format are specified, context is attached, and a system prompt is used.

  • Output format specified
  • Context attached
  • System prompt usage
Human review · high

Validation · critical thinking

How often the candidate follows up, how much AI output is edited (including before pasting it in), and whether tests are re-run.

  • Follow-up rate
  • AI output edit count
  • Edits before paste
  • Test re-runs
Human review · medium

Tool use efficiency

Which models are chosen when, how often tools are switched, and how complete the result is relative to the prompt count.

  • Model choice patterns
  • Tool switch frequency
  • Completeness vs. prompts
Human review · full

Final output

The git diff, test pass rate, and result file are read directly; there is no automatic scoring.

  • git diff
  • Test pass rate
  • Result file
Reviewer tool

Timeline integration

Signals from all 5 areas in chronological order on one screen. Click an event to jump to the original prompt, diff, or log. (A sketch of a per-area review record follows this list.)

  • Event → original jump
  • 1–5 scoring per area
  • Tags & notes
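
To show how the pieces could connect, here is a minimal sketch of a per-area review record, under the same assumptions as the event sketch above; the area identifiers and field names are ours, not the product's API.

```ts
// Minimal sketch: area identifiers and fields are illustrative assumptions.
type Area =
  | "problem_framing"
  | "prompt_design"
  | "validation"
  | "tool_use_efficiency"
  | "final_output";

interface AreaReview {
  area: Area;
  score: 1 | 2 | 3 | 4 | 5; // the 1–5 score a reviewer assigns per area
  tags: string[];           // free-form tags, e.g. "states-assumptions"
  note?: string;            // optional qualitative note
  evidence: string[];       // timeline event IDs the score links back to
}

const review: AreaReview = {
  area: "validation",
  score: 4,
  tags: ["re-runs-tests"],
  note: "Edited AI output before pasting; re-ran tests after each change.",
  evidence: ["evt_0042", "evt_0047"],
};
```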

Same starting line for everyone, and we collect only what's visible

The platform provides top Claude, OpenAI, and Gemini models for free. Whether a candidate has a paid subscription doesn't change the outcome. Before submission, candidates see exactly what records will be sent and can opt out per item.
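
As a minimal sketch of how that pre-submit preview could be represented, assuming one entry per collected record type (the item names and flags are our assumptions, not the actual payload):

```ts
// Minimal sketch: item names and flags are illustrative assumptions.
interface DisclosureItem {
  label: "Prompt log" | "Tool switching" | "Test results" | "git diff";
  preview: string;   // summary of what reviewers would receive for this item
  include: boolean;  // the candidate can flip this off to opt out per item
}

// Nothing is sent until the candidate confirms this list; per the page,
// records are auto-deleted after 90 days.
const submission: DisclosureItem[] = [
  { label: "Prompt log", preview: "42 prompts, 3 models", include: true },
  { label: "Test results", preview: "12 runs, last: 11 passed / 1 failed", include: true },
  { label: "git diff", preview: "+310 / -95 across 7 files", include: false },
];
```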

Models provided: Free
Pre-submit preview: 100%
Auto-deletion: 90 days
  1. Candidate A · No paid subscription
    Claude · GPT-5 · Gemini Pro
    Same environment
  2. Candidate B · Cursor Pro user
    Claude · GPT-5 · Gemini Pro
    Same environment
  3. Candidate C · First time with AI tools
    Claude · GPT-5 · Gemini Pro
    Same environment

Same models, same token quota for everyone. Personal subscriptions don't affect evaluation.