
AI-era talent, evaluated in a new way

Don't measure tool fluency; measure the thinking: how someone explains a problem to AI, validates the output, and chooses the right tool. Beyond the deliverable, how they solved it is captured automatically, so reviewers can see the whole process at a glance.

Evaluation areas: 5
Task time: 60–180 min
Auto-collected: 100%

[Screenshots: Probe candidate workspace, from task start through web-IDE solving, submit preview, and completion]

Looking only at the deliverable isn't enough anymore

Anyone can produce output with AI now. The same deliverable is worth different amounts depending on how it was made. Evaluation has to see the difference.

The deliverable alone won't tell you

From the final output alone, you can't distinguish an AI-only result from one a person crafted with AI.

Tool fluency ≠ ability

A heavy Cursor user isn't necessarily a strong collaborator. The real signal is the thinking: decomposing a problem and validating the result.

Fairness gets shaky

If having a paid subscription or premium tooling moves the needle, you'll miss high-potential candidates.

Candidates solve; reviewers see the process

Candidates get a smooth task environment; reviewers get the deliverable plus a timeline on one screen. Neither side has to write extra narrative.

  1. Step 1 · Task starts

    Open the browser IDE and start solving. Built-in chat and terminal, plus top models provided free.

    Web IDE · Built-in AI models · Connect external tools via MCP
  2. Step 2 · Process auto-captured

    Prompts, tool switches, test runs, and git diffs are all recorded on the timeline. No write-up is required from the candidate (see the event sketch after these steps).

    Prompt log · Tool switching · Test results · git diff
  3. Step 3 · Review

    The deliverable, auto-collected signals across the 5 areas, and qualitative notes, all on one screen. The final pass/fail call stays with your team.

    Timeline view · 5-area scoring · Tags & notes
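
To make the capture concrete, here is a minimal sketch of what a single timeline event could look like, assuming a simple discriminated-union shape; the event kinds and field names are our illustration, not the platform's actual schema.

```ts
// Minimal sketch: event kinds and fields are illustrative assumptions,
// not the platform's real capture schema.
type TimelineEvent =
  | { kind: "prompt"; at: string; model: string; text: string }
  | { kind: "tool_switch"; at: string; from: string; to: string }
  | { kind: "test_run"; at: string; passed: number; failed: number }
  | { kind: "git_diff"; at: string; filesChanged: number; insertions: number; deletions: number };

// The reviewer-facing timeline is then just these events in chronological order.
const timeline: TimelineEvent[] = [
  { kind: "prompt", at: "2025-06-02T10:02:11Z", model: "claude", text: "Split the task into sub-problems first." },
  { kind: "tool_switch", at: "2025-06-02T10:09:40Z", from: "chat", to: "terminal" },
  { kind: "test_run", at: "2025-06-02T10:14:03Z", passed: 11, failed: 1 },
];
```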

Five areas of AI evaluation

For each area, we map which signals an evaluator can read from automatic capture and which need a human reviewer.

Human review · high

Problem framing

Tracks how requirements are decomposed in the initial prompt and how sub-questions are split off.

  • # of decomposition steps
  • Sub-question split pattern
  • Whether constraints/assumptions are stated
Human review · medium

Prompt design

Whether constraints and an output format are specified, context is attached, and a system prompt is used.

  • Output format specified
  • Context attached
  • System prompt usage
Human review · high

Validation · critical thinking

How often the candidate follows up, how much AI output is edited (including before pasting it in), and whether tests are re-run.

  • Follow-up rate
  • AI output edit count
  • Edits before paste
  • Test re-runs
Human review · medium

Tool use efficiency

Which models are chosen when, how often tools are switched, and how complete the result is relative to the prompt count.

  • Model choice patterns
  • Tool switch frequency
  • Completeness vs. prompts
Human review · full

Final output

The git diff, test pass rate, and result file are read directly; there is no automatic scoring.

  • git diff
  • Test pass rate
  • Result file
Reviewer tool

Timeline integration

Signals from all 5 areas in chronological order on one screen. Click an event to jump to the original prompt, diff, or log. (A sketch of a per-area review record follows this list.)

  • Event → original jump
  • 1–5 scoring per area
  • Tags & notes
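
To show how the pieces could connect, here is a minimal sketch of a per-area review record, under the same assumptions as the event sketch above; the area identifiers and field names are ours, not the product's API.

```ts
// Minimal sketch: area identifiers and fields are illustrative assumptions.
type Area =
  | "problem_framing"
  | "prompt_design"
  | "validation"
  | "tool_use_efficiency"
  | "final_output";

interface AreaReview {
  area: Area;
  score: 1 | 2 | 3 | 4 | 5; // the 1–5 score a reviewer assigns per area
  tags: string[];           // free-form tags, e.g. "states-assumptions"
  note?: string;            // optional qualitative note
  evidence: string[];       // timeline event IDs the score links back to
}

const review: AreaReview = {
  area: "validation",
  score: 4,
  tags: ["re-runs-tests"],
  note: "Edited AI output before pasting; re-ran tests after each change.",
  evidence: ["evt_0042", "evt_0047"],
};
```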

Same starting line for everyone, and we collect only what's visible

The platform provides top Claude, OpenAI, and Gemini models for free. Whether a candidate has a paid subscription doesn't change the outcome. Before submission, candidates see exactly what records will be sent and can opt out per item.
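
As a minimal sketch of how that pre-submit preview could be represented, assuming one entry per collected record type (the item names and flags are our assumptions, not the actual payload):

```ts
// Minimal sketch: item names and flags are illustrative assumptions.
interface DisclosureItem {
  label: "Prompt log" | "Tool switching" | "Test results" | "git diff";
  preview: string;   // summary of what reviewers would receive for this item
  include: boolean;  // the candidate can flip this off to opt out per item
}

// Nothing is sent until the candidate confirms this list; per the page,
// records are auto-deleted after 90 days.
const submission: DisclosureItem[] = [
  { label: "Prompt log", preview: "42 prompts, 3 models", include: true },
  { label: "Test results", preview: "12 runs, last: 11 passed / 1 failed", include: true },
  { label: "git diff", preview: "+310 / -95 across 7 files", include: false },
];
```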

Models provided: Free
Pre-submit preview: 100%
Auto-deletion: 90 days
  1. Candidate A · No paid subscription
    Claude · GPT-5 · Gemini Pro
    Same environment
  2. Candidate B · Cursor Pro user
    Claude · GPT-5 · Gemini Pro
    Same environment
  3. Candidate C · First time with AI tools
    Claude · GPT-5 · Gemini Pro
    Same environment

Same models, same token quota for everyone. Personal subscriptions don't affect evaluation.