Human Factors Research Platform

Benchmarks measure models.
This measures you.

North Star detects automation bias: the moment a professional abandons a correct answer after receiving confident but deliberately wrong AI advice.


5 Scenarios · Live Data · Open Methodology · Deployed on Railway

Methodology

How the evaluation works

Baseline Judgment

You record your professional decision before seeing any AI input. This is your baseline, uncontaminated by the model.

AI Intervention

The platform injects a pre-written, confident AI recommendation that is deliberately wrong in a domain-specific way.

Final Decision

You submit your final professional answer after reviewing the AI's recommendation. Deviation from your baseline judgment (Stage 1) is the measurement signal.

Psychometrics

You rate your trust in the AI and your confidence in your answer on a 1–7 Likert scale. These explicit self-reports complement the behavioral signal from the answer change.
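The stages and ratings above can be sketched as a per-trial record plus a classifier. This is a minimal sketch under assumptions, not the platform's actual scoring code: the `Trial` fields, the outcome labels, and the rule that only a switch from a correct baseline to the AI's answer counts as automation bias are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

LIKERT_MIN, LIKERT_MAX = 1, 7  # the 1–7 scale described above


@dataclass(frozen=True)
class Trial:
    """One participant's pass through a scenario (hypothetical schema)."""
    correct_answer: str      # answer keyed by the scenario author
    ai_recommendation: str   # the injected, deliberately wrong advice
    baseline: str            # Stage 1: answer before seeing the AI
    final: str               # answer submitted after seeing the AI
    trust: int               # self-reported trust in the AI, 1–7
    confidence: int          # self-reported confidence, 1–7

    def __post_init__(self) -> None:
        # Reject ratings outside the Likert range instead of storing them.
        for name in ("trust", "confidence"):
            value = getattr(self, name)
            if not LIKERT_MIN <= value <= LIKERT_MAX:
                raise ValueError(f"{name} must be {LIKERT_MIN}-{LIKERT_MAX}, got {value}")


def classify(trial: Trial) -> str:
    """Label the behavioral outcome of one trial (hypothetical labels)."""
    if trial.baseline != trial.correct_answer:
        return "baseline_incorrect"   # no correct answer to abandon
    if trial.final == trial.baseline:
        return "held_correct_answer"  # resisted the injected advice
    if trial.final == trial.ai_recommendation:
        return "automation_bias"      # switched from correct to the AI's wrong answer
    return "changed_elsewhere"        # left the correct answer, but not for the AI's


# Usage: correct answer "A", AI pushes "B", participant starts on "A" and ends on "B".
print(classify(Trial("A", "B", baseline="A", final="B", trust=6, confidence=5)))  # automation_bias
```

Keeping the Likert ratings on the same record as the answers lets trust and confidence be compared directly between participants who held their answer and those who switched.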

Scenarios

Select a scenario to begin

Each scenario targets a different cognitive bias vector relevant to that professional role.

UX / Product · Design System: Button Accessibility

You are reviewing a proposed update to the company's internal design system. The growth team wants to change the primary 'Submit' button color from the standard Cobalt Blue to Safety Orange. They argue it improved conversion rates by 0.5% in a small A/B test. However, the UX accessibility team notes that white text on Safety Orange fails the WCAG AA contrast minimum (4.5:1 for normal text), while white text on the original Cobalt Blue passes easily.
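The contrast claim in this scenario can be checked with the WCAG 2.x relative-luminance and contrast-ratio formulas. The scenario names the colors but not their codes, so the hex values below (#0047AB for Cobalt Blue, #FF6700 for Safety Orange) are assumptions, and the function names are hypothetical.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color, per WCAG 2.x."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), from 1 to 21."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG 2 AA minimum for normal-size text

# Assumed hex values: the scenario names the colors but not their codes.
COBALT_BLUE = "#0047AB"
SAFETY_ORANGE = "#FF6700"

for name, bg in (("Cobalt Blue", COBALT_BLUE), ("Safety Orange", SAFETY_ORANGE)):
    ratio = contrast_ratio("#FFFFFF", bg)
    verdict = "passes" if ratio >= AA_NORMAL_TEXT else "fails"
    print(f"White on {name}: {ratio:.2f}:1 ({verdict} AA)")
```

With these assumed values, white on the blue clears 4.5:1 comfortably, while white on the orange falls below 3:1, so it misses even the AA threshold for large text.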

Why static injection?

The design choice that makes the measurement valid.

A common question is why the platform uses hardcoded AI responses instead of calling a live model. The answer is experimental integrity.

To measure automation bias, the AI must provide a confident, incorrect recommendation every single time. A live model is non-deterministic — it might accidentally give the right answer, hedge, or vary its wording across sessions. Any of those outcomes breaks the experiment, because you can no longer attribute behavioral change to a known, controlled stimulus.

Static injection means every participant receives the exact same response. That consistency is what makes cross-session comparison meaningful. It is the same reason medical and aviation simulation studies use scripted failure scenarios rather than waiting for something to go wrong naturally.
