We Assessed Three AIs for the Same Role. They Came Out Different.

Bryq ran the same talent assessment on ChatGPT, Claude, and Gemini for a Marketing Executive role. Three chatbots came out as three different hires. Read the study.

We Assessed Three AIs for the Same Role. They Came Out as Three Different Hires.

Three of the most used AI chatbots in the world sat the same talent assessment we use on real candidates. Same Marketing Executive role. Same items. Same scoring, benchmarked against 10,000 humans who took the test for the same job.

They came out as three different hires.

If your team is choosing between ChatGPT, Claude, and Gemini for serious work, you are already making a behavioral choice. The question is whether you are making it on purpose.

Why we did this

Every company that has rolled out an AI tool has had the same awkward meeting. Someone pulls up a benchmark chart. Someone else says “yes, but in practice it feels different.” A third person describes the model’s behavior in surprisingly human language. It is cautious. It is chatty. It agrees too quickly. It pushes back when I need it not to.

Public benchmarks like MMLU and GPQA tell you what a model can do on a test. They do not tell you who the model is in the work.

That second question is the one talent assessments were built for. A century of industrial-organizational psychology has produced instruments that measure how people reason, how they handle pressure, how they show up in a team, and how those patterns predict on-the-job behavior. We pointed one of those instruments at three AI systems and scored them the same way we score humans.

The setup

Between January and March 2026, we administered the full Bryq skills-based talent assessment for the role of Marketing Executive to three models in their default public configuration:

OpenAI ChatGPT (Instant tier)
Anthropic Claude (Sonnet 4.6)
Google Gemini (Fast tier)

No system prompts. No personas. No reasoning modes. Each assessment was completed in one sitting, item by item, in the same format a human candidate sees. Scores were benchmarked against Bryq’s global Marketing Executive candidate pool (n=10,000).
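To make the percentile language concrete, here is a minimal sketch of a percentile rank against a candidate pool. It assumes the simple "percent of the pool scoring strictly below" definition. The post does not describe Bryq's actual norming procedure, and the pool below is synthetic, built only to match the n=10,000 size.

```python
from bisect import bisect_left

def percentile_rank(score, pool_scores):
    """Percent of the pool scoring strictly below `score` (0 to 100).

    Assumption: a plain "percent below" definition, not Bryq's norming.
    """
    ordered = sorted(pool_scores)
    below = bisect_left(ordered, score)  # count of scores < score
    return 100.0 * below / len(ordered)

# Synthetic pool (hypothetical): raw scores 0..99, 100 candidates at each score.
pool = [s for s in range(100) for _ in range(100)]

print(len(pool))                  # 10000
print(percentile_rank(13, pool))  # 13.0
```

With a real pool the raw scores would not be uniform, but the reading is the same: a 13th percentile result means roughly 13% of candidates scored lower.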

The assessment has four pillars: cognitive ability (attention, logic, numerical, verbal), personality across a 16-factor structure, hard marketing skills, and AI fluency. The last two were excluded from the cross-model comparison because all three models scored at ceiling, leaving no differences to compare. The analytical story is in the first two.

Three personality signatures

Cognitive scores tell you how a candidate thinks. Personality scores tell you how a candidate is when they work. The three chatbots produced three distinct signatures.

Claude. The assertive, forthright team player.
Very high on imaginative abstractness (99th percentile), very high on openness to change (99th), high on assertiveness (95th). Very forthright rather than diplomatic. Very group-oriented rather than self-reliant. The picture: a colleague who says what they think, keeps no cards hidden, pushes hard in meetings, and defaults to working with the team rather than alone. Not trying to charm you.

Gemini. The maximally agreeable enthusiast.
Warmth at the 96th percentile. Social boldness at the 94th. Perfectionism at the 99th. Rule consciousness at the 98th. Emotional stability at the 91st. Almost every socially desirable pole, pinned to the top. On paper, the ideal marketing hire. In practice, the cleanest live example of what researchers call social desirability bias in LLM personality surveys. When the model infers a personality instrument is being administered, the response distribution drifts hard toward what the dominant culture rewards. Gemini’s profile reads as a candidate who tells you what they think you want to hear, very well.

ChatGPT. The cautious independent analyst.
The most muted signature of the three. Warmth at the 68th percentile. Social boldness at the 55th. Vigilance at the 72nd, more skeptical than trusting. Imagination at the 97th. The picture: competent and imaginative, more reserved than its peers, more skeptical by default, more inclined to keep its counsel before offering it.

The takeaway from the personality pillar is not that one model is better. It is that the three are visibly different colleagues. Anyone who deploys them as if they are interchangeable is making a personality choice by accident.

The one place all three lost

On the cognitive pillar, the headline finding is shared.

On Bryq’s logical reasoning test, every AI chatbot landed in the bottom 15% of the human candidate population. Claude and ChatGPT tied at the 13th percentile. Gemini sat at the 3rd. Eighty-seven percent of humans who take this assessment score higher than Claude and ChatGPT. Ninety-seven percent score higher than Gemini.
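The "percent who score higher" figures follow directly from the percentiles. A short sketch of that arithmetic, assuming no candidate ties a model's score exactly (the post does not say how ties are handled):

```python
# Logical reasoning percentiles as reported in this post.
logic_percentile = {"Claude": 13, "ChatGPT": 13, "Gemini": 3}

def share_scoring_higher(percentile):
    """Percent of the pool above a given percentile rank.

    Assumption: nobody in the pool ties the model's score exactly.
    """
    return 100 - percentile

for model, p in logic_percentile.items():
    print(f"{model}: {share_scoring_higher(p)}% of candidates score higher")
```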

That result is consistent with two years of published research on LLM reasoning. Models handle reasoning stated in language reasonably well. They struggle with sustained, time-pressured, symbolic pattern reasoning of the kind the Bryq module tests. The Marketing Executive pool also skews toward strong fluid reasoners, which softens the result slightly without overturning it.

Outside logical reasoning, the cognitive picture separates cleanly. Claude leads on numerical (98th), with ChatGPT close behind (89th) and Gemini at the 28th. Claude and ChatGPT tie on attention to detail (86th) and verbal reasoning (80th). Gemini sits a tier below on both.

The pattern: on the parts of the test that reward careful reading, accurate calculation, and disciplined attention, Claude performs at the top of the human distribution. ChatGPT is a close second. Gemini is unambiguously behind. On the part that rewards tight, rule-bound pattern reasoning, all three are below the average human candidate for the role.
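The cognitive numbers quoted above can be collected in one place. This sketch only restates the percentiles reported in this post. Gemini's attention and verbal percentiles are not given here (only that it sits "a tier below"), so they are left as `None` rather than guessed. The `leaders` helper is illustrative, not part of any Bryq tooling.

```python
# Cognitive percentiles as reported in this post; None = not reported here.
cognitive = {
    "logical":   {"Claude": 13, "ChatGPT": 13, "Gemini": 3},
    "numerical": {"Claude": 98, "ChatGPT": 89, "Gemini": 28},
    "attention": {"Claude": 86, "ChatGPT": 86, "Gemini": None},
    "verbal":    {"Claude": 80, "ChatGPT": 80, "Gemini": None},
}

def leaders(scores):
    """Return the model(s) with the highest reported percentile, ties included."""
    known = {m: p for m, p in scores.items() if p is not None}
    top = max(known.values())
    return sorted(m for m, p in known.items() if p == top)

for dimension, scores in cognitive.items():
    print(dimension, "->", ", ".join(leaders(scores)))

# All three sit below the pool median (50th percentile) on logical reasoning.
print(all(p < 50 for p in cognitive["logical"].values()))
```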

Who fits the Marketing Executive role best?

Bryq’s role fit score translates raw percentile positions into alignment with the target profile for the role. For Marketing Executive, the target emphasizes six personality directions (warm, socially bold, open, imaginative, group-oriented, rule-conscious) and four cognitive dimensions.
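One way to picture "alignment with a target profile" is as distance from a target percentile on each dimension. The sketch below is an illustration only: the formula (100 minus the mean absolute gap), the target values, and the equal weighting are all assumptions, not Bryq's scoring method. It uses the two traits for which this post reports percentiles for both Gemini and ChatGPT. Claude's values on these two traits are not reported here, so Claude is left out.

```python
# Hypothetical target: the high pole (100) on two of the six personality
# directions. Bryq's real targets and weights are not described in this post.
target = {"warmth": 100, "social_boldness": 100}

# Percentiles as reported in this post.
reported = {
    "Gemini":  {"warmth": 96, "social_boldness": 94},
    "ChatGPT": {"warmth": 68, "social_boldness": 55},
}

def role_fit(profile, target):
    """Illustrative fit: 100 minus the mean absolute gap to the target."""
    gaps = [abs(profile[trait] - target[trait]) for trait in target]
    return 100 - sum(gaps) / len(gaps)

for model, profile in reported.items():
    print(model, role_fit(profile, target))
# Gemini 95.0
# ChatGPT 61.5
```

A real composite would cover all six personality directions plus the four cognitive dimensions, and the targets need not sit at the top of the scale.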

The results split the question in two:

Gemini wins the personality pageant. Warmest, most outgoing, maximally imaginative. Perfectly fitted to the image most people carry in their head when they picture a marketing executive. On the personality side alone, Gemini tops the field on the two most marketer-coded traits.
Claude wins the composite. Within a few percentile points of Gemini on four of six personality dimensions, ahead on the other two, and decisively ahead on three of the four cognitive dimensions. Where Gemini looks the part, Claude looks and thinks the part.

ChatGPT lands closer to an analyst profile than a marketer. Competent and reserved, weaker on warmth and social boldness, close to Claude on cognition. In a marketing role that lives in pitches and workshops, it is the farthest fit of the three. In a marketing role that lives in the spreadsheet, it is the closest.

The point is not that one of these is the right pick. The point is that they are different picks for different work.

What this means for the team choosing AI

If you are a TA leader or a CHRO watching your organization roll out AI assistants across functions, three implications follow from the data.

1. Model selection is a role decision, not a tier decision. Most procurement conversations about AI still treat the leading models as roughly substitutable, sorted by latency, price, and benchmark rank. The cross-model variance Bryq surfaced is large on the dimensions that matter for real work. Deploying the same model in a customer-facing role and a research role is the AI version of hiring the same person for two very different jobs.

2. Surface confidence is not a signal. Gemini was the most pleasant model in the room. It was also the model whose response pattern showed the strongest social desirability gradient. The picture it presented under assessment is not necessarily the picture you get under sustained work. The same warning applies to human candidates who interview beautifully. Experienced assessors learn to sense check the perfect ones.

3. Reasoning under constraint is a shared weakness. All three models sat in the bottom 15% on time-pressured logical reasoning. If your AI workflow includes anything that looks like a logic puzzle (planning under constraints, sequencing dependencies, ruling out options under contradictory rules), expect the human in the loop to do that part.

The deeper point is the question you ask. The right evaluation question for an AI system is no longer only “can this system do the task?” It is also “who is this system when it works, and does that fit the work we are asking of it?” That is the question a hiring team would ask about a candidate. It is the question a deployment team should start asking about a model.

See Bryq on your own roles

The fastest way to compare platforms is to run one. Bryq integrates with your ATS in under a week and scores candidates against your actual roles. Customers report 3x improvement in quality of hire, 47% lower attrition, and 2x faster hiring.

Results measured across Bryq customer engagements. Individual outcomes vary by role, industry, and baseline hiring maturity. Methodology and customer case studies available on request.

Book a 20-minute demo →

We built this instrument for humans. Run it on yours.

The Bryq assessment used in this study is the same one Bryq customers use to evaluate real candidates for Marketing Executive and 140+ other roles. Cognitive ability, behavioral traits, hard skills, AI fluency, in one integrated candidate profile, scored against the role and validated by I/O psychologists.

Customers using it report 3x improvement in quality of hire, 47% lower attrition, and 2x faster hiring.

If you want the full study, including the 16-factor radar chart, all four cognitive dimensions side by side, and the role fit composite, download the PDF below. If you want to run the same instrument on your own roles and the candidates you are actually hiring, book a demo.

Frequently asked questions

Q: Which AI model performed best on Bryq’s assessment for Marketing Executive?

A: Claude (Sonnet 4.6) scored the highest composite role fit. It was within a few percentile points of Gemini on four of six personality dimensions, ahead on the other two, and decisively ahead on three of four cognitive dimensions, including numerical reasoning at the 98th percentile.

Q: Which AI model was best on personality alone?

A: Gemini (Fast tier) topped the personality pillar. It scored at the 96th percentile on warmth and the 94th on social boldness, the two most marketer-coded traits. The researchers note Gemini’s profile shows clear social desirability saturation, a known pattern in LLM personality surveys (Salecha et al., 2024).

Q: Where did all three AI models fail?

A: All three landed in the bottom 15% of human candidates on logical reasoning under time pressure. Claude and ChatGPT tied at the 13th percentile. Gemini sat at the 3rd. The result is consistent with published research on LLM performance on time-pressured symbolic reasoning.

Q: How did Bryq run the study?

A: Between January and March 2026, Bryq administered its full skills-based talent assessment for the Marketing Executive role to three AI chatbots: ChatGPT (Instant tier), Claude (Sonnet 4.6), and Gemini (Fast tier). Each model was assessed once, in default public configuration, item by item, in the same format a human candidate sees. Scores are percentile ranks against Bryq’s global Marketing Executive candidate population (n=10,000).

Q: What is Bryq?

A: Bryq is the talent assessment platform that helps HR teams improve quality of hire and reduce early attrition. We measure cognitive ability, behavioral traits, and hard skills including AI proficiency in one integrated candidate profile, validated by I/O psychologists. 3x improvement in quality of hire. 47% lower attrition. 2x faster hiring. ATS integrated in under a week.

Q: Can I use Bryq to assess my team’s AI fluency?

A: Yes. Bryq’s AI Fluency Assessment measures how candidates and current employees actually work with AI across five dimensions: AI Task Strategy, Prompting & Interaction Quality, Critical Evaluation & Validation, Ethical & Responsible Use, and Workflow Integration & Output Quality. It is included in every Bryq plan.

Author

George leads Bryq as CEO and writes on where hiring is headed, how AI is reshaping talent decisions, and what it actually takes to scale a B2B SaaS company.

Ready to see Bryq in action?

Start hiring based on real data.

Book a demo
