AI evaluation grounded in psychometric methodology. Knowledge engineering informed by cognitive science.
The AI industry has an evaluation problem, and it doesn't know it — because the people building the evaluations have never studied measurement.
AI evaluation today is built almost entirely by software engineers and computer scientists. They've brought the tools they know: unit tests, benchmarks, pass/fail assertions, leaderboard rankings. And these tools work — for software. But AI systems aren't software in the traditional sense. When an LLM extracts entities from a news article, when a model classifies sentiment, when a system generates a clinical recommendation — these aren't deterministic operations with correct outputs. They're acts of judgment. And judgment is a measurement problem.
The blind spot has a straightforward explanation: the AI world was built by software engineers, computer scientists, and network administrators, so evaluation gets treated as a software testing problem by default. Most practitioners have never taken Psych 101, let alone encountered psychometrics. Even the linguists who make it into the field tend to be more mathematicians than behavioral scientists.
Psychometrics — the science of psychological measurement — has spent over a century developing rigorous frameworks for exactly this class of problem. Not in theory. In practice, at enormous scale, with real consequences: college admissions, clinical diagnoses, personnel selection, criminal risk assessment. When the question is "how do we know this instrument measures what we think it measures, reliably, fairly, and with quantifiable confidence?" — psychometrics wrote the book. Several books.
The parallels are not subtle. When an AI team asks "does this benchmark actually test reasoning?" — they're asking about construct validity, a question psychometrics formalized before computers existed. When scores fluctuate between runs and nobody knows why, that's a reliability problem with well-established measurement frameworks behind it. When a model performs worse on certain demographics, psychometrics has a name for that — Differential Item Functioning — and decades of methodology for detecting and correcting it. When teams struggle to determine which test cases in a massive evaluation suite actually carry signal, Item Response Theory has been answering that question since the 1960s.
These aren't analogies. They're the same problems, discovered independently by an industry that doesn't know the prior art exists.
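To make one of those parallels concrete, here is what "which test cases carry signal" looks like in Item Response Theory terms. Under the two-parameter logistic (2PL) model, each item's Fisher information quantifies how much it discriminates at a given ability level. This is a minimal illustrative sketch (the function names and parameter values are mine, not tied to any particular benchmark):

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that a test-taker at
    ability theta answers correctly an item with discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P * (1 - P). It peaks where P = 0.5,
    i.e. where theta equals the item's difficulty b."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Two items of equal difficulty (b = 0), evaluated at a matched
# ability level. The high-discrimination item carries far more
# signal than the flat one — the IRT answer to "which test cases
# in this suite are actually doing the measuring?"
sharp = item_information(0.0, a=2.0, b=0.0)   # a^2 * 0.25 = 1.0
flat = item_information(0.0, a=0.5, b=0.0)    # a^2 * 0.25 = 0.0625
```

Items with near-zero information at every ability level of interest are dead weight in an evaluation suite; the same calculation, run over a fitted model, tells you which ones to keep.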
The industry has been here before. When software discovered Agile, it felt like a revolution — standups, self-organizing teams, iterative delivery. Organizational psychologists recognized it immediately. Self-managing work groups had been studied since the 1950s at the Tavistock Institute. The principles weren't new; the vocabulary was. The industry found what it needed, adopted it, and moved forward. That's fine. But the adoption was slower and rougher than it had to be, because nobody looked sideways at a field that had already done the foundational work.
AI evaluation is at that same inflection point. The frameworks exist. The methodology is proven. The question is whether the industry will spend another decade reinventing them from scratch, or build on what's already there.
Noometric exists to bridge that gap — to bring psychometric rigor to AI evaluation and cognitive science to knowledge engineering. Not as an academic exercise, but as applied practice: real evaluations, on real systems, producing real measurements that mean something.
One current project is an NLP pipeline that ingests political news, extracts entities, and analyzes sourcing patterns across government branches. Its real purpose is as a living testbed for AI evaluation methodology — entity extraction evaluation, bias auditing, and model comparison all run against real-world data, not sanitized benchmarks.
Another is a neuro-symbolic reasoning system that maps extracted entities and claims onto a formal ontology and uses symbolic reasoning to flag contradictions in the knowledge structure. It is, in effect, an automated investigative evidence board — neural networks extract, symbolic logic reasons, and the knowledge graph holds it all together.
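The division of labor can be sketched in a few lines. In this toy example (illustrative only — the predicates, axioms, and data are invented, not the project's actual ontology), upstream extraction produces subject–predicate–object triples, and a symbolic disjointness rule flags pairs of claims that cannot both hold:

```python
# Claims extracted by an upstream neural model, stored as triples.
facts = {
    ("Sen. Doe", "voted_for", "Bill 42"),
    ("Sen. Doe", "voted_against", "Bill 42"),
    ("Rep. Roe", "voted_for", "Bill 42"),
}

# Ontology-level disjointness axiom: one subject cannot hold both
# predicates toward the same object.
disjoint = {("voted_for", "voted_against")}

def contradictions(facts, disjoint):
    """Return every (subject, predicate_1, predicate_2, object) tuple
    where the knowledge graph asserts two disjoint predicates."""
    found = []
    for (s, p, o) in facts:
        for (p1, p2) in disjoint:
            if p == p1 and (s, p2, o) in facts:
                found.append((s, p1, p2, o))
    return found

# Flags Sen. Doe, whose extracted claims conflict; Rep. Roe is consistent.
conflicts = contradictions(facts, disjoint)
```

The neural side is free to be noisy and probabilistic; the symbolic side stays small, auditable, and exhaustive — which is exactly the property you want when the output is an evidence board rather than a prediction.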
The question I get most is "how did you end up here?" — as if psychometrics and AI evaluation occupy different planets. They don't. They occupy the same planet. One just got there first.
I spent 15+ years as a test engineer building automation frameworks, performance testing systems, and CI/CD quality gates across business analytics, healthcare, network security, IoT, and CRM. I have a degree in Experimental Psychology from CSULB and a Health Informatics certification from UC Davis. For most of my career, the psychology background was a footnote — interesting at parties, irrelevant on the job.
Then the industry started trying to evaluate AI systems, and suddenly everything I learned about measurement, cognition, and human judgment became the most relevant thing on my resume.
Noometric is where those two threads converge. The engineering discipline to build evaluation systems that work in production. The behavioral science training to make sure they measure what they claim to measure.
Interested in working together or want to talk about AI evaluation?