Robot Dexterity Has a Benchmarking Problem

Robot dexterity may be advancing rapidly, but the field still lacks the common yardsticks needed to measure progress.

Jun 05, 2026

"If you can't measure it, you can't improve it."— Peter Drucker

Robot dexterity is having a moment. Manipulators are folding laundry, sorting parcels, picking fragile tomatoes off a line, and handing tools to workers on factory floors. Yet for all the impressive demos, one basic question remains surprisingly difficult to answer: which robot is actually better?

Ask two labs whether their robots are good at any of these tasks, and you quickly run into a problem that has nothing to do with the robots themselves. The two teams may not mean the same thing by “grasp,” “success,” or even “dexterity.” They almost certainly did not test under conditions anyone else could rebuild. And neither result necessarily tells you which system performed better.

Closing that gap is its own kind of work. It happens long before any standard gets written. It is the slow, unglamorous business of getting a community to agree on what the words mean, what a fair test looks like, and how anyone is supposed to reproduce anyone else’s result.

Increasingly, that work is being described as pre-standardization—the process of building shared definitions, test methods, data formats, and benchmarks before a formal standard can even be written.

To see what that process actually involves, it helps to watch a field in the middle of it. One such effort is the COMPARE Ecosystem, a National Science Foundation-backed community organized around robot manipulation. It offers a useful case study because it has spent the last few years doing exactly this, and it recently gathered contributors from across industry, academia, government, and standards organizations to work through the challenges together.

Adam Norton, co-Director of the NERVE Center, leads the COMPARE group at ICRA 2026 in Vienna

The three-legged stool: reproducibility, benchmarking, and standards

One of the most useful ideas to emerge from that gathering was a simple way of understanding how three concepts depend on one another. They are often treated as separate concerns. In practice, they are not.

Start with reproducibility. The everyday rhythm of research is that one lab publishes a result—we ran this protocol with this setup and got these numbers—and another lab tries to reproduce it, then builds on top of it. Reproduce, leverage, advance.

But reproducibility is not one thing; it comes in layers.

First, there is the context: can another team assemble the same physical setup, hardware, software environment, and task conditions? Then there is functionality: can they get the pipeline to run and the system to behave as intended? Only then comes the most difficult layer—reproducing the actual performance, the numbers themselves.
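One way to picture these layers is as a ladder in which each rung only counts once the rung beneath it holds. The sketch below is purely illustrative: the layer names, the `ReproductionAttempt` fields, and the 0.05 tolerance are assumptions made for this example, not definitions adopted by COMPARE or any standards body.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    """Hypothetical labels for the three layers of reproducibility."""
    NONE = 0
    CONTEXT = 1        # same setup, hardware, software, task conditions
    FUNCTIONALITY = 2  # pipeline runs and the system behaves as intended
    PERFORMANCE = 3    # the reported numbers are matched


@dataclass
class ReproductionAttempt:
    context_rebuilt: bool
    pipeline_ran: bool
    original_success_rate: float
    reproduced_success_rate: float
    tolerance: float = 0.05  # assumed acceptance band, not a community value


def highest_layer_reached(attempt: ReproductionAttempt) -> Layer:
    """Return the highest layer reached; each layer gates the next."""
    if not attempt.context_rebuilt:
        return Layer.NONE
    if not attempt.pipeline_ran:
        return Layer.CONTEXT
    gap = abs(attempt.original_success_rate - attempt.reproduced_success_rate)
    if gap > attempt.tolerance:
        return Layer.FUNCTIONALITY
    return Layer.PERFORMANCE


# A lab that rebuilt the setup and ran the pipeline, but landed far from
# the published number, has only reached the functionality layer.
attempt = ReproductionAttempt(True, True, 0.90, 0.70)
print(highest_layer_reached(attempt).name)
```

The point of the ordering is that a matching number means little if the setup behind it was not actually the same.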

Benchmarking sits on top of reproducibility, and here is the dependency many people miss: you cannot meaningfully benchmark against something you cannot reproduce. A leaderboard built on results no one else can recreate is ultimately just a list of claims. If you want fair benchmarking, you first need reproducibility.

And reproducibility, in turn, needs standards. It is not enough to post code on GitHub and hardware files in a repository and declare the work reproducible. Anyone who has attempted to rebuild a research result knows it rarely works that way. You need agreed data formats, performance thresholds, and benchmarking protocols that specify how a test is set up, executed, and scored. Standards are what make those assets usable by someone who was not part of the original lab.
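As a rough illustration of what "set up, executed, and scored" can mean once it is written down, the sketch below encodes a protocol as data and scores results against it. Every field name, the object-set label, and the scoring rule are hypothetical choices made for this example; no existing protocol is being quoted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Protocol:
    """Hypothetical protocol record covering setup, execution, and scoring."""
    name: str
    object_set: str               # setup: which shared objects are used
    trials_per_object: int        # execution: how many attempts per object
    max_seconds_per_trial: float  # execution: time limit per attempt
    success_criterion: str        # scoring: what counts as a success


def score(protocol: Protocol, outcomes: Dict[str, List[bool]]) -> float:
    """Return the overall success rate, rejecting runs that broke protocol."""
    if not outcomes:
        raise ValueError("no outcomes reported")
    for obj, trials in outcomes.items():
        if len(trials) != protocol.trials_per_object:
            raise ValueError(
                f"{obj}: expected {protocol.trials_per_object} trials, "
                f"got {len(trials)}"
            )
    total = sum(len(trials) for trials in outcomes.values())
    successes = sum(sum(trials) for trials in outcomes.values())
    return successes / total


protocol = Protocol(
    name="example-pick-and-hold",
    object_set="example-object-set-v1",
    trials_per_object=4,
    max_seconds_per_trial=30.0,
    success_criterion="object held above the table for 5 seconds",
)
results = {
    "mug": [True, True, False, True],
    "bolt": [True, False, False, True],
}
print(score(protocol, results))  # 5 successes out of 8 trials -> 0.625
```

Because the trial count is part of the protocol, a lab that ran a different number of attempts gets an error instead of a score that only looks comparable.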

Put the three together, and they form a stool. Standards enable reproducibility. Reproducibility enables benchmarking. Benchmarking is what finally allows two groups to compare results honestly.

Remove any leg, and the entire structure falls over. Standards without reproducibility become paperwork. Reproducibility without benchmarking becomes documentation. Benchmarking without either becomes marketing.

This is the architecture underneath the phrase pre-standardization—the recognition that you cannot jump straight to a standard because a standard is ultimately the codified endpoint of a chain that must first be built from the ground up.

COMPARE team members break into smaller groups by topic

The tools exist—but they weren’t built for this

A reasonable objection is that the field already has plenty of shared resources.

And it does.

There are widely used object sets for manipulation research, open-source tactile sensors, robotic hands and arms, simulation environments, and the familiar reproducibility infrastructure of ROS, GitHub, and Docker. Some benchmarking efforts even attempt to run evaluations across multiple sites.

The challenge, raised repeatedly by participants in the room, is that most of these tools were built for manipulation in general, not for dexterity specifically.

They may exercise dexterous behavior incidentally, but they were not designed to isolate and measure dexterity itself. Determining what dexterity actually requires—and what hardware, tasks, and metrics can expose it—is precisely the gap the community is now trying to define.

The assets are necessary but not sufficient. An object set that can be downloaded is not the same thing as an agreed method for testing the capability researchers care about.

In the world of formal standards, the shortfall is even more apparent. While robotics has accumulated a growing collection of safety standards, there remain relatively few specifications, guides, or test methods focused specifically on manipulation performance. Much of the work needed to evaluate dexterity still lies ahead of the field rather than behind it.

Where consensus is genuinely missing

One of the clearest signals of an early-stage field is when participants independently identify the same missing piece.

At this gathering, one of those pieces was tactile sensing data.

Many groups build tactile sensors. The challenge is that they often record different raw data, measure different physical variables, and output information in entirely different formats. As a result, there is no common language for expressing what a sensor actually felt.

A funder in the room described the resulting chicken-and-egg problem plainly. Sensor developers want their technology adopted. Potential users want to incorporate tactile sensing into their systems. But because there is no shared way to represent or interpret the data, both sides struggle to move forward.

An academic center representing dozens of industry partners reported encountering the same issue from the commercial side. Many companies were producing sophisticated sensing technologies, yet each was generating data that was difficult to compare or integrate with others.

What was notable was not the disagreement. It was the convergence.

Researchers, companies, and institutions from different countries and sectors arrived at the same conclusion: before the field can meaningfully compare tactile sensing systems, it needs a common vocabulary and a common representation of tactile information.

That convergence is what pre-standardization feels like from the inside.

Nobody is asking for a finished standard yet. They are asking for the things that must exist first—shared definitions, common representations, agreed metrics, and a way to make one lab’s measurements intelligible to another.
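To make "a common representation of tactile information" slightly more concrete, here is a minimal sketch in which two sensors that report different quantities in different layouts are mapped into one shared record. Both vendor formats, all field names, and the choice of pressure in pascals as the shared quantity are assumptions invented for illustration. Real sensors measure a wider range of physical variables, and agreeing on which ones to represent is exactly the open question.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TactileFrame:
    """Hypothetical shared record: pressure in pascals, row-major grid."""
    timestamp_s: float
    rows: int
    cols: int
    pressure_pa: List[float]


def from_vendor_a(msg: dict) -> TactileFrame:
    """Assumed format A: kilopascals in a nested list, time in milliseconds."""
    grid = msg["kpa"]
    flat = [value * 1000.0 for row in grid for value in row]
    return TactileFrame(msg["t_ms"] / 1000.0, len(grid), len(grid[0]), flat)


def from_vendor_b(msg: dict) -> TactileFrame:
    """Assumed format B: newtons per taxel in a flat list, taxel area in mm^2."""
    area_m2 = msg["taxel_area_mm2"] * 1e-6  # mm^2 to m^2
    flat = [force / area_m2 for force in msg["force_n"]]
    rows, cols = msg["shape"]
    return TactileFrame(msg["stamp"], rows, cols, flat)


# Two differently shaped messages describing the same physical contact.
frame_a = from_vendor_a({"t_ms": 1500, "kpa": [[1.0, 2.0], [0.5, 0.0]]})
frame_b = from_vendor_b({
    "stamp": 1.5,
    "shape": [2, 2],
    "taxel_area_mm2": 4.0,
    "force_n": [0.004, 0.008, 0.002, 0.0],
})
print(frame_a)
print(frame_b)
```

Once both readings live in the same structure and units, comparing them becomes a question about the sensors and not about the file formats.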

A lesson that extends beyond manipulation

While the discussion focused on manipulation, the challenge is hardly unique to dexterity.

Humanoid stability, public-facing mobile robots, agricultural autonomy, human-robot interaction, and many other areas of robotics face similar questions. Before industries can compare systems, certify performance, establish procurement requirements, or develop standards, they must first agree on how performance should be measured.

Manipulation is simply one of the clearest examples of a pattern that appears across emerging robotics domains.

In many cases, the technical bottleneck is not that engineers do not know how to build the next capability. It is that the community has not yet agreed on how to evaluate it.

The shape of the work ahead

Watching a community at this stage, you can see the scaffolding it naturally reaches for.

It wants open-source hardware and software so that “the same setup” can exist in more than one building. It wants shared repositories where components and results survive beyond a single publication. It wants test facilities and round-robin evaluations because the strongest evidence that a result is real is that a different lab can reproduce it.

And eventually, it wants the involvement of standards organizations that can take community-built consensus and transform it into something durable, recognized, and broadly adopted.

That handoff is worth understanding correctly.

A standards body rarely invents a standard from scratch. More often, it provides the process and forum through which community consensus is refined, challenged, documented, and ultimately codified.

Researchers, developers, users, and industry practitioners remain the subject-matter experts. The standards process helps convert their collective experience into a framework that others can use and trust.

If the vocabulary is not settled and the benchmarks are not genuinely comparable, there is little solid to codify. The early work is not merely preparation for a future standard.

In many respects, it is the standard—just in unfinished form.

Why the boring part matters most

It is tempting to treat definitions, data formats, and test protocols as bureaucratic throat-clearing before the real engineering begins.

History suggests the opposite.

Computer vision accelerated dramatically because shared datasets and benchmarks gave thousands of researchers a common yardstick. Progress became measurable. Results became comparable. Competition became meaningful.

Manipulation has not yet reached that point. Increasingly, the people closest to the problem identify the lack of shared benchmarks and reproducible evaluation—not any single hardware limitation—as one of the field’s most significant constraints.

For researchers working on dexterity today, the practical implications are straightforward. Report setups and test conditions in enough detail that another lab can rebuild them. Adopt shared objects and protocols where they exist instead of creating private alternatives. Contribute components and results back to common repositories. Treat tactile data as something intended for sharing, not just local consumption. And resist the temptation to compare results directly with another lab’s numbers before first reproducing the underlying experiment.

The headlines will continue to celebrate robots performing increasingly impressive feats. But a mature field needs more than impressive robots. It needs a way to compare them.

The real breakthrough may not be a new hand, a new sensor, or a new foundation model. It may be the moment the community agrees on what success actually means.

That work rarely makes the front page. It looks like researchers, companies, funders, and standards experts sitting around a table arguing about definitions. Yet history suggests those conversations are often what make the next generation of breakthroughs possible.

Robot News Of The Week

Festo launches lightweight pneumatic gripper and tests GripperAI

Festo is tackling two of industrial robotics' biggest challenges—end-of-arm tooling complexity and flexible picking—with a pair of new solutions aimed at collaborative automation. Its HPPH pneumatic gripper integrates controls, sensing, and safety functions directly into the gripper body, reducing weight, wiring, and installation complexity for cobot applications while supporting force-limited operation aligned with collaborative safety requirements. Meanwhile, GripperAI uses AI-driven grasp planning to enable robots to pick previously unseen objects without programming or teach-in training. Together, the technologies point toward a future where deploying robotic picking systems becomes faster, simpler, and more adaptable across logistics, manufacturing, and packaging environments.

Path Robotics launches Rove mobile welding platform powered by Obsidian physical AI model

Path Robotics is taking robotic welding beyond the factory cell with the launch of Rove, a mobile welding system that combines the company’s Obsidian physical AI platform with a quadruped robot. Designed for shipyards, heavy construction sites, and large-scale fabrication environments where moving massive workpieces is impractical, Rove brings autonomous welding directly to the job. The system challenges long-held assumptions about the stability of legged robots for precision industrial work, using AI-driven perception and adaptability to navigate high-variability environments. Early adopters, including maritime manufacturer Saronic, see the technology as a potential step toward modernizing labor-constrained industries and expanding the reach of welding automation.

Robot Research Of The Week

Consistency, not complexity, is the key to teaching robots dexterity, new research suggests

Researchers have shown that when teaching robots complex manipulation skills, better data may matter more than more data. A team from NYU Tandon and the Robotics and AI Institute found that robots learned dexterous tasks such as in-hand object manipulation and coordinated dual-arm movements more effectively from consistent, carefully structured demonstrations generated by planning algorithms than from highly variable examples. The approach enabled robots to achieve near-perfect performance in simulation and strong real-world results without additional training, highlighting a growing convergence between classical motion planning and AI. The findings suggest that improving the quality of synthetic training data could be a key step toward more capable and adaptable robotic manipulation systems.

Teaching Robots to Harvest Tomatoes Without Touching a Tomato

Researchers in Japan are building virtual tomato farms to train harvesting robots without collecting massive amounts of real-world data. By generating synthetic greenhouse environments and automatically labeled images, the approach could accelerate agricultural robotics development, reduce costs, and demonstrate how digital twins are becoming a critical tool for the future of farming.

Robotics Profile of the Week: Ghost Robotics’ Gavin Kenneally

Ghost Robotics co-founder and CEO Gavin Kenneally has spent the last decade pursuing a different vision for legged robotics. While much of the industry chased viral videos, humanoids, and billion-dollar valuations, Ghost focused on building rugged quadrupeds that solve real-world problems. From military installations and disaster response to construction sites, oil fields, and critical infrastructure, the company’s Vision 60 robots are designed to keep humans out of harm’s way. In this profile, Kenneally discusses the lessons of sustainable growth, why Ghost is avoiding the humanoid race, and how practical deployments, not internet fame, are shaping the future of robotics.

Robot Video Of The Week

Deep Robotics is continuing to push its humanoid ambitions forward with upgrades to its DR02 humanoid robot, improving both payload capacity and obstacle-crossing performance. The enhancements are aimed at helping the robot tackle more demanding industrial tasks and operate in increasingly complex real-world environments. As humanoid developers race to move beyond technology demonstrations, Deep Robotics says each iteration of DR02 brings the industry closer to practical deployments where humanoids can serve as productive tools rather than experimental showcases.