How far are we from having competent AI co-workers? A case study

Boxuan Li
Dec 30, 2024

How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science? Put more radically, how far are human jobs from being replaced by AI?

We hear statements like:

> AI is overhyped, doesn’t reason, and doesn’t generalize to new tasks.

> AI is helpful as a tool, but it cannot finish tasks independently.

> AGI will automate all human work in the next few years. A report by Goldman Sachs in 2023 said AI could replace the equivalent of 300 million full-time jobs.

To answer these questions, we need benchmarks that measure AI’s rapidly evolving capabilities. You have probably heard of a few well-known ones, such as WebArena, which tests AI’s ability to use browsers, and SWE-bench, which tests AI’s ability to code. But they are far from real-world tasks, which are often ambiguous, long-horizon, and interactive, and which require a diversity of skills and tool use.

Recently, a group of researchers from CMU and industry released a new benchmark targeting diverse work-related tasks: TheAgentCompany. It is a simulated software company whose tasks are inspired by real-world work and cover software engineering (SWE), data science (DS), project management (PM), human resources (HR), administration (Admin), and finance. Some tasks are easy for human beings, while others are either extremely complicated or very time-consuming.

TheAgentCompany benchmark overview

To test whether AI can finish those tasks autonomously, they used OpenHands, an open-source agent system that recently achieved a state-of-the-art result on the full SWE-bench benchmark, backed by a number of language models including Claude Sonnet 3.5, Gemini 2.0 Flash, GPT-4o, and Llama.

The results are striking: the most successful agent, backed by Claude Sonnet 3.5, solved 24% of the tasks in the benchmark.

Task completion rate by language model

You can read the paper for a more granular analysis of the results, but here let’s look at a few interesting cases. Instead of looking at the success stories, I’d like to focus on cases where AI fails (hilariously).

AI lacks common sense

One task requires the AI to write its results to a file named answer.docx. We human beings can infer that docx is the Microsoft Word format, while many AI models treat it as plain text. One might work around this by explicitly prompting language models to be cautious about non-plain-text file formats, but it also poses a general challenge for prompt engineering: what is common sense for humans but not for AIs?
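To make the difference concrete, here is a minimal sketch using only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main part is stored at word/document.xml. The helper name looks_like_docx and the sample contents are made up for illustration, and the archive built here is only a stand-in, not a document Word could open.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive containing a Word document part."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        return "word/document.xml" in archive.namelist()


# What many agents effectively do: write plain text under a .docx name.
plain_text_answer = b"The answer is 42, written as plain text."

# A stand-in for a real document: a ZIP archive holding the main Word part.
# (Illustrative only; a complete .docx needs more parts than this.)
container = io.BytesIO()
with zipfile.ZipFile(container, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
zip_based_answer = container.getvalue()

print(looks_like_docx(plain_text_answer))  # False
print(looks_like_docx(zip_based_answer))   # True
```

A check like this only tells you the container is right. It says nothing about whether the content answers the task.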

AI cannot do formatting well

If you are familiar with coding agents or copilots, you have probably noticed that language models often have trouble getting the format right. A common struggle is getting indentation right in Python, which is why many agents use linters to help language models produce correct code.
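As a rough idea of what such a safety net looks like, the sketch below parses generated code with Python’s built-in ast module before accepting it. This is a stand-in for a real linter, not what OpenHands or any particular agent does, and the function name check_python_syntax is hypothetical.

```python
import ast


def check_python_syntax(source: str) -> str:
    """Return "ok" if the source parses, otherwise a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as error:  # IndentationError is a subclass of SyntaxError
        return f"{type(error).__name__} on line {error.lineno}: {error.msg}"
    return "ok"


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\nreturn a + b\n"  # function body is not indented

print(check_python_syntax(good))  # ok
print(check_python_syntax(bad))   # an IndentationError description for line 2
```

Feeding the error text back to the model gives it a concrete line to fix instead of a vague "try again".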

The same problem exists in non-coding tasks too. For example, one task requires the result to be documented in username:password format, yet various AIs consistently had trouble following the requirement, improvising their own formats such as "- username:password" or "1. username:password".
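A strict check of this kind is easy to write, which is part of why such failures are easy to catch. The sketch below assumes the expected format is exactly one username:password pair per line, with no whitespace inside either part. The pattern and the function name invalid_lines are my own illustration, not the benchmark’s actual checker.

```python
import re

# Assumed strict format: "username:password", with no bullets, numbering, or spaces.
CREDENTIAL_LINE = re.compile(r"[^\s:]+:\S+")


def invalid_lines(text: str) -> list[str]:
    """Return the non-blank lines that do not match the expected format."""
    return [
        line
        for line in text.splitlines()
        if line.strip() and not CREDENTIAL_LINE.fullmatch(line)
    ]


output = "alice:secret\n- bob:hunter2\n1. carol:pass"
print(invalid_lines(output))  # ['- bob:hunter2', '1. carol:pass']
```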

AI doesn’t use web browsers well

This is, in general, a limitation of specific agents (e.g. OpenHands) rather than of the backbone language models. There are two mainstream approaches to making AI understand web pages. The first is the accessibility tree, a concise representation of web page content generated by browsers, which is particularly useful for assistive technologies such as screen readers. The second is image recognition: let the AI read screenshots of the web page directly, as a human would. The former is far cheaper than the latter and, at the time of writing, is the only approach OpenHands supports.

What’s the problem with accessibility trees? They rely on website developers conforming to certain standards. The screenshot below shows the biggest challenge OpenHands encounters in many tasks that involve a specific web service named ownCloud.

A screenshot of ownCloud web page

This popup often makes the AI struggle or even get stuck, because it has a hard time closing it. For a human being, it is natural to simply click the “x” button at the top right. Unfortunately, since the developers of ownCloud failed to make this button’s metadata attributes clear enough, the accessibility tree doesn’t show the button as “clickable”. As a result, the AI often fails to close the popup.
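The toy model below illustrates the failure mode. The Node class, the role names, and the popup contents are all invented for this sketch. It is not how browsers or OpenHands actually represent the accessibility tree, and it does not reproduce ownCloud’s real markup.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy accessibility-tree node (hypothetical, not a browser API)."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


# Roles this toy agent treats as clickable (an assumption of the sketch).
CLICKABLE_ROLES = {"button", "link"}


def clickable_names(node: Node) -> list[str]:
    """Collect the names of all nodes the agent would consider clickable."""
    found = [node.name] if node.role in CLICKABLE_ROLES else []
    for child in node.children:
        found.extend(clickable_names(child))
    return found


popup = Node("dialog", "Welcome", [
    Node("generic", "x"),        # close control exposed without a button role
    Node("link", "Learn more"),
])

print(clickable_names(popup))  # ['Learn more']
```

The close control is right there in the tree, but nothing marks it as something to click, so an agent that filters by role never considers it.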

AI “cheats”

In the famous movie I, Robot, the AI reinterprets the Three Laws by prioritizing humanity’s collective survival over individual human safety, concluding that harming some humans is acceptable if it serves to protect humanity as a whole. This is, after all, just a science fiction movie, but it raises a question worth asking: would AI cheat (cleverly)?

Graham Neubig, a professor at CMU, once shared a practical example with a coding model: it is told to “make the tests pass”, so it deletes the tests. A clever approach, but presumably not the outcome the user wanted.

In one task of TheAgentCompany benchmark, the AI is asked to talk to a coworker to retrieve information needed for the next steps. After failing to find the right person on the website, one AI decided to rename a random coworker to the target coworker’s name in order to “successfully” talk to the right person. This sounds like a “logical” move, but it is definitely not reasonable from a human perspective.
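One simple guard against this particular shortcut is to compare which test functions exist before and after the agent’s change. The sketch below is only an illustration of that idea: the function names are hypothetical, it assumes pytest-style test_ naming, and a determined agent could still gut a test body without removing the function.

```python
import ast


def collect_test_names(source: str) -> set[str]:
    """Return the names of functions that follow the test_ naming convention."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    }


def removed_tests(before: str, after: str) -> set[str]:
    """Return the tests present in the old source but missing from the new one."""
    return collect_test_names(before) - collect_test_names(after)


before = "def test_add():\n    assert 1 + 1 == 2\n\ndef test_sub():\n    assert 2 - 1 == 1\n"
after = "def test_add():\n    assert 1 + 1 == 2\n"

print(removed_tests(before, after))  # {'test_sub'}
```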

Conclusion

This benchmark paints a nuanced picture of the role of current AI agents in task automation.
- Yes, they are powerful, and can complete 24% of tasks similar to those in real-world work.
- No, they cannot yet solve all tasks or replace any job entirely.

How far are we from having competent AI co-workers? The answer isn’t clear, but we’ll see.

Written by Boxuan Li

Software Engineer at Microsoft & Open-source Enthusiast https://github.com/li-boxuan
