OpenAI Tests AI Models on Real Work Tasks on the Path to AGI
How OpenAI tests whether AI can replace human professionals with real work tasks – and what legal risks emerge
New Testing Method: AI vs. Human Labor
OpenAI has developed a new approach to evaluate the performance of its AI models under realistic conditions. The company is asking freelancers to upload real work assignments from current or previous jobs. The goal is to verify whether the new AI models can actually replace human workers.
This evaluation process has been running since September 2024 and directly compares the output of AI models with the work of human professionals. OpenAI describes the approach as an important indicator of progress toward AGI (Artificial General Intelligence), a state in which AI outperforms human workers at most economically relevant tasks.
How the Test Works
According to a document from OpenAI and the training data company Handshake AI, freelancers must submit two things:
- The specific task assignment from their clients
- The corresponding work result (e.g., Word document, PDF, or Excel file)
OpenAI explicitly emphasizes that these must be actual results, not just a description of the task. This is meant to verify whether the results of the new AI models are qualitatively comparable to human work.
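OpenAI has not published the technical format of these submissions. Purely as an illustration, a submission pairing the task brief with the finished deliverable could be represented roughly like this; all field names and checks below are assumptions, not OpenAI's actual schema:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class WorkSample:
    """One freelancer submission: the original task brief plus the finished deliverable."""
    task_brief: str       # the assignment as received from the client
    deliverable: Path     # the actual work product (e.g. .docx, .pdf, .xlsx)

    def validate(self) -> None:
        # The deliverable must be a real file, not just a description of the work.
        if not self.deliverable.is_file():
            raise FileNotFoundError(f"Missing deliverable: {self.deliverable}")
        if self.deliverable.suffix.lower() not in {".docx", ".pdf", ".xlsx"}:
            raise ValueError(f"Unsupported file type: {self.deliverable.suffix}")

sample = WorkSample(
    task_brief="Draft a two-page itinerary for a seven-day Bahamas yacht trip.",
    deliverable=Path("itinerary_draft.pdf"),
)
# sample.validate()  # would raise FileNotFoundError unless the PDF actually exists
```

The point of the pairing is that the AI model can be given the same task brief and its output compared against the human deliverable.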
Example from Practice
A specific example comes from a Senior Lifestyle Manager at a luxury concierge company: the task was to create a two-page PDF draft for a seven-day yacht trip to the Bahamas, including information about the traveling family's interests and the planned itinerary.
Sobering Results
Despite enormous progress in large language models, significant gaps remain: according to the "Remote Labor Index", even the most capable model tested completes only about three percent of tasks satisfactorily.
Legal Risks: Intellectual Property and Confidentiality
The method carries significant legal risks. Evan Brown, an intellectual property attorney at the US law firm Neal & McDevitt, warns:
- AI labs could face lawsuits for misappropriation of trade secrets
- Freelancers could get into legal trouble if they share documents from previous employment relationships, even if those documents have been sanitized beforehand
OpenAI explicitly requires removing employers' intellectual property and personal data, and the documents mention an internal ChatGPT tool called "Superstar Scrubbing" that provides guidance on removing sensitive information. Even so, sanitized documents may still fall under non-disclosure agreements.
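The documents do not reveal how "Superstar Scrubbing" works internally. As a rough, hypothetical illustration of the kind of pre-submission sanitization described, a freelancer-side pass might redact obvious identifiers before upload; the patterns below are assumptions and would not be sufficient on their own:

```python
import re

# Very rough redaction pass over extracted document text before submission.
# Real sanitization would also need to cover names, client-specific terms,
# embedded metadata, and anything covered by an NDA.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

print(redact("Contact jane.doe@client.com or +41 44 123 45 67 for details."))
# -> Contact [EMAIL REMOVED] or [PHONE REMOVED] for details.
```

Automated redaction of this kind reduces, but does not eliminate, the legal exposure the attorney describes: confidentiality obligations can cover the structure and substance of a document, not just the identifiers in it.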
Billion-Dollar AI Training Market
The demand for high-quality training data has increased massively in recent years. AI companies like OpenAI have relied for years on third-party providers such as Surge, Mercor, or Scale AI to build networks of data contractors. Higher requirements command correspondingly higher compensation and have turned this work into an extremely lucrative niche:
- Handshake AI was valued at around $3.5 billion in 2022
- Surge was reportedly valued at $25 billion in summer 2025
Conclusion: Still a Long Way to AGI
OpenAI's test makes one thing clear: despite impressive progress in large language models, the path to true AGI is still a long one. A success rate of only about three percent shows that while AI models can excel in specific areas, they are still far from comprehensively replacing human workers.
For businesses, this means: AI should be understood as a support tool, not as a complete replacement for human expertise. The combination of AI efficiency and human judgment is likely to remain the most promising approach for the foreseeable future.
A Different Approach: Aligning AI Directly with Company Processes
OpenAI's test reveals a central challenge: Generic AI models often cannot satisfactorily solve real work tasks because they lack specific company context. A more promising approach is to equip AI agents directly with a company's knowledge and processes.
With this approach, AI agents are not tested in isolation, but are tailored from the start to an organization's specific requirements and workflows. Company knowledge – from documentation, guidelines, or best practices – flows directly into the agent configuration. This creates customized assistants that are not just theoretically capable, but actually work in daily operations.
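As a simplified, hypothetical sketch of this idea (not a description of any specific product), an agent can be grounded in company knowledge by retrieving relevant passages from internal documents and placing them in the prompt before the model answers; the retrieval step and prompt layout below are assumptions:

```python
# Minimal retrieval-augmented prompt assembly: the agent answers on top of
# company documents instead of relying on generic model knowledge alone.
company_docs = {
    "travel_policy.md": "Business trips over CHF 2000 require written approval.",
    "style_guide.md": "Client-facing documents use formal address and the corporate template.",
}

def retrieve(question: str, docs: dict[str, str], top_k: int = 2) -> list[str]:
    # Toy keyword-overlap scoring; a production system would use embeddings.
    q_words = set(question.lower().split())
    scored = sorted(docs.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return [f"[{name}] {text}" for name, text in scored[:top_k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, company_docs))
    return (
        "Answer using only the company context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("Does a CHF 3000 business trip need approval?"))
```

The design choice matters more than the specific code: the model is evaluated and used together with the organization's own context, rather than in isolation against generic tasks.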
Structured introduction programs, such as our 100-Day Introduction Program, enable companies to approach this process systematically: From identifying suitable use cases to integrating company knowledge to gradually embedding AI in daily work.
Interested in Secure AI Use in Your Organization?
At Evoya AI, we offer Swiss AI solutions that combine data protection and practical applicability. Contact us for a non-binding initial consultation or test our platform for free.