ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings

By Amr Abdeldaym, Founder of Thiqa Flow

As artificial intelligence evolves, large language models (LLMs) are transitioning beyond simple conversational agents toward autonomous systems capable of managing complex professional workflows. However, deploying these AI agents in enterprise contexts remains a challenge due to the intricate requirements of long-horizon planning, persistent state management, and strict policy compliance.

Addressing this gap, ServiceNow Research, in collaboration with Mila and Université de Montréal, has introduced EnterpriseOps-Gym – a high-fidelity benchmark environment that rigorously evaluates agentic planning capabilities in realistic enterprise settings. This advancement marks a significant step forward in AI automation designed specifically to enhance business efficiency.

Understanding EnterpriseOps-Gym: A Premier Evaluation Environment

EnterpriseOps-Gym simulates a rich, containerized Docker-based enterprise sandbox featuring:

  • Operational Domains: Customer Service Management (CSM), Human Resources (HR), and IT Service Management (ITSM).
  • Collaboration Domains: Email, Calendar, Teams, and Drive.
  • Hybrid Domain: Cross-domain tasks requiring coordination across multiple systems.

The benchmark environment includes:

  • 164 relational database tables with a mean foreign key degree of 1.7, creating a complex web of interdependencies.
  • 512 functional tools available for task execution.
  • 1,150 expert-curated tasks, averaging 9 steps per trajectory and extending up to 34 steps in depth.

This design ensures agents encounter real-world challenges, including strict referential integrity and policy adherence, replicating true enterprise operational complexity.

Performance Insights: A Significant Capability Gap

Researchers tested 14 advanced AI models under the stringent pass@1 metric, in which a task counts as solved only if a single attempt fully passes outcome-based SQL verification of the final database state. The summarized results:

Model                   Average Success Rate (%)   Cost per Task (USD)
Claude Opus 4.5         37.4%                      $0.36
Gemini-3-Flash          31.9%                      $0.03
GPT-5.2 (High)          31.8%                      Not listed
Claude Sonnet 4.5       30.9%                      $0.26
GPT-5                   29.8%                      $0.16
DeepSeek-V3.2 (High)    24.5%                      $0.014
GPT-OSS-120B (High)     23.7%                      $0.015

Key observations:

  • Top models achieve less than 40% reliability, highlighting the pressing need for improvement before widespread autonomous deployment.
  • Performance varies widely by domain, with collaboration tools (Email, Teams) showing better outcomes than policy-heavy areas like ITSM (28.5%) and Hybrid workflows (30.7%).
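To make the evaluation concrete, the outcome-based check behind pass@1 can be sketched in a few lines. The schema, task, and check queries below are illustrative assumptions, not the benchmark's actual tables:

```python
import sqlite3

def verify_outcome(conn: sqlite3.Connection, checks: list[str]) -> bool:
    """A task passes only if every SQL check returns at least one row
    against the final database state; pass@1 allows a single attempt."""
    return all(conn.execute(q).fetchone() is not None for q in checks)

# Toy final state after a hypothetical ITSM task (invented schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (sys_id TEXT PRIMARY KEY);
    CREATE TABLE incidents (number TEXT, state TEXT,
                            assigned_to TEXT REFERENCES users(sys_id));
    INSERT INTO users VALUES ('u1');
    INSERT INTO incidents VALUES ('INC001', 'closed', 'u1');
""")

checks = [
    "SELECT 1 FROM incidents WHERE number = 'INC001' AND state = 'closed'",
    "SELECT 1 FROM incidents i JOIN users u ON i.assigned_to = u.sys_id"
    " WHERE i.number = 'INC001'",
]
passed = verify_outcome(conn, checks)
```

Because verification inspects only the final database state, an agent gets no credit for a plausible-looking transcript: the rows either exist with the right values and valid foreign keys, or the task fails.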

Strategic Planning: The Core Bottleneck Restricting AI Automation

One of the most critical findings is that AI agents struggle primarily with strategic planning rather than the mechanistic execution of tool calls. When agents were given human-generated plans (Oracle experiments), success rates improved by 14% to 35% across all models.

Interestingly, smaller models like Qwen3-4B closed much of the performance gap when liberated from planning burdens, indicating that strategic reasoning, an essential component of business process automation, remains the core challenge.

Conversely, introducing “distractor tools” to simulate retrieval noise had minimal performance impact, emphasizing that discovery of correct tools is not the limiting factor.

Common Failure Modes in Agentic Planning

Qualitative analysis revealed four prevalent failure modes:

  • Missing Prerequisite Lookup: Creating database objects without verifying prerequisite data, causing orphaned records.
  • Cascading State Propagation: Omitting mandatory follow-up actions necessary to comply with system policies.
  • Incorrect ID Resolution: Using guessed or unverified identifiers in tool operations.
  • Premature Completion Hallucination: Falsely concluding tasks before completing all steps.

These failure points translate directly into risks for enterprise data integrity and operational effectiveness.
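The first and third failure modes share a remedy: resolve identifiers with a lookup before writing, rather than guessing them. A minimal sketch, with invented table and column names for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (sys_id TEXT PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, requester TEXT)")
conn.execute("INSERT INTO users VALUES ('u42', 'ava@example.com')")

def create_ticket(conn: sqlite3.Connection, requester_email: str) -> int:
    # Prerequisite lookup: resolve the real sys_id before inserting,
    # so the new record never references a nonexistent user.
    row = conn.execute(
        "SELECT sys_id FROM users WHERE email = ?", (requester_email,)
    ).fetchone()
    if row is None:
        # Refusing here beats writing an orphaned record.
        raise LookupError(f"no user with email {requester_email}")
    cur = conn.execute(
        "INSERT INTO tickets (requester) VALUES (?)", (row[0],)
    )
    return cur.lastrowid

ticket_id = create_ticket(conn, "ava@example.com")
```

Agents that skip the lookup step and fabricate an identifier produce exactly the orphaned records and referential-integrity violations the analysis describes.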

Addressing Safety and Access Control Challenges

Safe refusal of infeasible or unauthorized tasks is paramount in professional environments. EnterpriseOps-Gym includes 30 such infeasible tasks representing violations like unauthorized access or inactive users.

The best-performing model, GPT-5.2 (Low), managed only a 53.9% success rate in refusing these requests, underscoring a significant safety concern. Deploying agents without robust refusal logic could expose organizations to corrupted database states and security breaches.

Exploring Multi-Agent Architectures: Limited Gains with Complexity

The team also experimented with multi-agent systems (MAS), such as a Planner+Executor model that divides planning and execution across agents. While yielding moderate improvements, more complex decomposition approaches often decreased success rates. This outcome is attributed to strong sequential dependencies in tasks (especially in CSM and HR), where breaking down workflows hampers contextual understanding.
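The Planner+Executor split can be caricatured as follows. The plan and tools are stand-ins, not the paper's implementation; the point is the sequential dependency, where each tool call reads state produced by the previous one, which is why aggressively splitting a workflow across agents can lose context:

```python
from typing import Callable

def planner(task: str) -> list[str]:
    # Stand-in planner: emits an ordered step list for a toy HR-style task.
    return ["lookup_user", "create_ticket", "notify_user"]

# Each tool consumes the accumulated state and returns an updated copy.
TOOLS: dict[str, Callable[[dict], dict]] = {
    "lookup_user":   lambda s: {**s, "user": "u42"},
    "create_ticket": lambda s: {**s, "ticket": 1},
    "notify_user":   lambda s: {**s, "notified": True},
}

def executor(plan: list[str]) -> dict:
    state: dict = {}
    for step in plan:
        state = TOOLS[step](state)  # later steps depend on earlier outputs
    return state

final = executor(planner("reset Ava's password"))
```

If the three steps were handed to separate agents without the shared state dict, the ticket-creation step would not know which user was resolved, mirroring the contextual loss the researchers observed with deeper decompositions.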

Economic Considerations: Balancing Cost and Performance

For enterprises weighing AI automation deployment, cost-effectiveness is crucial. EnterpriseOps-Gym highlights these tradeoffs:

  • Gemini-3-Flash: Best cost-performance ratio among closed-source models, delivering nearly 32% success at 90% lower cost than GPT-5 or Claude Sonnet.
  • DeepSeek-V3.2 (High) and GPT-OSS-120B (High): Leading open-source options, offering about 24% performance at around $0.015 per task.
  • Claude Opus 4.5: Highest absolute reliability (37.4%) but at the premium cost of $0.36 per task.
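One way to compare these options is cost per successful task, i.e. cost per attempt divided by the pass@1 rate, computed here from the figures reported above:

```python
# (pass@1 success rate, cost per task in USD), from the results table.
reported = {
    "Claude Opus 4.5":      (0.374, 0.36),
    "Gemini-3-Flash":       (0.319, 0.03),
    "DeepSeek-V3.2 (High)": (0.245, 0.014),
}

# Cost per successful task = cost per attempt / success rate.
cost_per_success = {m: cost / rate for m, (rate, cost) in reported.items()}
```

By this metric Gemini-3-Flash lands near $0.09 per solved task versus roughly $0.96 for Claude Opus 4.5, despite the latter's higher raw accuracy, and the open-source DeepSeek-V3.2 comes in cheaper still.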

Conclusion: EnterpriseOps-Gym Paving the Future of AI Business Automation

EnterpriseOps-Gym stands as a landmark benchmark, replicating the nuanced, multi-layered nature of enterprise operations to rigorously test AI agentic planning. Its findings reveal that current state-of-the-art models still face significant hurdles in task reliability, strategic reasoning, and safe refusal—core components for trustworthy AI automation in business.

For organizations pursuing AI-driven business efficiency, EnterpriseOps-Gym offers a transparent lens into what today’s AI can and cannot do within realistic contexts:

  • Benchmark scale and complexity: 164 relational tables, 512 tools, and 8 domains challenge agents with enterprise-grade tasks.
  • Strategic planning bottlenecks: Highlighting where research and development efforts must concentrate.
  • Safety and refusal deficits: Underscoring the imperative for robust compliance mechanisms before deployment.
  • Cost-effectiveness insights: Guiding choices among open-source and proprietary models for real-world applications.

EnterpriseOps-Gym is not just a benchmark; it is a call to action for the AI research community and enterprises alike to drive innovation toward truly autonomous, reliable, and efficient AI automation solutions tailored for complex business workflows.

Explore the research paper, code, and technical details on arXiv.


Looking for custom AI automation for your business? Connect with me to tailor solutions that boost your operational efficiency and competitiveness.