Most enterprise organizations express high confidence in their ability to recover from disruptions involving agentic artificial intelligence. Yet a new survey of more than 300 IT decision-makers across Australia, New Zealand, Europe, the United Kingdom, and the United States reveals a significant disconnect: only a fraction regularly test those recovery plans to validate that they work. The finding exposes a critical vulnerability in modern business continuity strategies as organizations increasingly integrate autonomous AI systems into their core operations.
The alarming insights come from a survey conducted by Keepit, a Denmark-based, vendor-independent cloud backup and recovery provider. The study, titled "Data Report 2026," found that a striking 94% of respondents were confident their disaster recovery (DR) plans adequately covered agentic AI systems. That assurance, however, is undermined by how rarely the plans are exercised: only 32% reported testing them monthly. The disparity suggests a broad overestimation of readiness, leaving many enterprises exposed to significant operational and financial risk should an AI-related incident occur.
The Rise of Agentic AI and Its Unique Challenges
Agentic AI systems represent a new frontier in artificial intelligence, characterized by their ability to act autonomously, make decisions, and achieve predefined goals without constant human intervention. These systems leverage advanced machine learning models, often operating across various enterprise applications and data repositories, from automating customer service and supply chain logistics to optimizing financial trading and cybersecurity defenses. Their appeal lies in their potential for increased efficiency, innovation, and scalability.
However, the very autonomy that makes agentic AI powerful also introduces unique and complex failure modes. Unlike traditional software systems, agentic AI can exhibit emergent behaviors, make unforeseen decisions, or suffer from data poisoning, model drift, or adversarial attacks that compromise its integrity and functionality. A failure in such a system can have cascading effects, leading to widespread data corruption, service outages, regulatory non-compliance, and severe reputational damage. The integration of these intelligent agents into critical infrastructure necessitates a robust and frequently tested recovery framework, far beyond the scope of conventional IT disaster recovery.

A Troubling Gap: Confidence Versus Control and Preparedness
Beyond the testing frequency, the Keepit survey unearthed more concerning statistics regarding control and certainty. A substantial 33% of IT and security leaders admitted to having only partial control over the use of agentic AI within their organizations. This lack of full oversight can lead to shadow AI implementations, unmonitored deployments, and unmanaged risks, making comprehensive disaster recovery planning exceptionally challenging. Furthermore, 52% of respondents expressed doubts about whether their existing recovery plans truly encompass all possible agentic AI failure scenarios. This ambivalence points to a fundamental uncertainty at the heart of enterprise AI adoption.
Kim Larsen, Group Chief Information Security Officer at Keepit, emphasized the gravity of these findings. "Organizations need to put more emphasis on creating long-term, structured, and tested disaster recovery plans," Larsen stated in a press release accompanying the report. "This also means putting a spotlight on data governance and accountability, which is the foundation for any resiliency plan." His comments highlight the systemic issues at play, where technical solutions must be underpinned by robust governance and clear lines of responsibility.
Inconsistent Testing and Overlooked Critical Systems
The report delved deeper into the nature of existing recovery testing, revealing inconsistencies that further compound the risk. While approximately 90% of organizations reported having evaluated large-scale data recovery at least once, this testing is neither frequent nor systematic across all critical systems. This ad-hoc approach means that while an organization might be able to recover a major database, it might struggle with interconnected systems or less frequently accessed, yet equally vital, components.
A particularly alarming finding was the neglect of identity and access management (IAM) systems in recovery planning. Identity-related systems, such as Microsoft’s Entra ID (formerly Azure Active Directory) and Okta, are foundational to modern enterprise security, controlling access to virtually all applications and data. Despite their criticality, the survey found that these systems are tested far less often than other data systems, particularly productivity applications. On average, productivity applications like Microsoft 365, Google Workspace, and Salesforce are restored four times as frequently as identity applications. The report starkly illustrates this imbalance: "For every four companies who run a yearly test on their productivity workload, only one of them (25%) will have run a test on their identity applications." A compromised or unrecoverable IAM system can effectively lock an organization out of its entire digital infrastructure, making recovery of other data systems moot.
The survey also observed that most restore activity involves single-file downloads. While this reflects routine operational needs and the practical efficiency of retrieving specific files for daily tasks, it suggests that enterprises are more adept at granular recovery than at orchestrating large-scale, multi-system restoration events, which would be typical in an agentic AI system failure. The report’s authors underscored that the true value of backup lies in the ability to recover confidently, correctly, and efficiently, whether the need is small and immediate or broad and time-critical. While larger organizations showed stronger restore activity, the overall pattern indicated a readiness for routine operational issues rather than catastrophic disruptions.

Real-World Stress Tests: A Missed Opportunity for Learning
To gauge whether external, high-profile events influenced restoration behavior, Keepit investigated three significant incidents: the solar flares in May 2024, the CrowdStrike incident in July 2024, and the Microsoft outages in October 2025. These events represented potential widespread disruptions that could have caused data loss or unavailability across various sectors. The solar flares, for instance, posed risks to satellite communications and power grids, potentially impacting data centers. The CrowdStrike incident involved a major cybersecurity vendor, with widespread implications for endpoint security. The Microsoft outages, a recurring concern for cloud-dependent businesses, could lead to significant downtime for critical services.
The results of this investigation were profoundly worrying: none of these events prompted any discernible change in user behavior. There was no sign of increased activity to confirm that backups were working or to test recovery processes in the days and weeks following these incidents. This lack of proactive validation after significant "awareness moments" suggests a dangerous complacency or a reactive mindset that waits for direct impact before considering action.
The report proposed two theories for this observed behavior. First, organizations might not have experienced widespread, immediate restoration needs as a direct result of these specific events, leading to a false sense of security. Second, and perhaps more critically, the results suggest that "awareness moments" – even those with clear potential for disruption – do not automatically translate into changes in established recovery routines or an impetus for testing. This points to a deeper cultural and procedural challenge within enterprises, where lessons from external events are not systematically integrated into internal resilience strategies.
A Proactive Path Forward: Guided Recovery and MCP
The report’s authors advocate a proactive, rather than reactive, response to such events. They suggest that "Organizations can use external events as structured triggers for guided recovery checks – short, repeatable validations that reinforce confidence without requiring large-scale, disruptive exercises." This strategy aims to integrate learning from real-world incidents into ongoing resilience practices, making recovery validation a continuous process rather than an infrequent, burdensome task.
A key recommendation involves implementing "guided recovery" enabled by Model Context Protocol (MCP). This approach opens the door to "asking for help" in the moment that matters, leveraging AI to assist in recovery operations. An MCP-enabled assistant could help identify unhealthy tenants or suspicious patterns in protected data, guiding administrators through the correct recovery steps. This transforms recovery from a complex, potentially error-prone manual process into a manageable, repeatable, and even partially automated one, significantly shortening recovery times and helping organizations meet their recovery time objectives (RTOs) and recovery point objectives (RPOs).
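The report does not publish an interface for these event-triggered checks, but the idea is straightforward to sketch in plain Python. Everything below — the `RecoveryCheck` type, the `run_guided_checks` helper, the 30-day staleness threshold, and the workload names — is an illustrative assumption, not Keepit's product or the MCP specification:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative sketch only: names and thresholds here are assumptions,
# not part of Keepit's service or the Model Context Protocol.

@dataclass
class RecoveryCheck:
    name: str
    priority: int          # lower number = validate first (identity before apps)
    last_tested: datetime  # when a restore of this workload was last validated

def run_guided_checks(checks, max_age_days=30, now=None):
    """Return checks whose last validated restore is older than max_age_days,
    ordered by priority — a short, repeatable validation pass that an external
    event (outage, vendor incident) can trigger instead of a full DR exercise."""
    now = now or datetime.now(timezone.utc)
    stale = [c for c in checks if now - c.last_tested > timedelta(days=max_age_days)]
    return sorted(stale, key=lambda c: c.priority)

if __name__ == "__main__":
    now = datetime(2025, 11, 1, tzinfo=timezone.utc)
    checks = [
        RecoveryCheck("Entra ID (identity)", priority=1,
                      last_tested=datetime(2025, 1, 15, tzinfo=timezone.utc)),
        RecoveryCheck("Microsoft 365 (productivity)", priority=2,
                      last_tested=datetime(2025, 10, 20, tzinfo=timezone.utc)),
        RecoveryCheck("Agentic AI pipeline config", priority=1,
                      last_tested=datetime(2025, 6, 1, tzinfo=timezone.utc)),
    ]
    for c in run_guided_checks(checks, now=now):
        print(f"validate restore: {c.name}")
```

In this toy run, the recently tested productivity workload is skipped while the long-untested identity and AI workloads surface first — mirroring the report's observation that identity systems are the ones most likely to be overdue.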

Kim Larsen reiterated the fundamental importance of clarity and decisive action during an incident: "It all boils down to knowing who is in charge of recovery and which systems are restored first when multiple systems are affected. When decisions are delayed, recovery takes longer than necessary." This underscores the need for clear roles, responsibilities, and prioritized recovery sequences, especially in the complex, interconnected landscape of agentic AI systems.
Broader Implications for Enterprise Resilience
The findings from the Keepit survey carry significant implications across several dimensions of enterprise operations.
Cybersecurity Posture: The lack of robust, tested recovery plans for agentic AI and critical identity systems creates gaping holes in an organization’s overall cybersecurity posture. A successful cyberattack targeting an AI system or IAM infrastructure, coupled with inadequate recovery capabilities, can lead to prolonged outages, data breaches, and systemic compromise, far exceeding the impact of traditional data loss.
Regulatory Compliance: With increasing global regulations like GDPR, CCPA, HIPAA, and emerging AI-specific legislation such as the EU AI Act, organizations face stringent requirements for data protection, system resilience, and accountability. The inability to recover critical AI systems or the data they process efficiently can result in hefty fines, legal repercussions, and a loss of trust from customers and regulators. The Digital Operational Resilience Act (DORA) in the EU, for example, places a strong emphasis on testing digital operational resilience, making the survey’s findings particularly pertinent for financial entities.
Reputational and Economic Impact: Prolonged downtime due to an AI-driven failure can severely damage an organization’s reputation, eroding customer trust and stakeholder confidence. Beyond the direct financial costs of remediation, lost revenue, and potential fines, the long-term impact on brand image and market position can be devastating. The cost of downtime, already estimated to be substantial for traditional IT systems, would likely escalate significantly for failures involving complex, autonomous AI.

The Human Element and Skill Gaps: The reliance on sophisticated AI systems often requires specialized skills for both deployment and recovery. The survey hints at potential skill gaps within IT teams when it comes to managing and recovering agentic AI systems, especially given the observed lack of control and confidence. Investing in training and developing specialized expertise for AI resilience will be crucial.
Recommendations for Robust AI Disaster Recovery
To bridge the gap between confidence and genuine preparedness, enterprises must adopt a multi-faceted approach to AI disaster recovery:
- AI-Specific Risk Assessments: Conduct thorough risk assessments for all agentic AI deployments, identifying unique failure modes, potential impacts, and dependencies.
- Dedicated AI DR Plans: Develop specific, detailed disaster recovery plans tailored to agentic AI systems, distinct from generic IT DR plans. These plans should account for AI-specific challenges like model integrity, data provenance, and ethical considerations during recovery.
- Frequent and Systematic Testing: Implement a rigorous schedule for testing AI recovery plans, including full-scale simulations and granular guided recovery checks, incorporating learnings from external events.
- Prioritize Identity Systems: Elevate the testing frequency and robustness of recovery plans for identity and access management systems, recognizing their foundational role in enterprise security and recovery.
- Robust Data Governance: Establish comprehensive data governance frameworks that ensure data quality, integrity, and traceability for AI systems, enabling confident recovery.
- Leverage AI for Recovery: Explore and adopt tools and methodologies like MCP-enabled assistants for guided recovery, enhancing efficiency and reducing human error during critical incidents.
- Clear Accountability: Define clear roles, responsibilities, and escalation paths for AI disaster recovery, ensuring swift decision-making and coordinated action.
- Proactive Monitoring and Alerting: Implement advanced monitoring solutions that can detect anomalies and potential failures in agentic AI systems, triggering early intervention and recovery protocols.
- Multi-Cloud and Hybrid Strategies: Diversify infrastructure and data storage across multiple cloud providers or hybrid environments to reduce single points of failure for critical AI workloads.
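Larsen's point about knowing "which systems are restored first" amounts to dependency-aware restore ordering: identity comes first because everything else authenticates through it. A minimal sketch using Python's standard-library `graphlib` illustrates the idea; the system names and dependency map are hypothetical examples, not drawn from the report:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what it depends on.
# Identity (IAM) has no dependencies, so it is restored first; workloads
# that authenticate through it follow in a valid order.
dependencies = {
    "entra_id": set(),                                  # identity foundation
    "microsoft_365": {"entra_id"},                      # needs identity
    "salesforce": {"entra_id"},                         # needs identity
    "agentic_ai_workload": {"entra_id", "microsoft_365"},  # needs both
}

restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)  # entra_id first, agentic_ai_workload last
```

Encoding the order in advance, rather than deciding it mid-incident, addresses exactly the delay Larsen warns about: when multiple systems are down, the restore sequence is already settled.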
The Keepit report serves as a stark warning and a critical call to action for enterprises globally. While the allure of agentic AI promises transformative benefits, neglecting the foundational elements of resilience and recovery poses an existential threat. The current confidence-action gap must be urgently addressed through structured planning, rigorous testing, and a proactive embrace of innovative recovery methodologies, ensuring that the promise of AI is matched by robust preparedness for its inevitable challenges.
The full report, "Data Report 2026," is available on the Keepit website (registration required).