Survey: Enterprises Say They Are Ready for Agentic AI Failures, but Few Test Recovery Often

Most enterprise organizations globally express high confidence in their ability to recover from disruptions involving agentic AI systems, yet a recent comprehensive survey reveals a critical disconnect: a significant majority rarely test these crucial recovery plans with sufficient frequency to validate their effectiveness. This paradox, uncovered by a study involving over 300 IT decision-makers across Australia, New Zealand, Europe, the United Kingdom, and the United States, casts a shadow over the true resilience of modern enterprises increasingly reliant on autonomous artificial intelligence.

The survey, led by Keepit, a Denmark-based vendor-independent cloud backup and recovery service, found a stark contrast between perception and practice. Some 94% of respondents said their existing disaster recovery (DR) strategies adequately cover agentic AI systems, yet only 32% of those same organizations reported testing their agentic AI recovery plans monthly. Such infrequent validation leaves key assumptions untested, which could mean prolonged downtime and significant data loss in an actual AI-driven disruption.

The Expanding Landscape of Agentic AI and Its Unique DR Challenges

Agentic AI refers to artificial intelligence systems capable of autonomous decision-making, goal-setting, and execution without constant human intervention. These systems are designed to perceive their environment, act upon it, and adapt to achieve specific objectives, often learning and evolving independently. Examples span from advanced automation bots managing complex IT infrastructures to sophisticated algorithms optimizing supply chains, financial trading, or even critical security operations. The integration of agentic AI promises unprecedented efficiency and innovation, but it also introduces novel and complex challenges for disaster recovery and business continuity planning. Unlike traditional software systems, agentic AI’s autonomous nature means failures can propagate rapidly, leading to cascading effects across interconnected systems. Data poisoning, unexpected algorithmic biases, or even malicious manipulation of an agentic system could trigger widespread operational paralysis, making traditional, reactive recovery strategies potentially inadequate.

"Organizations need to put more emphasis on creating long-term, structured, and tested disaster recovery plans," stated Kim Larsen, Group Chief Information Security Officer at Keepit, in a public statement accompanying the report. Larsen further emphasized the foundational role of robust data governance and accountability, asserting that these elements are indispensable for any effective resiliency plan. His comments underscore the growing industry consensus that as AI systems become more autonomous, the need for proactive, rigorously tested recovery protocols becomes paramount.

Survey: Enterprises Say They Are Ready for Agentic AI Failures, but Few Test Recovery Often -- Campus Technology

Gaps in Control and Pervasive Doubts

The survey’s findings extend beyond testing frequency, highlighting more profound systemic issues. A troubling 33% of IT and security leaders confessed to having only partial control over the deployment and operation of agentic AI within their organizations. This lack of comprehensive oversight creates significant blind spots, making it challenging to accurately assess risks, implement security measures, and, crucially, formulate effective recovery strategies. Furthermore, a substantial 52% of respondents expressed explicit doubts regarding whether their current recovery plans genuinely address the intricate scenarios that could arise from agentic AI failures. This internal skepticism among key decision-makers points to a significant knowledge gap or a realization that traditional DR frameworks may not be sufficiently agile or comprehensive for the complexities of AI-driven environments.

These figures paint a concerning picture in which enterprises are adopting powerful, autonomous AI technologies without fully understanding or mitigating the associated risks. The potential implications range from financial losses due to operational downtime and reputational damage from service interruptions to severe regulatory penalties for data breaches or compliance failures stemming from inadequately managed AI systems.

Inconsistent Testing Regimes and Overlooked Critical Systems

While 90% of organizations reported having evaluated large-scale data recovery at least once, the survey revealed a pervasive inconsistency in testing across various systems. Recovery exercises, when conducted, are often not frequent or systematic enough to truly validate readiness across the entire IT landscape. This sporadic approach means that while some critical data may be recoverable, the interconnectedness of modern IT infrastructure dictates that a chain is only as strong as its weakest link.

One of the most alarming findings was how often identity and access management (IAM) systems are overlooked in recovery planning. Identity systems such as Microsoft Entra ID (formerly Azure Active Directory) and Okta, which serve as the gatekeepers for user authentication and access to virtually all enterprise resources, are tested far less frequently than other data systems. The report noted a wide disparity: productivity applications like Microsoft 365, Google Workspace, and Salesforce are restored, on average, four times as frequently as identity applications. "For every four companies who run a yearly test on their productivity workload, only one of them (25%) will have run a test on their identity applications," the report stated.

This disparity represents a critical vulnerability. In the event of a major disaster, if identity systems are compromised or irrecoverable, the ability to authenticate users, restore access to critical applications, and consequently, resume business operations, would be severely hampered, regardless of how quickly other data systems are brought back online. The recovery of an enterprise hinges on the ability of its legitimate users to access their tools and data, making IAM systems foundational to any comprehensive DR strategy.

The Predominance of Granular Restores and the Illusion of Readiness

The survey also delved into the nature of recovery activities, finding that most restore operations involve single-file downloads. This reflects routine operational needs—such as retrieving a mistakenly deleted document or an older version of a file—rather than large-scale, enterprise-wide recovery events. While such granular restores are vital for daily productivity and demonstrate the functionality of backup systems for individual items, they can create a false sense of security regarding an organization’s readiness for a catastrophic event. The ability to recover a single file does not automatically translate to the capability to restore an entire system, an application stack, or an entire data center following a major outage or cyberattack, especially one involving the complex interdependencies of agentic AI.

The report’s authors underscored that the true value of backup lies not just in data preservation, but in the confidence, correctness, and efficiency of recovery, whether the need is small and immediate or broad and time-critical. Interestingly, restore activity was found to be notably stronger among larger organizations, suggesting that enterprises with greater resources might be more actively engaged in recovery operations, even if not always for comprehensive DR validation.

Missed Opportunities: External Events as Catalysts for Change

To gauge whether real-world crises influenced recovery behavior, Keepit examined three significant external events in 2024 and 2025 that could have caused widespread data loss or unavailability:

Solar Flares in May 2024: A significant geomagnetic storm that prompted concerns about widespread communication and power grid disruptions.
CrowdStrike Incident in July 2024: A faulty update to a widely used endpoint protection platform that caused outages for numerous organizations globally.
Microsoft Outages in October 2025: Large-scale disruptions to Microsoft’s extensive cloud services, including Microsoft 365 and Azure.

The results of this investigation were particularly concerning: none of these high-profile "awareness moments" prompted any measurable change in user behavior. There was no discernible increase in activity to confirm the functionality of backups in the days and weeks following these events. This behavioral inertia suggests that even when faced with palpable threats, organizations often fail to translate external crises into internal actions to fortify their recovery postures.

The report proposed two primary theories for this phenomenon. First, it’s possible that organizations did not experience widespread, immediate restoration needs as a direct consequence of these specific events. While the potential was high, the actual impact might have been mitigated by various factors. Second, and perhaps more critically, the results suggest that mere "awareness moments" do not automatically translate into fundamental changes in ingrained recovery routines. The perceived urgency of a potential threat dissipates without a structured mechanism to convert that awareness into proactive validation and improvement of recovery plans.

A Call for Proactive Resilience: Guided Recovery and MCP

The solution, according to the report’s authors and industry experts, lies in shifting from a reactive stance to a proactive one. Instead of waiting for a catastrophic failure to expose vulnerabilities, organizations should leverage external events as "structured triggers for guided recovery checks." These are defined as short, repeatable validations designed to reinforce confidence in backup and recovery capabilities without necessitating large-scale, disruptive exercises. Such a proactive approach transforms potential threats into opportunities for continuous improvement and validation.
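One way to picture such a trigger is a small script that lists which systems are due for a restore check. The sketch below is illustrative only and does not come from the report: the `RecoveryCheck` record, the `checks_due` function, the seven-day event window, and the sample dates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RecoveryCheck:
    """Hypothetical record of when a system's restore was last validated."""
    system: str
    last_tested: date
    max_age_days: int  # routine interval allowed between restore tests


def checks_due(checks, today, external_event=False, event_window_days=7):
    """Return the systems whose last restore test is older than allowed.

    Assumption: an external event (an outage in the news, for example)
    tightens the allowed age to `event_window_days` for every system.
    """
    due = []
    for check in checks:
        limit = check.max_age_days
        if external_event:
            limit = min(limit, event_window_days)
        if (today - check.last_tested).days > limit:
            due.append(check.system)
    return due


# Sample data, invented for illustration.
checks = [
    RecoveryCheck("identity", date(2025, 1, 15), max_age_days=90),
    RecoveryCheck("productivity_suite", date(2025, 10, 1), max_age_days=30),
    RecoveryCheck("crm", date(2025, 10, 28), max_age_days=30),
]
today = date(2025, 10, 30)

print(checks_due(checks, today))                       # routine schedule
print(checks_due(checks, today, external_event=True))  # after an external event
```

With this sample data, a routine day flags only the identity system, while an external event also flags the productivity suite, whose last test was 29 days earlier. The point of the design is that the check stays short and repeatable rather than becoming a full-scale exercise.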

Furthermore, the report champions the implementation of "guided recovery" enabled by Model Context Protocol (MCP). MCP is proposed as a framework that allows systems to "ask for help" at critical moments, providing context-aware assistance during recovery operations. An MCP-enabled assistant could, for instance, help identify unhealthy tenants or suspicious patterns within protected data. It could then guide administrators through the precise, step-by-step recovery process, effectively transforming complex, potentially chaotic recovery events into manageable, repeatable procedures. This approach minimizes human error, accelerates decision-making, and reduces the overall mean time to recovery (MTTR).

"It all boils down to knowing who is in charge of recovery and which systems are restored first when multiple systems are affected," Larsen reiterated, emphasizing the critical importance of clear roles, responsibilities, and prioritized recovery sequences. "When decisions are delayed, recovery takes longer than necessary," he warned, highlighting the direct correlation between preparedness and business continuity.
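Larsen's point about ownership and sequencing can be made concrete with a short sketch. The example below is an assumption-laden illustration, not something the report prescribes: the system names, the `DEPENDS_ON` map, the `OWNERS` table, and the `restore_order` function are hypothetical. It orders the affected systems so that each one is restored only after the systems it depends on, which places identity first.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical map: each system lists what must be restored before it.
DEPENDS_ON = {
    "identity": set(),
    "email": {"identity"},
    "crm": {"identity"},
    "ai_agents": {"identity", "crm"},
}

# Hypothetical owners, so it is clear who is in charge of each restore.
OWNERS = {
    "identity": "IAM team",
    "email": "Workplace IT",
    "crm": "Sales operations",
    "ai_agents": "Platform engineering",
}


def restore_order(affected):
    """Order the affected systems so dependencies are restored first.

    Systems that become ready at the same time are sorted by name so the
    resulting plan is the same on every run.
    """
    affected = set(affected)
    # Only dependencies that are themselves affected constrain the order.
    graph = {name: DEPENDS_ON[name] & affected for name in affected}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


for step, system in enumerate(restore_order(DEPENDS_ON), start=1):
    print(f"{step}. {system} (owner: {OWNERS[system]})")
```

Writing the sequence and the owners down ahead of time, in whatever form, is what removes the decision delay Larsen describes; the code is just one compact way to express it.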

Broader Implications for Business Resilience and Compliance

The findings of the Keepit survey carry profound implications for enterprise resilience, regulatory compliance, and the future of digital operations. In an era where data is the lifeblood of business and AI is rapidly becoming its nervous system, the inability to reliably recover from disruptions poses an existential threat. Inadequate recovery plans for agentic AI systems not only risk operational paralysis but also expose organizations to significant regulatory scrutiny. Data protection regulations like GDPR, CCPA, and others mandate robust data availability and recovery capabilities. Failures stemming from untested AI recovery plans could lead to substantial fines, legal challenges, and severe damage to an organization’s reputation and customer trust.

The push towards agentic AI, while promising innovation, demands a parallel evolution in an organization’s approach to risk management and disaster recovery. This includes not just investing in sophisticated backup technologies, but also fostering a culture of continuous testing, clear accountability, and proactive adaptation. Enterprises must move beyond mere confidence to demonstrated capability, ensuring that their readiness for agentic AI failures is not just a statement, but a verified, operational reality.

The full report, offering detailed insights and further recommendations, is accessible on the Keepit website, though registration is required to download it. It serves as a crucial wake-up call for IT leaders worldwide to bridge the gap between perceived readiness and actual resilience in the age of autonomous AI.
