Survey: Enterprises Say They Are Ready for Agentic AI Failures, but Few Test Recovery Often

Most enterprise organizations say they are ready to recover from disruptions involving agentic AI, but a new survey of more than 300 IT decision-makers from Australia, New Zealand, Europe, the United Kingdom, and the United States suggests relatively few test those plans often enough to prove it. The findings highlight a concerning chasm between perceived preparedness and demonstrable resilience in an increasingly AI-driven operational landscape, raising critical questions about the robustness of modern business continuity strategies.

The Rise of Agentic AI and Its Unique Risks

The rapid integration of artificial intelligence, particularly agentic AI, into enterprise operations marks a significant technological shift. Agentic AI refers to AI systems designed to act autonomously, making decisions and executing tasks without constant human oversight. These systems, often deployed in areas like automated customer service, supply chain optimization, financial trading, and cybersecurity, promise enhanced efficiency and innovation. However, their autonomous nature also introduces a new category of risk. Unlike traditional software, agentic AI systems can propagate errors, make unexpected decisions, or even experience "hallucinations" that lead to cascading failures across interconnected systems. The potential for such autonomous errors to trigger widespread data loss, operational downtime, or even security breaches necessitates a fundamentally different approach to disaster recovery and business continuity planning.

Industry analysts expect the global AI market to keep growing rapidly as enterprise adoption rises. This uptake, while beneficial, often outpaces the development and implementation of robust governance and recovery frameworks. Cybersecurity experts frequently warn that the speed of AI deployment, coupled with an incomplete understanding of its failure modes, could leave organizations vulnerable. The stakes are high: a major disruption involving agentic AI could not only cripple operations but also inflict severe reputational damage, financial losses, and regulatory penalties.

The Disconnect: Confidence Versus Preparedness


The survey, conducted by Keepit, a vendor-independent cloud backup and recovery service based in Denmark, revealed a striking dichotomy. A substantial 94% of respondents expressed confidence that their disaster recovery plans adequately covered agentic AI systems. This high level of self-assuredness, however, stands in stark contrast to the frequency of their recovery plan testing. Only 32% of these confident organizations reported testing their plans monthly, suggesting a significant portion may be relying on untested assumptions rather than proven capabilities.

This gap between confidence and verified readiness is a critical vulnerability. Industry best practices for disaster recovery and business continuity unequivocally advocate for frequent and rigorous testing. Annual or semi-annual tests are often considered the bare minimum for traditional IT systems, but for complex, autonomously operating AI systems, more frequent validation is imperative. The dynamic nature of AI models, their continuous learning processes, and their integration with evolving data sets mean that recovery plans can quickly become outdated if not regularly re-evaluated and tested against realistic failure scenarios. The cost of downtime, which can range from thousands to millions of dollars per hour depending on the industry and scale of the enterprise, underscores the financial imperative for robust and tested recovery capabilities. Data from various research firms consistently show that organizations with mature DR testing programs experience significantly shorter recovery times and lower financial impact during actual incidents.

Compounding this issue, a troubling 33% of IT and security leaders admitted to having only partial control over the use of agentic AI within their organizations. Furthermore, 52% harbored doubts about whether their existing recovery plans truly encompassed agentic AI scenarios. This lack of full oversight and internal uncertainty points to a potential governance crisis, where AI initiatives might be progressing without clear, centralized management of associated risks and recovery protocols. Kim Larsen, Group Chief Information Security Officer at Keepit, emphasized this point, stating, "Organizations need to put more emphasis on creating long-term, structured, and tested disaster recovery plans. This also means putting a spotlight on data governance and accountability, which is the foundation for any resiliency plan." Without clear accountability and comprehensive governance, the foundational elements for effective recovery are inherently weak.

Critical Oversight: Identity and Access Management (IAM)

Among the key findings, the survey highlighted that while most organizations had evaluated large-scale data recovery at least once (around 90%), testing was neither frequent nor systematic across all critical systems. A particularly alarming omission in recovery planning was identified in the realm of identity and access management (IAM). Identity-related systems, such as Microsoft’s Entra ID (formerly Azure Active Directory) and Okta, are fundamental to securing access to all enterprise resources, including agentic AI applications. Yet, these systems are tested far less often than other data systems.

Keepit found that productivity applications such as Microsoft 365, Google Workspace, and Salesforce are restored, on average, four times as frequently as identity applications. The report explicitly stated, "For every four companies who run a yearly test on their productivity workload, only one of them (25%) will have run a test on their identity applications." This disparity represents a critical blind spot. If an organization cannot restore its identity and access management systems following a disruption, it loses the ability to authenticate users and control access, and with it the means to recover any other system. A compromised or unavailable IAM system can lead to widespread operational paralysis, unauthorized data access, or the inability to restore critical services, even if the underlying data backups are intact. This oversight could be catastrophic, as modern cybersecurity frameworks increasingly emphasize identity as the new perimeter.
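One way to make this blind spot visible is a simple cadence audit that flags any workload whose last restore test is older than policy allows. The following is a minimal sketch, not something described in the Keepit report: the workload names, the `MAX_TEST_AGE_DAYS` policy values, and the sample dates are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical policy: maximum allowed days between restore tests per workload.
# Identity gets the tightest window because every other restore depends on it.
MAX_TEST_AGE_DAYS = {
    "identity": 30,
    "productivity": 90,
}


def overdue_workloads(last_tested, today, policy=MAX_TEST_AGE_DAYS):
    """Return workloads whose last restore test is older than policy allows.

    last_tested maps a workload name to the date of its last restore test.
    A workload missing from the mapping is treated as never tested.
    """
    overdue = []
    for workload, max_age in policy.items():
        tested_on = last_tested.get(workload)
        if tested_on is None or (today - tested_on) > timedelta(days=max_age):
            overdue.append(workload)
    return sorted(overdue)


# Illustrative data only, not taken from the survey.
last_tested = {
    "identity": date(2025, 1, 10),
    "productivity": date(2025, 5, 1),
}
print(overdue_workloads(last_tested, today=date(2025, 6, 1)))  # ['identity']
```

In this made-up example the productivity workload was tested a month ago and passes, while the identity workload has gone more than four months without a restore test and is flagged.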

The "Awareness Moments" That Weren’t: A Chronology of Missed Opportunities

The report’s authors undertook a specific investigation to determine whether external, high-profile events had prompted any changes in restoration behavior among enterprises. They examined three distinct incidents in recent history that could have caused data loss or unavailability: the solar flares in May 2024, the CrowdStrike incident in July 2024, and the Microsoft outages in October 2025. These events represent a spectrum of potential disruptions, from natural phenomena with widespread infrastructure implications to a faulty software update and major cloud service outages.

May 2024 Solar Flares: A series of intense solar flares and geomagnetic storms impacted Earth, causing disruptions to GPS, satellite communications, and potentially power grids. Such events can interfere with critical digital infrastructure, posing a risk to data centers and network stability.
July 2024 CrowdStrike Incident: A faulty content update to CrowdStrike's Falcon endpoint security software caused Windows machines around the world to crash, disrupting airlines, banks, hospitals, and many other organizations. Although the outage stemmed from a defective update rather than an attack, an incident of this scale would be expected to prompt organizations to verify their backup and recovery systems.
October 2025 Microsoft Outages: Major service outages affecting a cloud giant like Microsoft (e.g., Azure, Microsoft 365) could have vast implications for countless enterprises relying on these platforms for critical operations, data storage, and application hosting. Such an event would undoubtedly prompt organizations to assess their resilience and ability to recover data and services hosted on affected platforms.

The worrying outcome of Keepit’s investigation was that none of these significant external events prompted any measurable change in user behavior. There was no discernible increase in activity to confirm that backups were working in the days and weeks following these incidents. This suggests two primary theories, as proposed by the report’s authors: first, organizations may not have experienced widespread, immediate restoration needs as a direct result of these specific events; and second, and more critically, these "awareness moments" did not automatically translate into proactive changes in recovery routines. This inertia indicates a systemic issue where even clear external triggers fail to motivate a re-evaluation of recovery readiness, leaving organizations perpetually reactive rather than proactive.

Financial and Operational Stakes of Recovery Gaps

The implications of these recovery gaps are profound, extending beyond mere technical inconvenience. Financially, inadequate disaster recovery can lead to staggering costs. Downtime, data loss, and recovery efforts can deplete resources, erode profits, and even threaten business viability. A 2023 report by IBM and Ponemon Institute put the average cost of a data breach at $4.45 million, a record high at the time. The reputational damage from a major outage or data loss event can be even more enduring, leading to customer churn, loss of trust, and negative brand perception.

Operationally, the inability to recover critical systems, especially those powered by agentic AI, can halt core business functions, disrupt supply chains, and impact customer service. In highly regulated industries, such as finance or healthcare, such failures can also trigger severe penalties for non-compliance with data protection and operational resilience mandates. Regulators globally are increasingly scrutinizing how organizations manage risks associated with emerging technologies, including AI. The absence of robust, tested recovery plans for agentic AI could expose enterprises to significant legal and compliance liabilities.

Expert Perspectives and the Path Forward

The report also found that most restore activity involves single-file downloads, reflecting routine operational needs rather than large-scale recovery events. While granular recovery is essential for daily operations, it does not adequately prepare an organization for catastrophic, system-wide failures. The authors note that backup creates value when organizations can recover confidently, correctly, and efficiently, whether the need is small and immediate or broad and time-critical. Restore activity is strong among larger organizations, suggesting that smaller and medium-sized enterprises might be even more vulnerable due to resource constraints or less mature DR strategies.

Kim Larsen’s emphasis on data governance and accountability is a cornerstone for building true resilience. This involves clearly defining roles and responsibilities for data management, establishing policies for AI system deployment and monitoring, and ensuring that recovery objectives are aligned with overall business goals. Without a strong governance framework, recovery efforts can be chaotic, leading to delays and increased risk.

The Promise of Proactive and Guided Recovery (MCP)

Given the observed inertia, the report’s authors advocate for a shift from reactive responses to proactive strategies. They propose that "Organizations can use external events as structured triggers for guided recovery checks – short, repeatable validations that reinforce confidence without requiring large-scale, disruptive exercises." This approach transforms potential crises into learning opportunities, allowing enterprises to validate their recovery mechanisms incrementally and frequently without the overhead of full-scale simulations.
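The idea of using an external event as a structured trigger can be sketched in a few lines: given the date of an event and a log of restore validations, list the systems that have not been checked in the window that follows. This is an illustrative sketch only. The function name `systems_needing_check`, the 14-day window, and the sample data are assumptions, not part of the report or of any Keepit product.

```python
from datetime import date, timedelta


def systems_needing_check(event_date, restore_tests, systems, window_days=14):
    """List systems with no restore validation in the window after an event.

    restore_tests is a list of (system, date) tuples recording when a
    restore validation was run. Systems keep the order given in `systems`.
    """
    window_end = event_date + timedelta(days=window_days)
    validated = {
        system
        for system, tested_on in restore_tests
        if event_date <= tested_on <= window_end
    }
    return [system for system in systems if system not in validated]


# Hypothetical event date and test log, for illustration only.
event_date = date(2025, 3, 3)
restore_tests = [
    ("productivity", date(2025, 3, 5)),  # validated two days after the event
    ("identity", date(2025, 1, 20)),     # last validated before the event
]
systems = ["identity", "productivity", "crm"]

print(systems_needing_check(event_date, restore_tests, systems))
# ['identity', 'crm']
```

The output is a short to-do list rather than a full disaster recovery exercise, which matches the report's emphasis on short, repeatable validations.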

A key innovation suggested is the implementation of "guided recovery" enabled by the Model Context Protocol (MCP), an open protocol for connecting AI assistants to external tools and data sources. MCP essentially opens the door to "asking for help" in the moment that matters. An MCP-enabled assistant could help identify unhealthy tenants or suspicious patterns in protected data. More importantly, it could guide administrators through the correct recovery steps, turning a complex, high-stress recovery process into a manageable, repeatable workflow. This concept aligns with the broader industry trend towards AI-augmented IT operations, where AI tools assist human operators in navigating complex system landscapes and troubleshooting issues. Such an assistant could reduce human error, accelerate recovery times, and ensure adherence to established protocols, even for less experienced personnel.

"It all boils down to knowing who is in charge of recovery and which systems are restored first when multiple systems are affected," Larsen reiterated. "When decisions are delayed, recovery takes longer than necessary." This highlights the critical importance of clear leadership, predefined priorities, and streamlined decision-making processes during a crisis. A well-defined incident response plan, integrated with guided recovery tools, can significantly mitigate the impact of agentic AI failures.
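Deciding which systems are restored first can be settled before an incident by writing down what each system depends on and deriving the order from that. The sketch below uses Python's standard `graphlib.TopologicalSorter` for this. The system names and the `DEPENDS_ON` map are hypothetical and would need to reflect an organization's real architecture.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be restored first.
# Identity has no prerequisites because nothing else can authenticate without it.
DEPENDS_ON = {
    "identity": set(),
    "email": {"identity"},
    "file_storage": {"identity"},
    "ai_agents": {"identity", "file_storage"},
}


def restore_order(depends_on):
    """Return a restore sequence in which every system follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restore_order(DEPENDS_ON)
print(order)  # identity always comes first; ai_agents follows file_storage
```

Agreeing on a map like this in advance removes one source of the delayed decisions Larsen describes, since the ordering no longer has to be argued out during the incident.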

Broader Industry Implications and Future Outlook

The findings from the Keepit survey serve as a stark warning to the global enterprise community. As agentic AI becomes more pervasive, the risks associated with its failure modes will only intensify. The current state of recovery readiness, characterized by high confidence but insufficient testing and critical blind spots in IAM, is unsustainable. The industry must move towards a more mature and proactive approach to disaster recovery that specifically addresses the unique challenges posed by autonomous AI systems. This includes:

Mandating Frequent Testing: Regular, comprehensive testing of all recovery plans, including those for agentic AI and critical IAM systems, should become a non-negotiable standard.
Enhanced Data Governance: Establishing robust data governance frameworks that define accountability, control, and recovery protocols for all AI systems.
Investing in AI-Specific DR Tools: Adopting advanced tools and protocols, such as MCP-enabled guided recovery, to assist and automate aspects of the recovery process.
Continuous Education and Training: Equipping IT and security teams with the knowledge and skills necessary to manage and recover complex AI environments.
Learning from External Events: Developing mechanisms to proactively respond to "awareness moments" by triggering internal recovery checks and plan adjustments.

The journey towards true resilience in the age of agentic AI is not merely a technical challenge but a strategic imperative. Organizations that embrace proactive planning, rigorous testing, and innovative recovery solutions will be better positioned to navigate the inevitable disruptions and leverage the transformative power of AI securely and sustainably. The full report offers deeper insights and actionable recommendations, available on the Keepit site (registration required).
