Wednesday, November 8, 2023

Behind the Scenes: Terraform's Deletion and the Mysterious Auto-Restoration of Azure AD Enterprise Apps

Context:

A few weeks ago, an unexpected situation unfolded in a customer's production environment. It all started when a member of their team decided to pull the trigger on the "terraform destroy" command.

Their intention was to remove a specific app registration from Azure AD that they had deployed with a Terraform package. Little did they know that this untested action would set off a chain of events nobody could have predicted: the command ended up deleting several of Microsoft's first-party enterprise applications from the Azure tenant as well.

Issue:

This left the production environment in chaos. You might be wondering why this matters. It matters because those first-party Microsoft enterprise applications are the backbone of many services within the Azure AD tenant. Their sudden deletion caused disruptions throughout the environment, affecting numerous other apps and programs that may have been relying on them.

Troubleshooting & Terraform quirk:

After things calmed down and everyone understood the consequences, the team started looking into what had happened. Our mission was clear: solve the mystery of why the "destroy" command had such far-reaching consequences, affecting not only the intended target but also other critical apps in the Azure AD environment.

While the production environment was being brought back to its normal state by manually restoring the deleted enterprise apps, we decided to re-create the scenario in a safe Microsoft demo tenant. Our experiment reproduced the problem reliably, confirming the destructive behavior. However, it neither resolved the ongoing issue nor answered the "why".

Later, we stumbled upon an issue reported against Terraform's Azure AD provider that describes this behavior of the destroy command and appears to be a bug. If you have used the setting "use_existing=true" in your Terraform code, as shown in the example below, to link your Azure AD app to other service principals (SPNs), the destroy command goes on a rampage, deleting not only your app but also every linked SPN it can find, regardless of its origin. In this case, that included even Microsoft's first-party enterprise applications such as SharePoint Online, Exchange Online, Intune, and MS Teams.

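# Look up the application IDs of well-known Microsoft first-party apps
data "azuread_application_published_app_ids" "well_known" {}

# use_existing = true makes Terraform adopt the already existing service principal
resource "azuread_service_principal" "sharepoint" {
  application_id = data.azuread_application_published_app_ids.well_known.result.Office365SharePointOnline
  use_existing   = true
}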

That answers the question of "why".
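If the configuration only needs to reference a first-party service principal (for example, to read its object ID), one possible way to sidestep the problem is to look it up through a data source instead of managing it as a resource. The following is a minimal sketch under that assumption, not something taken from the incident itself. It relies on the provider's "azuread_service_principal" data source; the output name is hypothetical, and newer provider versions name the argument "client_id" rather than "application_id". Note that the lookup fails if the service principal does not already exist in the tenant.

```hcl
# Look up the application IDs of well-known Microsoft first-party apps.
data "azuread_application_published_app_ids" "well_known" {}

# Read-only lookup: Terraform does not manage this service principal,
# so "terraform destroy" has nothing to delete here.
data "azuread_service_principal" "sharepoint" {
  application_id = data.azuread_application_published_app_ids.well_known.result.Office365SharePointOnline
}

# Hypothetical output, only to show how the looked-up object can be referenced.
output "sharepoint_sp_object_id" {
  value = data.azuread_service_principal.sharepoint.object_id
}
```

Because data sources are only read, they never appear as deletions in a destroy plan, which is the property that matters here.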

Auto-restore Puzzle:

While re-creating the scenario in one of our internal demo tenants, we stumbled upon a surprising observation: some of the enterprise apps we believed were gone for good started reappearing without any manual restore operation, which left us intrigued.

After running out of guesses, we reached out to Azure support for answers. They confirmed that automatic restoration exists and explained how it behaves.

When a user accesses services like MS Teams, Exchange, SharePoint, or OneDrive in the tenant, Microsoft's underlying services try to use the corresponding first-party enterprise apps. If they detect that any of these apps is missing or deleted, they restore it automatically.

While this is generally the case for most apps, a few don't follow the rule. Microsoft Intune API, for example, did not join in this magical recovery process, and who knows, there might be more apps with similar behavior hiding within Azure AD's depths.

Learnings:

So, what's the moral of this story? The lesson is simple: before running any command in your production environment, especially one whose name sounds a little scary, think twice or thrice.
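One way to turn "think twice" into something Terraform enforces is the core "prevent_destroy" lifecycle setting. The sketch below applies it to the same kind of resource discussed earlier; it is an illustration under the assumption that you still want to manage the service principal as a resource, and it is not a fix for the provider behavior itself.

```hcl
data "azuread_application_published_app_ids" "well_known" {}

resource "azuread_service_principal" "sharepoint" {
  application_id = data.azuread_application_published_app_ids.well_known.result.Office365SharePointOnline
  use_existing   = true

  lifecycle {
    # Terraform rejects any plan that would destroy this resource,
    # including a plain "terraform destroy".
    prevent_destroy = true
  }
}
```

With this in place, a destroy run stops with an error instead of deleting the service principal. The guard only works while the resource block remains in the configuration, so it is still worth reviewing the output of "terraform plan -destroy" before running the real thing.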

This post also attempts to demystify the undocumented automatic restoration behavior of Azure AD enterprise apps. We hope it helps anyone who is equally surprised to see their enterprise apps restored automatically without any manual action.