From time to time, for whatever reason… like, oh… say… your solution contains multiple projects, you decide to upgrade the cloud service project from Azure SDK version x to the latest one, and, for whatever reason, one or more projects that are not your Web / Worker Role project reference a specific Azure SDK component… Guess what… you might run into this (Role perma-recycling fun) after deploying to Azure, even though it works just fine when tested on the development fabric.
You know what, sometimes Azure baffles me. I had previously upgraded my Azure SDK, deployed to the cloud and had it running just fine on one deployment; then, on another deployment some time later, it just refused to start and began the recycling crap… and this is not a staging-slot-to-production-slot swap thing either; it was a direct deployment onto the production slot. I haven’t done a deep dive into Azure internals to understand what goes on during a deployment, but I think there are optimization steps (to make deployment faster, perhaps) where the same VM sometimes gets reused. When that VM still has residual DLLs from a previous deployment, everything works just fine even after the SDK update, even though you forgot to update the outdated DLL… and sometimes Azure just says, “… meh… screw you… I’ll drop your bits on a totally new VM and guess what, HAHA!! The bits that you think will be there are no longer there, so there…” and the recycling crap happens.
Is this what happened? I don’t know… maybe. Here is some insight into this, perhaps (or not): http://blog.smarx.com/posts/what-happens-when-you-deploy-on-windows-azure. Step 2 is a bit ambiguous… Maybe Azure Friday will invite someone to explain how this thing works internally? *Nudge nudge wink wink*
You know what… this reminds me of something else… Sometimes, when my Continuous Deployment build gets processed by Visual Studio Online, I get this annoying “Cannot copy such and such blah” error during the build, and the build craps out and stops right there. FAIL! And all I need to do is re-queue the build and guess what… it builds and deploys successfully…
What the !(*&$&!#!!! My theory… there is a build agent machine that is not configured right… and I’m just unlucky enough to get sent to that agent from time to time.
*End of more digression*
Geez… what is this… Digression Inception?
If you are a newb at Azure like I was previously, you’ll be scratching your head for a looong looong time trying to figure out what the heck is going on…
It worked on my machine… Damn it…
*pull hair* … oh no *bald spot forming*.
After running into this issue a couple of times and searching for answers on Stack Overflow and whatnot… you’ll finally realize that you have a missing reference and the Web / Worker Role just can’t start your application because of it, which I think is the most likely culprit anyhow…
So, how do you go about troubleshooting this?
Remote Desktop is Your Friend
One easy way to do this is via Remote Desktop.
You DID configure remote desktop on that role, did you not…?
No? You dummy you, go do it now!
… … …
You DO know how to configure remote desktop for your role, do you not?
No? *sigh* Read this.
I’d recommend creating your own client certificate to use for this and CHECKING IT INTO YOUR SOURCE REPOSITORY. You never know when you’ll need the certificate again… for example, when you need to redeploy the cloud service to a different Azure subscription and the original certificate is nowhere to be found because, say, the developer who created it on his / her machine left the company and the machine got wiped, and YAY… no copy of the certificate anywhere… It happened… True story.
So, now that that’s out of the way, go launch Remote Desktop and log in…
What? You don’t remember the password for the VM? Are you kidding me? *sigh*
Go read this and follow the instructions on how to reconfigure it…
Event Viewer is Also Your Friend
Okay… now that you are logged in via Remote Desktop, go start Event Viewer. If you don’t know how to do this, go quit and sell ice cream from a truck… or ask your boss to hire a DevOps engineer… or go learn how to do it!! Google or Bing it, maybe… like “How to launch Event Viewer from Windows Server 2012 R2?”…
Got Event Viewer up? Good.
Now go open the Windows Logs node, drill into the Application node and find some errors (such as the one shown below…)
AHA! See that FileNotFoundException!!! I knew it, File Not Found… It’s the same thing that is shown in the Azure portal.
Yeah… but… what file?… Uhm…
What a freaking useless error message.
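Clicking through Event Viewer nodes gets tedious fast. The same error events can also be pulled from a command prompt inside the RDP session with the built-in `wevtutil qe` command. What follows is a minimal Python sketch that builds that command and prints the newest Error-level events from a log.

A few assumptions:
- Python may not be installed on the role VM. If it isn't, on a non-Windows machine the script just prints the command line to type yourself.
- `build_error_query` is a hypothetical helper name.
- The exact channel names of the Azure logs under Applications and Services Logs are not given here. List them with `wevtutil el` and pass the one you want as the first argument.

```python
import subprocess
import sys


def build_error_query(log_name: str = "Application", count: int = 20) -> list:
    """Build a wevtutil command line that prints the newest Error-level events from a log."""
    return [
        "wevtutil", "qe", log_name,  # qe = query events from the named log
        "/q:*[System[(Level=2)]]",   # XPath filter: Level 2 means Error
        "/f:text",                   # readable text instead of XML
        f"/c:{count}",               # return at most `count` events
        "/rd:true",                  # newest events first
    ]


if __name__ == "__main__":
    # Optional first argument: log name (defaults to the Application log).
    cmd = build_error_query(*sys.argv[1:2])
    if sys.platform == "win32":
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)
    else:
        # Not on Windows: just show the command you would run in the RDP session.
        print(" ".join(cmd))
```

Passing the arguments as a list (no shell) avoids having to quote the XPath filter. If you type the command by hand, you may need to wrap the `/q:` value in quotes.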
Okay… maybe we are looking in the wrong place… Let’s see what else is there…
Let’s try the Applications and Services Logs node… drill drill, aha… the Windows Azure/Diagnostics/Bootstrapper node… It’s related to starting up the role, right? Bootstrapping = startup, right? Must be here…
AHA! I see some errors.
… WT #(@$! CRuntimeClient: OnRoleStatusCallback #(*@#(*@#(*$ What the heck does that mean?
Yet more useless crap!
*sigh* Oh well, let’s continue mining…
Applications and Services Logs/Windows Azure… WELL… what do you know… More errors… and this time… we found GOLD… well, sort of…!
Could not load file or assembly ‘Microsoft.WindowsAzure.ServiceRuntime, Version=220.127.116.11…’ at XXXXXXX.Shared.Configuration….
Now we are going somewhere…
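The useful bits in that message are the assembly name and version. If you copy the event text out of Event Viewer into a file, a small Python sketch like this one can list every assembly named in a “Could not load file or assembly” message.

A few assumptions:
- The regex assumes the usual shape of that .NET message.
- It accepts both straight and curly quotes, since copied text can contain either.
- `find_missing_assemblies` is a hypothetical helper.

```python
import re
import sys
from typing import List, Optional, Tuple

# Assumed shape of the .NET load-failure message:
#   Could not load file or assembly 'Name, Version=1.2.3.4, Culture=..., PublicKeyToken=...'
# Both straight (') and curly (‘ ’) quotes are accepted.
_LOAD_FAILURE = re.compile(
    r"Could not load file or assembly ['‘](?P<name>[^,'’]+)"
    r"(?:,\s*Version=(?P<version>[\d.]+))?"
)


def find_missing_assemblies(log_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (assembly name, version or None) pairs in order of first appearance."""
    found: List[Tuple[str, Optional[str]]] = []
    for match in _LOAD_FAILURE.finditer(log_text):
        version = match.group("version")
        if version:
            version = version.rstrip(".")  # drop a sentence-ending period, if any
        item = (match.group("name").strip(), version)
        if item not in found:  # the same failure is usually logged many times
            found.append(item)
    return found


if __name__ == "__main__":
    # Usage: python find_missing.py exported_log.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
            for name, version in find_missing_assemblies(f.read()):
                print(name, version or "(version not shown)")
```

The version it reports is the one the role asked for. Compare it against the version your projects actually reference and package.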
Well, now that you know what it can’t find during startup, you just need to make sure that the DLL (or whatever component the role needs to start) is correctly packaged and deployed to Azure. Upgrade the DLL using NuGet, fix your references, and make sure the missing reference has Copy Local set to True so that it always gets copied to the bin folder.
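To catch this before deploying, you could scan your project files for Azure SDK references that are not explicitly marked Copy Local. In a classic .csproj, Copy Local = True is stored as a `<Private>True</Private>` child of the `<Reference>` element. When that child is absent, MSBuild decides on its own, and assemblies it resolves from the GAC on your dev machine are typically not copied.

The Python sketch below is a heuristic under those assumptions, not an official check:
- `references_without_copy_local` is a hypothetical helper.
- The `Microsoft.WindowsAzure` prefix is just a default you can change.
- It ignores NuGet `PackageReference` items.

```python
import sys
import xml.etree.ElementTree as ET
from typing import List, Union


def _local_name(tag: str) -> str:
    # Strip the MSBuild XML namespace, e.g. "{http://...}Reference" -> "Reference".
    return tag.rsplit("}", 1)[-1]


def references_without_copy_local(csproj_xml: Union[str, bytes],
                                  prefix: str = "Microsoft.WindowsAzure") -> List[str]:
    """List <Reference> names starting with `prefix` that lack <Private>True</Private>."""
    root = ET.fromstring(csproj_xml)
    flagged = []
    for ref in root.iter():
        if _local_name(ref.tag) != "Reference":
            continue
        # Include="Name, Version=..., Culture=..." -> keep just the assembly name.
        name = ref.get("Include", "").split(",")[0].strip()
        if not name.startswith(prefix):
            continue
        private = next((child for child in ref if _local_name(child.tag) == "Private"), None)
        if private is None or (private.text or "").strip().lower() != "true":
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Usage: python check_copy_local.py path\to\Project1.csproj path\to\Project2.csproj
    for path in sys.argv[1:]:
        with open(path, "rb") as f:  # bytes, so the XML encoding declaration is honored
            for name in references_without_copy_local(f.read()):
                print(f"{path}: {name} is not marked Copy Local = True")
```

Run it over every project in the solution, not just the Web / Worker Role project. The problem reference often lives in a shared class library.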
Check in your changes, then trigger your Continuous Delivery build, or package and redeploy manually, whatever your deployment style is… Hopefully it will work this time…
If it doesn’t…
Well, go figure it out. I gave you a clue on how to troubleshoot this… Do I have to hold your hand through the whole process? What are you, a kindergarten student? PHAW!