Rolling restart of ESXi management agents using PowerCLI

Right after my (waaay too short) holidays ended I was faced with the issue described in this KB. Nothing really serious, but an annoyance that everybody wants to get rid of as soon as possible.
It didn’t take long before I realized that (in my case) the only effective workaround was to perform “a rolling restart of management agents on affected hosts” (it was a whole 16-host cluster).
Of course I wasn’t going to do that manually, especially since SSH was stopped by default in that cluster. PowerCLI is the ultimate answer to such repeatable tasks, but my initial enthusiasm was dampened for a while when I realized that the hostd daemon cannot be restarted with the Restart-VMHostService cmdlet…

I didn’t give up though, and decided to use the good ol’ plink.exe utility (which I remember from my Windows cmd scripting days 😮 ).

The script’s logic is pretty simple: gather all accessible ESXi hosts (where the agent restart needs to be performed), get the credentials needed to open an SSH session, and use plink.exe to feed that session the sequence of shell commands required to restart both hostd and vpxa. Oh, and for those who keep SSH stopped on their hosts – it starts and stops the TSM-SSH service as needed.

This “piece of cake code” goes as follows:

Unfortunately, this isn’t the kind of script that will download plink.exe for you if it can’t find it 😉 . You still need to satisfy some prerequisites to run it, alright?

First of all – this PowerCLI script requires plink.exe to be present in the same directory as the script itself. I used Release 0.63 of this software, but I’d risk saying that any version will do, since I’m using it in a very basic manner.

Secondly – a text file with the name defined by the $remote_command variable (in Line 110) is also expected in the script’s working directory. For restarting management agents I named this text file rstagtsqc.txt, and its contents look like this.
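The file’s contents are not reproduced in this post, so here is only a minimal sketch of how such a file could be created from PowerShell. It assumes the two standard ESXi init-script calls for hostd and vpxa; the file name follows the post, the commands themselves are my assumption of what the file held.

```powershell
# Assumption: the command file holds the two standard ESXi init-script calls.
# The original file contents are not shown in the post.
$remote_command = 'rstagtsqc.txt'   # file name used in the post

$lines = @(
    '/etc/init.d/hostd restart',
    '/etc/init.d/vpxa restart'
)

# Join with LF and write without a BOM, so the remote shell
# does not see stray carriage returns or marker bytes.
$body = ($lines -join "`n") + "`n"
Set-Content -Path $remote_command -Value $body -NoNewline -Encoding Ascii
```

Swapping the entries in $lines is all it takes to reuse the same file for a different command sequence.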

I’m using very basic PowerCLI here; nothing really exciting happens until Line 165, where the script pops up a “standard” authentication dialog to ask you for the username and password that will be used to open an SSH session to each ESXi host in the selected location (typically a host cluster, but even a whole datacenter is possible). Please note that this has to be the same username/password pair for every host, so preferably some shell-enabled service account you have created on every host (I don’t want to use the r-word 😉 ). If your hosts are joined to Active Directory, this can in fact be any member of the “ESX Admins” group (or equivalent, so… your account? 😉 )
Once the credentials are provided, the script needs to convert the password, stored as a SecureString, to the plain text that plink.exe will use. This magic happens between Line 168 and Line 173. As you can see, the memory where this conversion happened is zeroed in Line 173, right after the password is copied to the $decrypt_pass variable.
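A minimal sketch of this kind of conversion, using the .NET Marshal class, is below. The $decrypt_pass name comes from the post; $secure, $bstr and the example password are illustrative, and the original script may do this differently.

```powershell
# Stand-in for (Get-Credential).Password so the sketch runs non-interactively.
# The password value is a made-up example.
$secure = ConvertTo-SecureString 'example-password' -AsPlainText -Force

# Copy the SecureString into an unmanaged BSTR buffer...
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($secure)
try {
    # ...read it back as a regular .NET string...
    $decrypt_pass = [System.Runtime.InteropServices.Marshal]::PtrToStringBSTR($bstr)
}
finally {
    # ...and zero and free the unmanaged buffer, even if the copy failed.
    [System.Runtime.InteropServices.Marshal]::ZeroFreeBSTR($bstr)
}
```

Note that zeroing the buffer only clears the unmanaged copy; the plain-text string in $decrypt_pass lives on until the variable is removed.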

The $command variable is just a string that includes the invocation of plink.exe together with our options: the name of the text file with the shell command sequence, the SSH credentials and, last but not least, the name of the ESXi host we are currently connecting to in the foreach loop. The only tricky part here is “echo Y | ” at the beginning of the $command string. It just passes “Y” through a pipe as confirmation for plink.exe that we agree to accept the host’s RSA2 key (in case you never connected with PuTTY from your workstation to this host before).
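To illustrate, such a string could be assembled roughly as follows. This is a sketch under assumptions: the user name, password and host name are made-up values, and the exact set of plink.exe options in the original script is not shown in the post; -ssh, -l, -pw and -m are standard plink.exe options.

```powershell
# Hypothetical values for illustration only.
$remote_command = 'rstagtsqc.txt'        # file with the shell commands
$user           = 'svc-esxi'             # assumed service account name
$decrypt_pass   = 'example-password'     # plain-text password from the previous step
$vmhost         = 'esx01.example.local'  # assumed host name

# "echo Y |" answers plink's host-key prompt on first contact.
# -l = login name, -pw = password, -m = file with commands to run remotely.
$command = 'echo Y | .\plink.exe -ssh -l {0} -pw "{1}" -m {2} {3}' -f `
    $user, $decrypt_pass, $remote_command, $vmhost

# Running it would look roughly like this
# (commented out: it needs plink.exe and a reachable host):
# $output = cmd.exe /c $command 2>&1
```

A password containing double quotes would break this simple quoting, so treat the sketch as a starting point rather than something hardened.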

If this PowerCLI script had to start the SSH service before attempting to restart the management agents ($flag variable), it will try to be nice and stop it before moving to the next host. I needed a try-catch sequence for this between Line 195 and Line 205, simply because you can’t change anything on the host (via vCenter – and we are connected to vCenter, right?) until the vpxa daemon is up and running (and we’d just restarted it, right?).
Attempts to stop SSH while vpxa is not (yet) running will generate “503 – Service unavailable” errors, which you will (also) be able to see in your vSphere client, but it is safe to ignore them. Well OK, it is safe to ignore one or two errors of this kind per host. If you see a series of them for a single host, it means something has gone terribly wrong and you need to take manual action (the script will not proceed until it is able to execute this step).
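The retry behaviour described above can be sketched as a small helper that keeps calling a script block until it stops throwing. Invoke-WithRetry and its parameters are hypothetical names of mine, not part of the original script or of PowerCLI.

```powershell
# Hypothetical helper: repeat an action until it succeeds.
# Returns the number of attempts it took.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [int] $DelaySeconds = 10
    )
    $attempt = 0
    while ($true) {
        $attempt++
        try {
            # Discard the action's own output so only the count is returned.
            $null = & $Action
            return $attempt
        }
        catch {
            # Only terminating errors land here, hence -ErrorAction Stop below.
            Write-Warning "Attempt $attempt failed: $($_.Exception.Message)"
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Possible usage against a host (commented out: needs PowerCLI and a vCenter session):
# Invoke-WithRetry -Action {
#     Get-VMHostService -VMHost $vmhost |
#         Where-Object { $_.Key -eq 'TSM-SSH' } |
#         Stop-VMHostService -Confirm:$false -ErrorAction Stop
# }
```

Like the script in this post, the helper loops for as long as the action keeps failing, so a host that never recovers still needs a human to step in.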

The single 30-second sleep in Line 192 is there for a very similar reason. Call me old-school, but I prefer to give the management agents some time to restart instead of shooting them down across the whole cluster at once 🙂

As usual with the scripts I write, this one tries to log its steps. The plink.exe output (if any) is also gathered into the $output variable and saved to a file before exiting. You can even start a PowerShell transcript for the whole PowerCLI session if you like.

To avoid any risk of “plain-text passwords left stored in memory” I explicitly remove both variables where the password was stored (Line 211 and Line 212).

In this post I described a method of using PowerCLI together with the plink.exe utility to remotely restart management agents on a set of ESXi hosts. By just modifying the contents of the text file indicated by the $remote_command variable, you can use the script described here as a “framework” to remotely execute sequences of shell commands (so shell scripts!) not only on your ESXi hosts, but in fact on anything that you can connect to via SSH (so Linux servers, network equipment, UPS devices, server enclosures… sky is (almost) the limit).

I hope you will find this post useful – feel free to share and/or provide your feedback!

Sebastian Baryło

Successfully jumping to conclusions since 2001.