Right after my (waaay too short) holidays finished I was faced with the issue described in this KB. Nothing really serious, but an annoyance that everybody wants to get rid of as soon as possible.
It didn’t take very long before I realized that (in my case) the only effective workaround was to perform “a rolling restart of management agents on affected hosts” (it was a whole 16 host cluster in my case).
Of course I wasn’t going to do that manually, especially since SSH was stopped by default in that cluster. PowerCLI is the ultimate answer to such repeatable tasks, but my initial enthusiasm was held back for a while when I realized the hostd daemon cannot be restarted with the Restart-VMHostService cmdlet…
I didn’t give up though and decided to use good ol’ plink.exe utility (that I remember back from Windows cmd scripting days 😮 ).
The script’s logic is pretty simple: gather all accessible ESXi hosts (where the agent restart needs to be performed), get the credentials needed to open an SSH session, and use plink.exe to feed SSH the sequence of shell commands required to restart both hostd and vpxa. Oh, and for those who keep SSH stopped on their hosts: start and stop the TSM-SSH service as needed.
This “piece of cake code” goes as follows:
#requires -version 2
<#
.SYNOPSIS
    Script to execute "a rolling restart" of management agents on ESXi hosts in selected vi container (cluster, folder, datacenter)
.DESCRIPTION
    The script connects to vCenter indicated as parameter then searches for all accessible ESXi hosts in vi container (typically a cluster) indicated as second parameter.
    Subsequently credentials required to open SSH connection to all hosts are gathered as user input.
    It is assumed that SSH service is stopped on each ESXi host, so the script first starts it up,
    then establishes SSH connection using plink.exe executable (expected to be saved in script's working directory).
    A sequence of shell commands to restart hostd and vpxa agents is executed on each host.
    This sequence is saved in text file rstagtsqc.txt that is expected to be saved in script's working directory.
.PARAMETER vCenterServer
    Mandatory parameter indicating vCenter server to connect to (FQDN or IP address).
.PARAMETER Location
    Mandatory parameter indicating location (vi container) where management agents need to be restarted (typically a host cluster).
.EXAMPLE
    restart_mgmt_agents.ps1 -vCenterServer vcenter.seba.local -Location Test-Cluster
    vCenter server indicated as FQDN.
.EXAMPLE
    restart_mgmt_agents.ps1 -vcenter 10.0.0.1 -location production-cluster
    vCenter server indicated as IP address.
.EXAMPLE
    restart_mgmt_agents.ps1
    Script will interactively ask for both mandatory parameters.
#>

[CmdletBinding()]
Param(
    [Parameter(Mandatory=$True,Position=1)]
    [ValidateNotNullOrEmpty()]
    [string]$vCenterServer,

    [Parameter(Mandatory=$True,Position=2)]
    [ValidateNotNullOrEmpty()]
    [string]$Location
)

Function Write-And-Log {

    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$True,Position=1)]
        [ValidateNotNullOrEmpty()]
        [string]$LogFile,

        [Parameter(Mandatory=$True,Position=2)]
        [ValidateNotNullOrEmpty()]
        [string]$line,

        [Parameter(Mandatory=$False,Position=3)]
        [int]$Severity=0,

        [Parameter(Mandatory=$False,Position=4)]
        [string]$type="terse"
    )

    $timestamp = (Get-Date -Format ("[yyyy-MM-dd HH:mm:ss] "))
    $ui = (Get-Host).UI.RawUI

    #severity selects the colour and prefix; errors are always logged to file as well
    switch ($Severity) {
        {$_ -gt 0} {$ui.ForegroundColor = "red"; $type ="full"; $LogEntry = $timestamp + ":Error: " + $line; break;}
        {$_ -eq 0} {$ui.ForegroundColor = "green"; $LogEntry = $timestamp + ":Info: " + $line; break;}
        {$_ -lt 0} {$ui.ForegroundColor = "yellow"; $LogEntry = $timestamp + ":Warning: " + $line; break;}
    }

    switch ($type) {
        "terse"   {Write-Output $LogEntry; break;}
        "full"    {Write-Output $LogEntry; $LogEntry | Out-file $LogFile -Append; break;}
        "logonly" {$LogEntry | Out-file $LogFile -Append; break;}
    }

    $ui.ForegroundColor = "white"
}

#constants

#variables
$ScriptRoot = Split-Path $MyInvocation.MyCommand.Path
$StartTime = Get-Date -Format "yyyyMMddHHmmss_"
$logdir = $ScriptRoot + "\RestartMgmtAgentsLogs\"
$logfilename = $logdir + $StartTime + "restart_mgmt_agents.log"
$transcriptfilename = $logdir + $StartTime + "restart_mgmt_agents_Transcript.log"
$outputfilename = $ScriptRoot + "\invoke_plink_output.txt"
$plink = $ScriptRoot + "\plink.exe"
$remote_command = $ScriptRoot + "\rstagtsqc.txt"
$total_errors = 0
$total_vmhosts = 0
$index_vmhosts = 0

#test for log directory, create one if needed
if ( -not (Test-Path $logdir)) {
    New-Item -type directory -path $logdir 2>&1 > $null
}

#start PowerShell transcript... or don't do it...
#Start-Transcript -Path $transcriptfilename

#load PowerCLI snap-in
$vmsnapin = Get-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue
$Error.Clear()
if ($vmsnapin -eq $null) {
    Add-PSSnapin VMware.VimAutomation.Core
    if ($error.Count -eq 0) {
        write-and-log $logfilename "PowerCLI VimAutomation.Core Snap-in was successfully enabled." 0 "full"
    }
    else{
        write-and-log $logfilename "Could not enable PowerCLI VimAutomation.Core Snap-in, exiting script." 1 "full"
        Exit
    }
}
else{
    write-and-log $logfilename "PowerCLI VimAutomation.Core Snap-in is already enabled." 0 "full"
}

#check PowerCLI version
if (($vmsnapin.Version.Major -gt 5) -or (($vmsnapin.version.major -eq 5) -and ($vmsnapin.version.minor -ge 1))) {

    #assume everything is OK at this point
    $Error.Clear()

    #connect vCenter from parameter
    Connect-VIServer -Server $vCenterServer -ErrorAction SilentlyContinue 2>&1 > $null

    #execute only if connection successful
    if ($error.Count -eq 0){

        #measuring execution time is really hip these days
        $stop_watch = [Diagnostics.Stopwatch]::StartNew()

        #use previously defined function to inform what is going on, anything else than "terse" will cause the message to be written both in logfile and to screen
        Write-And-Log $logfilename "vCenter $vCenterServer successfully connected." $error.count "full"

        #get the vmhosts in location
        $vmhosts_in_cluster = get-vmhost -Location $location | where-object {$_.connectionstate -eq "Connected"}

        if ($vmhosts_in_cluster) {

            #wrap in @() so that a single host still yields a count of 1 on PowerShell 2
            $total_vmhosts = @($vmhosts_in_cluster).count

            #gather credentials to SSH to these hosts
            $cred = $Host.UI.PromptForCredential("ESX Host access credentials","Please provide credentials for SSH to ESXi hosts in $Location","","")

            #convert SecureString password to plain-text
            $pointer = [Runtime.InteropServices.Marshal]::SecureStringToBSTR($cred.password)
            $decrypt_pass = [Runtime.InteropServices.Marshal]::PtrToStringAuto($pointer)

            #and pretend we weren't there at all
            [System.Runtime.InteropServices.Marshal]::ZeroFreeBSTR($pointer)

            $plink_opts = "-l $($cred.username) -pw $decrypt_pass -m $remote_command "
            $invoke_plink = "echo Y | " + $plink + " " + $plink_opts

            foreach ($vmhost in $vmhosts_in_cluster) {

                write-progress -Activity "Restarting management agents for $Location container" -Status "Percent complete $("{0:N2}" -f (($index_vmhosts / $total_vmhosts) * 100))%" -PercentComplete (($index_vmhosts / $total_vmhosts) * 100) -CurrentOperation "Processing vSphere host: $($vmhost.name)"

                #start SSH service if needed
                $ssh_service = get-vmhostservice -vmhost $vmhost | where-object { $_.Key -eq "TSM-SSH"}
                if ( -not $ssh_service.Running){
                    Start-vmhostservice -hostservice $ssh_service -confirm:$false 2>&1 > $null
                    $flag = $true
                }

                #call plink with command sequence prepared earlier.
                $command = $invoke_plink + $vmhost.name
                $output += invoke-expression -command $command 2>&1
                Start-Sleep -seconds 30

                #stop SSH service if we had started it (the loop is there to wait for vpxa to restart)
                while ($flag){
                    $flag = $false
                    try {
                        Stop-Vmhostservice -hostservice $ssh_service -confirm:$false -ErrorAction Stop 2>&1 > $null
                    }
                    catch {
                        $flag = $true
                        start-sleep -seconds 30
                    }
                }
                $index_vmhosts++
            }

            #only paranoid will survive, so lets clear all the traces of plaintext password we had
            remove-variable -name plink_opts
            remove-variable -name decrypt_pass
            $output | out-file -filepath $outputfilename
        }
        else {
            write-and-log $logfilename "There are no VMHosts connected to $Location VI-Container." 1 "full"
            $total_errors += 1
        }

        $stop_watch.Stop()
        $elapsed_seconds = ($stop_watch.elapsedmilliseconds)/1000

        #farewell message before disconnect
        Write-And-Log $logfilename "Management agents successfully restarted for $index_vmhosts hosts in $Location VI-Container." $total_errors "full"
        Write-And-Log $logfilename "Script took $("{0:N2}" -f $elapsed_seconds)s to execute, exiting." -1 "full"

        #disconnect vCenter
        Disconnect-VIServer -Confirm:$false -Force:$true
    }
    else{
        Write-And-Log $logfilename "Error connecting vCenter server $vCenterServer, exiting." $error.count "full"
    }
}
else {
    write-and-log $logfilename "This script requires PowerCLI 5.1 or greater to run properly." 1 "full"
}

#Stop-Transcript ...well, if you had started it
Unfortunately, this isn’t the kind of script that will download plink.exe for you if it can’t find it 😉 . You still need to satisfy some prerequisites to run it, alright?
First of all – this PowerCLI script requires the plink.exe program to be present in the same directory where the script itself is located. I used Release 0.63 of this software, but I’d risk the statement that any version will do, since I’m using it in a very basic manner.
Secondly – a text file with the name defined by the $remote_command variable (in Line 110) is also expected in the script’s working directory. For restarting management agents I named this text file rstagtsqc.txt and its contents look like this:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
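Both prerequisites can be checked before anything touches vCenter. The following is only a minimal sketch: `Get-MissingPrerequisite` is a hypothetical helper that is not part of the script above, and it assumes the two file names used in this post.

```powershell
# Hypothetical helper (not part of the original script): emits the names of
# required files that are missing from the given directory.
function Get-MissingPrerequisite {
    param(
        [Parameter(Mandatory=$true)][string]$Directory,
        # File names taken from this post; adjust if you renamed them
        [string[]]$RequiredFile = @('plink.exe', 'rstagtsqc.txt')
    )
    foreach ($name in $RequiredFile) {
        $path = Join-Path -Path $Directory -ChildPath $name
        if (-not (Test-Path -Path $path -PathType Leaf)) {
            $name   # write the missing file name to the pipeline
        }
    }
}

# Usage sketch: stop early when something is missing
# $missing = @(Get-MissingPrerequisite -Directory $ScriptRoot)
# if ($missing.Count -gt 0) { Write-Warning ("Missing: " + ($missing -join ', ')); Exit }
```

Wrapping the call in `@()` keeps `$missing.Count` meaningful on PowerShell 2, where a single returned string has no `Count` property.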
I’m using very basic PowerCLI here; nothing really exciting happens until Line 165 (the $Host.UI.PromptForCredential call), where the script will pop up a “standard” authentication dialog to ask you for the username and password that will be used to open an SSH session to each ESXi host in the selected location (typically a host cluster, but even a whole datacenter is possible). Please note that this has to be the same username/password pair for every host, so preferably some shell-enabled service account you have created on every host (I don’t want to use the r-word 😉 ). If your hosts are connected to Active Directory, that can in fact be any member of the “ESX Admins” group (or equivalent, so… your account? 😉 )
Once the credentials are provided, the script needs to convert the password, stored as a SecureString, to plain text that plink.exe can use. This magic happens between Line 168 and Line 173 (the SecureStringToBSTR, PtrToStringAuto and ZeroFreeBSTR calls). As you can see, the memory where this conversion happened is zeroed in Line 173, right after the password is copied to the $decrypt_pass variable.
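The same conversion can be wrapped in a function so the unmanaged buffer is zeroed even if reading it throws. This is a sketch, not the script's own code: the function name is hypothetical, and it swaps the script's `PtrToStringAuto` for `PtrToStringBSTR`, which reads the BSTR using its length prefix.

```powershell
# Hypothetical helper: converts a SecureString to a plain-text string and
# always zeroes and frees the intermediate unmanaged buffer.
function ConvertFrom-SecureStringToPlainText {
    param(
        [Parameter(Mandatory=$true)][System.Security.SecureString]$SecureString
    )
    $pointer = [Runtime.InteropServices.Marshal]::SecureStringToBSTR($SecureString)
    try {
        # Copy the BSTR contents into a managed string
        [Runtime.InteropServices.Marshal]::PtrToStringBSTR($pointer)
    }
    finally {
        # Runs on success and on failure alike
        [Runtime.InteropServices.Marshal]::ZeroFreeBSTR($pointer)
    }
}

# Usage sketch, with $cred obtained from the credential prompt:
# $decrypt_pass = ConvertFrom-SecureStringToPlainText -SecureString $cred.Password
```

Note that the returned managed string is still plain text in memory; only the intermediate BSTR buffer is wiped.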
The $command variable is just a string that holds the plink.exe invocation together with our options: the name of the text file with the shell command sequence, the SSH credentials and, last but not least, the name of the ESXi host we are currently connecting to in the foreach loop. The only tricky part here is “echo Y | ” at the beginning of the $command string. It simply passes “Y” through a pipe as confirmation for plink.exe that we agree to accept the host’s RSA2 key (in case you never connected with PuTTY from your workstation to this host before).
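Building the command as one string and running it through Invoke-Expression breaks as soon as the script path or the password contains a space or a quote. A possible alternative, sketched below under that assumption, is to build the arguments as an array and use the call operator. `New-PlinkArgumentList` is a hypothetical name; the `-l`, `-pw` and `-m` options are the same ones the script already passes to plink.exe.

```powershell
# Hypothetical helper: returns the plink.exe arguments as separate array
# elements, so PowerShell passes each one on without re-parsing a string.
function New-PlinkArgumentList {
    param(
        [Parameter(Mandatory=$true)][string]$UserName,
        [Parameter(Mandatory=$true)][string]$Password,
        [Parameter(Mandatory=$true)][string]$CommandFile,
        [Parameter(Mandatory=$true)][string]$HostName
    )
    # -l login name, -pw password, -m file with the remote commands, then the host
    @('-l', $UserName, '-pw', $Password, '-m', $CommandFile, $HostName)
}

# Usage sketch inside the foreach loop (assumes $plink, $cred, $decrypt_pass,
# $remote_command and $vmhost exist as in the script):
# $plinkArgs = @(New-PlinkArgumentList -UserName $cred.UserName -Password $decrypt_pass `
#                  -CommandFile $remote_command -HostName $vmhost.Name)
# $output += 'Y' | & $plink $plinkArgs 2>&1
```

The piped 'Y' plays the same role as “echo Y | ” in the original string.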
If this PowerCLI script had to start the SSH service before attempting the restart of management agents (the $flag variable), it will try to be nice and stop it before moving to the next host. I needed a try-catch sequence for this between Line 195 and Line 205 (the while ($flag) loop), simply because you can’t change anything on the host (via vCenter – and we are connected to vCenter, right?) until the vpxa daemon is up and running (and we’d just restarted it, right?).
Attempts to stop SSH while vpxa is not (yet) running will generate “503 – Service unavailable” errors that you will (also) be able to see in your vSphere client, but it is safe to ignore them. Well OK, it is safe to ignore one or two errors of this kind per host. If you see a series of them for a single host, it means something has gone terribly wrong and you need to take manual action (the script will not proceed until it is able to execute this step).
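If waiting forever on one broken host is not acceptable, the retry can be capped. The sketch below is an alternative to the script's open-ended while loop, not what the script does: `Invoke-WithRetry`, its parameter names and the default of 5 attempts are all assumptions for illustration.

```powershell
# Hypothetical helper: runs a script block until it succeeds or the attempt
# limit is reached. Returns $true on success, $false when all attempts failed.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory=$true)][scriptblock]$Action,
        [int]$MaxAttempts = 5,
        [int]$DelaySeconds = 30
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # Discard the action's own output so only the boolean is returned
            $null = & $Action
            return $true
        }
        catch {
            # Wait before the next attempt, but not after the last one
            if ($attempt -lt $MaxAttempts) { Start-Sleep -Seconds $DelaySeconds }
        }
    }
    return $false
}

# Usage sketch (assumes $ssh_service, $vmhost and $logfilename as in the script):
# $stopped = Invoke-WithRetry -Action {
#     Stop-VMHostService -HostService $ssh_service -Confirm:$false -ErrorAction Stop
# }
# if (-not $stopped) { Write-And-Log $logfilename "Could not stop SSH on $($vmhost.Name)." 1 "full" }
```

The `-ErrorAction Stop` in the usage sketch matters: without it the 503 error would not be terminating and the catch block would never run.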
The single 30-second sleep in Line 192 (the Start-Sleep right after the plink.exe call) is there for a very similar reason. Call me old-school, but I prefer to give management agents some time to restart instead of shooting them right through the whole cluster at once 🙂
As usual with the scripts I write, this one tries to log its steps. The plink.exe output (if any) is gathered into the $output variable and saved to a file before exiting, and you can even start a PowerShell transcript for the whole PowerCLI session if you like.
To reduce the risk of “plain-text passwords left stored in memory” I explicitly remove both variables where the password was stored ($plink_opts and $decrypt_pass, Line 211 and Line 212).
In this post I described a method of using PowerCLI together with the plink.exe utility to remotely restart management agents on a set of ESXi hosts. By just modifying the contents of the text file indicated by the $remote_command variable you can use the script described here as a “framework” to remotely execute sequences of shell commands (so shell scripts!) not only on your ESXi hosts, but in fact on anything that you can connect to via SSH (so Linux servers, network equipment, UPS devices, server enclosures… sky is (almost) the limit).
I hope you will find this post useful – feel free to share and/or provide your feedback!