In the final stage of my preparation for the VCAP5-DCA exam (which I passed on September 1st, and yeah, I couldn’t help bragging a little about it 😉) I started to experience issues with SSO in my home lab.
Intermittently and for no obvious reason, SSO was refusing to cooperate and I wasn’t able to authenticate against vCenter. All I was left with was the following error message.
The quick and dirty workaround was to restart the SSO (Windows) service, but you’d all agree that’s not a professional approach. My problem was that I was really running out of time before the exam, and my home lab was really resource constrained (I had a suspicion that these problems were caused by too little RAM assigned to the SSO (Windows) server). I didn’t want to “waste time” on proper troubleshooting and I was afraid that the only solution would be to increase the RAM of the SSO virtual machine (which was simply impossible), so I decided to write a short PowerCLI script that would periodically check the SSO status (by simply attempting to connect to my vCenter) and restart the SSO service if needed.
Before I show you this script, let me stress that SSO is way too important a component of VMware infrastructure to accept even intermittent problems like the one I just described. If you’re having any issues with SSO (especially outside your home lab), you should troubleshoot it properly, preferably with the assistance of VMware support. In fact I do not recommend scripted solutions of this kind (check service status, restart if needed) for anything outside a home lab or for any period longer than the few days before your exam ;).
<#
.SYNOPSIS
    Connects vCenter every 10 to 12 minutes (randomized) to troubleshoot SSO
.DESCRIPTION
    trivia
.PARAMETER <paramName>
    n/a
.EXAMPLE
    trivia
    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -PSconsolefile "d:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& 'd:\tools\scripts\sso_watchdog.ps1'"
#>

Function Write-And-Log {
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$True,Position=1)]
        [ValidateNotNullOrEmpty()]
        [string]$LogFile,

        [Parameter(Mandatory=$True,Position=2)]
        [ValidateNotNullOrEmpty()]
        [string]$line,

        [Parameter(Mandatory=$False,Position=3)]
        [int]$Severity=0,

        [Parameter(Mandatory=$False,Position=4)]
        [string]$type="terse"
    )

    $timestamp = (Get-Date -Format ("[yyyy-MM-dd HH:mm:ss] "))
    $ui = (Get-Host).UI.RawUI

    #severity > 0 is an error (always logged to file), 0 is info, < 0 is a warning
    switch ($Severity) {
        {$_ -gt 0} {$ui.ForegroundColor = "red"; $type ="full"; $LogEntry = $timestamp + ":Error: " + $line; break;}
        {$_ -eq 0} {$ui.ForegroundColor = "green"; $LogEntry = $timestamp + ":Info: " + $line; break;}
        {$_ -lt 0} {$ui.ForegroundColor = "yellow"; $LogEntry = $timestamp + ":Warning: " + $line; break;}
    }

    #"terse" = console only, "full" = console and file, "logonly" = file only
    switch ($type) {
        "terse"   {Write-Output $LogEntry; break;}
        "full"    {Write-Output $LogEntry; $LogEntry | Out-file $LogFile -Append; break;}
        "logonly" {$LogEntry | Out-file $LogFile -Append; break;}
    }

    $ui.ForegroundColor = "white"
}

#variables
$ScriptRoot = Split-Path $MyInvocation.MyCommand.Path
$StartTime = Get-Date
$StartTimeStr = $StartTime.ToString("yyyyMMddHHmmss_")
$logdir = $ScriptRoot + "\SSOCheckLogs\"
$transcriptfilename = $logdir + $StartTimeStr + "sso-check_Transcript.log"
$logfilename = $logdir + $StartTimeStr + "sso-check.log"
$vcenter_srv = "vcenter.seba.local"
[int]$interval = 600
[int]$delta = 0
[long]$script_uptime = 0

#test for log directory, create if needed
if ( -not (Test-Path $logdir)) {
    New-Item -type directory -path $logdir | out-null
}

#start PowerShell transcript
#Start-Transcript -Path $transcriptfilename

Write-And-Log $logfilename "SSO watchdog startup" 0 "full"

#load PowerCLI snap-in
$vmsnapin = Get-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue
$Error.Clear()
if ($vmsnapin -eq $null) {
    Add-PSSnapin VMware.VimAutomation.Core
    if ($error.Count -eq 0) {
        write-and-log $logfilename "PowerCLI VimAutomation.Core Snap-in was successfully enabled." 0 "full"
    }
    else{
        write-and-log $logfilename "Could not enable PowerCLI VimAutomation.Core Snap-in, exiting script" 1 "full"
        Exit
    }
}
else{
    write-and-log $logfilename "PowerCLI VimAutomation.Core Snap-in is already enabled" 0 "full"
}

#$creds = $Host.UI.PromptForCredential("vCenter authentication dialog","Please provide credentials for $vcenter_srv", "", "")

while ($true) {
    $error.clear()

    #let's rotate logs, just because we can and let's rotate them based on internal counter and not checking the date all the time ;)
    if ($script_uptime -gt 86400) {
        write-and-log $logfilename "Script running for more than 24 hours, rotating the log" 0 "full"
        rename-item -Path $logfilename -NewName $("Archive-" + $(Split-Path -leaf $logfilename))
        $timestampStr = get-date -Format "yyyyMMddHHmmss_"
        $logfilename = $logdir + $timestampStr + "sso-check.log"
        write-and-log $logfilename "Log successfully rotated" 0 "full"
        $script_uptime = 0
    }

    #let's wait at the beginning of loop - thus we'll avoid some false positives after server boot
    $delta = $interval + $(get-random -minimum 0 -maximum 121)
    write-and-log $logfilename "Sleeping for $delta seconds" 0 "full"
    start-sleep -seconds $delta
    $script_uptime += $delta

    #Connect to vCenter
    Connect-VIServer -Server $vcenter_srv -ErrorAction SilentlyContinue | Out-Null
    if ($error.count) {
        #let's not freak out after first failure, log the exception and just wait a minute.
        write-and-log $logfilename $error[0].exception $error.count
        $error.clear()
        Start-Sleep -Seconds 60
        $script_uptime += 60
        write-and-log $logfilename "Re-trying log-on attempt after 1st failure" 0 "full"
        Connect-VIServer -Server $vcenter_srv -ErrorAction SilentlyContinue | Out-Null
        if ($error.count) {
            #mkay two consecutive failures, let's check the service status.
            write-and-log $logfilename $error[0].exception $error.count
            $error.Clear()
            $sso_srv = get-service
            if ($error.count){
                #ooeps, can't even get service information, something is really wrong (with windows authentication?)
                write-and-log $logfilename "Impossible to retrieve SSO service status, the exception is:" 1
                write-and-log $logfilename $error[0].exception $error.count
                Write-And-Log $logfilename "SSO status unknown, no action taken, waiting for next iteration" 1
            }
            else {
                #mkay SSO, what's your status?
                $sso_srv = $sso_srv | where-object {$_.name -eq "ssotomcat"}
                if ($sso_srv.status -eq "Running") {
                    #if running - restart
                    restart-service $sso_srv.name | Out-Null
                    write-and-log $logfilename "2 consecutive log-on failures, SSO restart" 1
                }
                else {
                    #if stopped - start it
                    if ($sso_srv.status -eq "Stopped") {
                        start-service $sso_srv.name | Out-Null
                        write-and-log $logfilename "SSO not running, SSO start attempt" 1
                    }
                    else {
                        #if anything else (like "Pending start") - just log and wait a little more
                        write-and-log $logfilename "SSO status is $($sso_srv.status), no action taken, waiting for next iteration" 1
                    }
                }
            }
        }
        else {
            write-and-log $logfilename "vCenter server $vcenter_srv successfully connected on 2nd attempt" $error.count "full"
            write-and-log $logfilename "Script waiting time is $script_uptime seconds" 0 "full"
        }
    }
    else {
        write-and-log $logfilename "vCenter server $vcenter_srv successfully connected" $error.count "full"
        write-and-log $logfilename "Script waiting time is $script_uptime seconds" 0 "full"
    }

    Disconnect-VIServer -Server $vcenter_srv -Confirm:$false -Force:$true -ErrorAction SilentlyContinue | out-null
}
This script starts an “infinite” loop (while ($true)), then inside the loop tries to connect to the pre-defined vCenter server ($vcenter_srv variable) every 10 to 12 minutes (I wanted to randomize the interval between attempts somehow, just in case the problem was one of these “self-restarting service” issues). What happens next is not much more than checking the $error variable and writing it to the log. After the first failed attempt the script takes no action, it just waits another minute. If the 2nd attempt is unsuccessful too, the script checks the status of the SSO service (aka ssotomcat).
Please note this step can fail too, and if you’re running this script (like you should) on the server local to SSO, it means you’re in serious trouble at the Windows authentication level.
Depending on the status retrieved, the script will attempt to restart the SSO service (if it is running), start it (if it is stopped) or again do nothing if the service is in one of the transitional states like “Pending start” etc.
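The status handling described here boils down to a tiny decision table. Below is a minimal sketch of it as a standalone helper, which can be checked without a live SSO service. The function name Get-SsoRecoveryAction and the strings it returns are made up for illustration; they are not part of the watchdog script.

```powershell
# Hypothetical helper (not part of the original script): maps a service
# status to the recovery action the watchdog would take.
function Get-SsoRecoveryAction {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Status
    )
    switch ($Status) {
        'Running' { return 'Restart' }  # service is up but log-on fails: restart it
        'Stopped' { return 'Start' }    # service is down: start it
        default   { return 'Wait' }     # transitional state (e.g. StartPending): do nothing
    }
}

# Usage against the real service (assumes the service name "ssotomcat"):
# Get-SsoRecoveryAction -Status (Get-Service -Name ssotomcat).Status
```

Keeping the decision separate from Restart-Service and Start-Service would also make it easy to log the chosen action before acting on it.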
Because the script is intended to run for prolonged periods of time, it also rotates its log roughly every 24 hours. Here I decided to use an internal counter ($script_uptime) rather than invoke Get-Date after every iteration… just because I could ;).
OK – so at this point we have a script that will run until we kill its PowerCLI session, but it wasn’t enough for me.
I just didn’t want to “waste time” logging on to my vCenter/SSO server just to start this script every time I booted my lab, nor did I want to waste my precious RAM keeping an RDP session open there (for the script to survive).
Luckily Windows Task Scheduler offers an option to trigger a task “At system startup”, and making a task out of a PowerCLI script is a well documented process (that I will show you anyway in the few screenshots to come).
First of all, we need to make sure our task runs under a user (or better yet, service) account that has the “Log on as a batch job” privilege and is allowed to log on to the vCenter server.
Let’s also ensure the task can be run whether that account is actually logged on or not.
Secondly, we configure our task to be triggered at system startup (don’t forget to enable this trigger!)
Now, our action will of course be to start a program, and this program will be the PowerShell interpreter. We then pass our script as an argument, so the “Edit Action” window could look somewhat like this:
That’s not really readable, is it?
Well, the “Program / script:” textbox should contain something like this:
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
(the path to where your powershell.exe is installed), while the “Add arguments (optional):” textbox should contain the path to the PowerCLI console file (vim.psc1, in the directory where you have PowerCLI installed) together with the path to the script itself:
-PSConsoleFile "d:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& 'd:\tools\scripts\sso_watchdog.ps1'"
For your convenience I put both examples above in the script code itself.
At this point only small adjustments are needed: I prefer to start this task only if some network connection is available.
I also make sure I can start my task manually.
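If you prefer not to click through the dialogs, the same task can most likely be registered from PowerShell. This is only a sketch under a few assumptions: it needs the ScheduledTasks module (Windows 8 / Server 2012 or later, so it may not exist on an older SSO server), an elevated session, and the task name 'SSO watchdog' and the account 'SEBA\svc_sso_watchdog' are placeholders invented for this example. The paths are the ones used earlier in this post.

```powershell
# Sketch only: requires the ScheduledTasks module and an elevated PowerShell session.

# Action: start powershell.exe with the PowerCLI console file and the watchdog script
$action = New-ScheduledTaskAction `
    -Execute 'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe' `
    -Argument '-PSConsoleFile "d:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" "& ''d:\tools\scripts\sso_watchdog.ps1''"'

# Trigger: at system startup
$trigger = New-ScheduledTaskTrigger -AtStartup

# Settings: start only if a network connection is available
$settings = New-ScheduledTaskSettingsSet -RunOnlyIfNetworkAvailable

# Placeholder account: it needs "Log on as a batch job" and access to vCenter
$cred = Get-Credential -Message 'Account to run the SSO watchdog' -UserName 'SEBA\svc_sso_watchdog'

# Passing user name and password stores the credentials with the task,
# so it can run whether that account is logged on or not.
Register-ScheduledTask -TaskName 'SSO watchdog' `
    -Action $action -Trigger $trigger -Settings $settings `
    -User $cred.UserName -Password $cred.GetNetworkCredential().Password
```

As far as I know, tasks created this way can be started on demand by default, which corresponds to the last adjustment above.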
This setup pretty much did the trick for me: my script started automatically every time I booted my lab, I had a quite verbose log available to check at any given moment, and I was sure that simple recovery actions (SSO restart) would happen without any manual action.
Of course the behavior of this script can’t be controlled via the Services console, nor will it write entries to the Windows Event Log, but it can be started and stopped via the Task Scheduler console and has its own (text-based) log file that rotates every 24 hours. That’s good enough in my book to call it “almost a service” :).
In this post I used automated restarts of a failing SSO only as an example (not even a very good one) of how to make your PowerCLI script work in unattended mode. In fact you can use the “Task Scheduler trick” I described for any PowerCLI script that you want to start together with your Windows server and run without manual intervention.
I hope you will find this post useful, if so – feel free to share it.
As always – any constructive feedback is welcome!