PowerCLI report on storage paths (mis)configuration.

After drifting away into some niche topics in my previous post, I’m back to my “hobby@work” – PowerCLI scripting!

Anybody who has ever worked with block storage (so FC or iSCSI) has sooner or later come across the task of “managing” storage path configuration in the environment.

Whether you are troubleshooting performance issues (and storage is one of the “usual suspects”, right?), migrating to a new SAN and want to verify the settings, or you are a top-notch IT professional (like we all are 😉 ) who just does these kinds of checks on a regular basis – you don’t want to do it manually with the vSphere/Web Client.
PowerCLI is the obvious answer, and it is really easy to report these settings with the Get-ScsiLun and Get-ScsiLunPath cmdlets. There are a great many scripts of this kind on the Web, and once you get “per device” output you can easily(?) manipulate it with Excel or any other spreadsheet application you prefer.

Well, I’m not good with Excel at all, and whenever I learn the basic moves around it, they introduce a new version and a new “learning curve” starts for me. So I needed my “pathfinder” script to do a little more than just “basic reporting”. I wanted things like correlating the canonical name (t10, naa, eui) with the “human friendly” datastore name, and extra information such as the number of paths disabled by an administrator or which PSP has been selected for the device. Most of all, I wanted path configurations that “stand out” to be visible immediately after I open the CSV (what else 😉 ) report, with no additional filtering in Excel required.

Luckily, the Gods of PowerShell blessed us with the Sort-Object cmdlet, which makes writing such a script not really difficult.

Without much further ado – here it is:

I know it looks lengthy, but most of this “code” is to make it foolproof and chatty, or display progress bars and fancy counters ;).

I will try to focus on just a few lines, where all the “magic” happens.

First of all – I made an assumption that for the given VI container (typically a host cluster) all datastores are configured with “full-mesh” access, so each host can access all datastores. If you happen to have a setup where a single host in the cluster has some “extra” datastore (to copy some VMs from a different cluster, for example), this script will report it as an error for all vSphere hosts except “the chosen one”. So if you have such a config, just disable the condition in Line 188 (fix it to $true or something like that 😉 ).
Please note that this condition will also kick in where you have datastores that span multiple LUNs (more devices than datastores!) or where your vSphere hosts can see LUNs that are not VMFS datastores (RDMs included!). If at this point you think this check gives you too much trouble already – fix it to $true now! ;).
I would also like to stress that I have always used this script for host clusters, simply because uniform SAN zoning per cluster is the typical setup for me. If you want, you can also (try to) use this script for a whole datacenter or a folder of hosts/clusters/datacenters; just remember that the above conditions still apply – if SAN zoning is not uniform between clusters you will get lots of “false positives” (unless you disable the check in Line 188).

Line 163 is where I gather the vSphere hosts that are connected or in maintenance mode, because we can only report from hosts that are responsive.
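To illustrate the host filter, here is a minimal sketch that runs without a vCenter connection. The sample objects are hypothetical stand-ins for what Get-VMHost returns, and the variable names are my own, not taken from the script.

```powershell
# Hypothetical stand-ins for Get-VMHost output; only Name and
# ConnectionState are modeled here.
$allHosts = @(
    [pscustomobject]@{ Name = 'esx01'; ConnectionState = 'Connected' }
    [pscustomobject]@{ Name = 'esx02'; ConnectionState = 'Maintenance' }
    [pscustomobject]@{ Name = 'esx03'; ConnectionState = 'NotResponding' }
    [pscustomobject]@{ Name = 'esx04'; ConnectionState = 'Disconnected' }
)

# Keep only hosts that can still answer storage queries.
# The state is cast to a string so the same filter works for enum values.
$responsiveHosts = @($allHosts | Where-Object {
    'Connected', 'Maintenance' -contains "$($_.ConnectionState)"
})

$responsiveHosts.Name -join ', '
```

Against a live environment the same Where-Object filter would presumably be fed by Get-VMHost scoped to the chosen cluster.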

In Line 169 I filter out local and NAS datastores (because path configuration doesn’t apply to them) and create an array containing (datastore name, canonical name) pairs that will help me resolve datastore names from canonical names later on. There is a small caveat here: to retrieve the canonical name of a device I’m using only the first element of the ExtensionData.Info.VMFS.Extent array. That’s no big deal for me because of the (in)famous check in Line 188, but if you have VMFS datastores that span multiple extents, you (again) need to fix this check to $true. The script will not be able to resolve datastore names for these additional extents, but it will still report discrepancies at device level, only with an empty datastore name. (Do you still have multiple extent datastores? Really?)
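The name lookup can be sketched roughly as below. This is not the code from the script: New-CanonicalNameMap and New-SampleDatastore are hypothetical helpers, the sample datastores are made up, and I use a hashtable instead of an array of pairs for brevity. The local-datastore filter is left out, because how the script detects local disks is not shown here.

```powershell
# Build a canonical-name -> datastore-name lookup from datastore objects.
function New-CanonicalNameMap {
    param([object[]]$Datastore)

    $map = @{}
    foreach ($ds in $Datastore) {
        # NAS datastores have no SCSI device behind them, so skip non-VMFS types.
        if ("$($ds.Type)" -ne 'VMFS') { continue }

        # Only the first extent is read, so further extents of a spanned
        # datastore will not resolve to a name.
        $canonicalName = $ds.ExtensionData.Info.Vmfs.Extent[0].DiskName
        $map[$canonicalName] = $ds.Name
    }
    return $map
}

# Hypothetical mock that mimics the shape of a Get-Datastore result.
function New-SampleDatastore {
    param([string]$Name, [string]$Type, [string[]]$DiskName = @())

    $extents = @($DiskName | ForEach-Object { [pscustomobject]@{ DiskName = $_ } })
    [pscustomobject]@{
        Name          = $Name
        Type          = $Type
        ExtensionData = [pscustomobject]@{
            Info = [pscustomobject]@{
                Vmfs = [pscustomobject]@{ Extent = $extents }
            }
        }
    }
}

$sample = @(
    New-SampleDatastore -Name 'DS_PROD_01' -Type 'VMFS' -DiskName 'naa.0001'
    New-SampleDatastore -Name 'DS_PROD_02' -Type 'VMFS' -DiskName 'naa.0002', 'naa.0003'
    New-SampleDatastore -Name 'NFS_ISO'    -Type 'NFS'
)

$nameMap = New-CanonicalNameMap -Datastore $sample
$nameMap['naa.0001']   # resolves to DS_PROD_01
$nameMap['naa.0003']   # second extent of DS_PROD_02: no entry, so empty
```

The second extent of DS_PROD_02 shows the caveat from the paragraph above: the device would still be reported, just without a datastore name.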

Line 193 is the place where the script gathers the actual path configuration. I’m counting the total number of paths, together with the number of active and disabled ones. In a perfect world all hosts should report identical numbers. If there are discrepancies in the number of paths that hosts can see, there is most likely a problem with SAN zoning. If the number of active paths differs between hosts, it can be either a SAN issue or a different PSP selected for the device (that’s why I also retrieve the latter).
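The counting itself boils down to something like the sketch below. Get-PathSummary is a hypothetical helper, and the path objects are mocks carrying only the State property that Get-ScsiLunPath results expose, so this runs without a vCenter connection.

```powershell
# Summarize the path states of one device as seen from one host.
function Get-PathSummary {
    param(
        [string]$VMHostName,
        [string]$CanonicalName,
        [string]$MultipathPolicy,
        [object[]]$Path = @()
    )

    # Cast states to strings so enum values and plain strings compare alike.
    $states = @($Path | ForEach-Object { "$($_.State)" })

    [pscustomobject]@{
        VMHostName    = $VMHostName
        CanonicalName = $CanonicalName
        TotalPaths    = $states.Count
        ActivePaths   = @($states | Where-Object { $_ -eq 'Active' }).Count
        DisabledPaths = @($states | Where-Object { $_ -eq 'Disabled' }).Count
        PSP           = $MultipathPolicy
    }
}

# Hypothetical paths for a single LUN: two active, one standby, one disabled.
$paths = @(
    [pscustomobject]@{ State = 'Active' }
    [pscustomobject]@{ State = 'Active' }
    [pscustomobject]@{ State = 'Standby' }
    [pscustomobject]@{ State = 'Disabled' }
)

$summary = Get-PathSummary -VMHostName 'esx01' -CanonicalName 'naa.0001' `
    -MultipathPolicy 'RoundRobin' -Path $paths
$summary
```

In a live session the path list would come from piping a LUN returned by Get-ScsiLun into Get-ScsiLunPath, and the PSP from the LUN’s MultipathPolicy property.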

I was also trying to fit “datastore name resolution” into Line 193, but somehow couldn’t get “nested $_ statements” working, which is why I ended up creating “path entries” for each device between Line 196 and Line 207. You might not like it, but I wasn’t able to figure out anything more elegant; luckily these are simple operations that do not waste much time.

Line 224 is really important, because all the duplicate “path vectors” are eliminated here, and we end up with an array of unique (canonical name, path configuration) objects… But this isn’t even the final form of this information! 😉
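Here is a minimal sketch of that de-duplication with Sort-Object -Unique. The entries and property names are made up for illustration and may differ from what the script uses. The trick is to leave the host name out of the sort key, so identical configurations from different hosts collapse into one object.

```powershell
# Hypothetical raw entries: one per (host, device) pair.
$raw = @(
    [pscustomobject]@{ VMHostName = 'esx01'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx02'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx03'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx01'; CanonicalName = 'naa.0002'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx02'; CanonicalName = 'naa.0002'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx03'; CanonicalName = 'naa.0002'; TotalPaths = 2; ActivePaths = 1; DisabledPaths = 0; PSP = 'MostRecentlyUsed' }
)

# VMHostName is deliberately not part of the key: two entries count as
# duplicates when device and path configuration match, whatever the host.
$keys = 'CanonicalName', 'TotalPaths', 'ActivePaths', 'DisabledPaths', 'PSP'
$unique = @($raw | Sort-Object -Property $keys -Unique)

$unique | Format-Table -Property $keys -AutoSize
```

With this sample, naa.0001 survives as a single object while naa.0002 survives twice, once per distinct configuration.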

Now, don’t get confused by what happens between Line 233 and Line 255, because it is really simple. The script first looks for duplicate canonical names. If it finds any, it means that the same block device has a different path configuration on different hosts, so we’ve got “some issues”. If you are lucky there is only one host that “stands out”, but in the worst case scenario each host can have a different “path vector” in the fabric (sic!).
Obviously, the script cannot decide which configuration is correct (it is still for the human to decide, right?). That’s why between Line 236 and Line 243 the script goes back to the “raw data” (meaning: the unsorted array) to retrieve entries for all detected path configurations for the device in question. It may sound complicated but again, not much time is wasted on these operations (as I will show you later).

The final result (in the CSV report) is that for each block device that has a uniform configuration in the cluster we get a single entry (with OK status and AllVMHosts as “VMHostName”), and for anything non-uniform we get a list of all hosts from the cluster with the path configuration specified for each host. And this is exactly what I wanted to achieve – immediately after opening the report I can see where the issues are (NOK status!) and which hosts I should go to in order to fix them (once I recognize which of the different path configurations is correct B) ).
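The whole “find duplicates, go back to raw data, build the report” sequence can be sketched as follows. This is an illustration on made-up data rather than the script’s own code: New-ReportRow is a hypothetical helper, and I use Group-Object to spot canonical names that survived de-duplication more than once.

```powershell
# Hypothetical raw entries: naa.0001 is uniform, naa.0002 differs on esx03.
$raw = @(
    [pscustomobject]@{ VMHostName = 'esx01'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx02'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx03'; CanonicalName = 'naa.0001'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx01'; CanonicalName = 'naa.0002'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx02'; CanonicalName = 'naa.0002'; TotalPaths = 4; ActivePaths = 4; DisabledPaths = 0; PSP = 'RoundRobin' }
    [pscustomobject]@{ VMHostName = 'esx03'; CanonicalName = 'naa.0002'; TotalPaths = 2; ActivePaths = 1; DisabledPaths = 0; PSP = 'MostRecentlyUsed' }
)

# Copy one entry into a report row with a status and a host label.
function New-ReportRow {
    param($Entry, [string]$Status, [string]$VMHostName)
    [pscustomobject]@{
        Status        = $Status
        VMHostName    = $VMHostName
        CanonicalName = $Entry.CanonicalName
        TotalPaths    = $Entry.TotalPaths
        ActivePaths   = $Entry.ActivePaths
        DisabledPaths = $Entry.DisabledPaths
        PSP           = $Entry.PSP
    }
}

$keys = 'CanonicalName', 'TotalPaths', 'ActivePaths', 'DisabledPaths', 'PSP'
$unique = @($raw | Sort-Object -Property $keys -Unique)

$report = @(
    foreach ($device in ($unique | Group-Object -Property CanonicalName)) {
        if ($device.Count -eq 1) {
            # One surviving path vector: every host agrees on this device.
            New-ReportRow -Entry $device.Group[0] -Status 'OK' -VMHostName 'AllVMHosts'
        }
        else {
            # Several path vectors: list every host from the raw data.
            foreach ($entry in ($raw | Where-Object { $_.CanonicalName -eq $device.Name })) {
                New-ReportRow -Entry $entry -Status 'NOK' -VMHostName $entry.VMHostName
            }
        }
    }
)

$report | Format-Table -AutoSize
```

For this sample the report holds one OK row for naa.0001 and three NOK rows for naa.0002, one per host, which is the shape described above.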

Here is an example report from my home lab:

As you can see, I’ve got a real mess there: disabled paths, different PSPs configured... there is a lot of “gardening” for me to work on ;).

I would also like to show you a screenshot of this script in some “real life” action. I hope you understand I had to “anonymize” most of the information 😉

[Screenshot: check_paths_in_action]

As you can see, the report for a cluster with 8 vSphere hosts and 35 (FC) datastores was created in about 12 and a half minutes. That is not exactly blistering speed, but the script spends most of the time querying the hosts for LUN and path information. I could probably speed this up by using PowerCLI views, but… views are something I still need to learn about, because I don’t feel very comfortable around them ;).

Also, the sorting and the host and datastore resolution could probably be “coded” better, but that part only takes seconds compared to the information gathering, so I wouldn’t worry too much about it.

A word of explanation on why I decided to create this (datastore name, canonical name) array just to resolve datastore names. Initially I tried to start this script with the Get-Datastore cmdlet to retrieve all datastores from the given VI container (and then pipe the collection of datastores to the Get-ScsiLun and Get-ScsiLunPath cmdlets), instead of iterating through vSphere hosts to look for LUNs. Had that worked, this “name resolving array” wouldn’t be needed, of course.
Unfortunately, not only did it take longer (as you can see, querying a host for LUNs takes about 90 seconds, while querying a datastore for LUNs and paths took at least 3 minutes per datastore), but it also just didn’t work for iSCSI LUNs (the Get-ScsiLun cmdlet simply hung with neither output nor errors when given an iSCSI datastore as pipeline input – at least in every environment where I had the opportunity to test this script).

I hope you will find this script useful, and any feedback is welcome 😉

Sebastian Baryło

Successfully jumping to conclusions since 2001.

  • hrooney

    This looks like exactly what I needed, unfortunately it fails to produce the csv file at the end with no indication why, just “empty report”. It DOES produce log files.

    >>> ERROR <<>> ERROR <<< [2014-12-30 07:03:03] Total of 21 hosts and 90 datastores checked in 543.94s, 22 ERRORS reported, exiting

    • sbarylo

      Hi! The error above is the result of the “IF” checks in Lines 188 and 219.
      Generally speaking, this script works well if all hosts in the “VI Container” have equal access to all datastores (so I usually run it against an HA/DRS cluster where, by VMware best practice, you should have homogeneous access to LUNs/datastores).
      If (I’m guessing now) you ran it against a whole VI datacenter where (say) you have two clusters with different SAN zoning, this might be the reason for these errors.