We recently upgraded from NSX-v 6.3.3 to NSX-v 6.4.3. The upgrade itself went well, but afterwards we noticed that the list of logical switches no longer loaded correctly.
Depending on how we sorted the list, it was either empty right away with the error message “Internal server error has occurred”, or the same error appeared while scrolling. Either way, we could no longer use the list of logical switches, and the error message itself was not much help.
So we went through the logs of the NSX Manager and found what we were looking for in the vsm.log.
To view the vsm.log you have to connect to the NSX Manager via SSH, go to enable mode and then switch to engineering mode (see: https://kb.vmware.com/s/article/2149630). The vsm.log can be found in the directory /home/secureall/secureall/logs/ .
The following entry was written to vsm.log whenever we tried to scroll through the Logical Switches list:
2018-09-14 17:54:25.745 GMT WARN http-nio-127.0.0.1-7441-exec-30 RemoteInvocationTraceInterceptor:88 - Processing of VsmHttpInvokerServiceExporter remote call resulted in fatal exception: com.vmware.vshield.vsm.vdn.facade.VdnInventoryFacade.getAllUiVirtualWiresList
We did some research and stumbled across the following KB article, which matches our issue exactly: https://kb.vmware.com/s/article/54442
The NullPointerException described there is caused by obsolete entries in an NSX Manager DB table. When you open the Logical Switches page, a function checks the parent ESXi host of every VM that is a member of a logical switch. If a VM references a parent ESXi host that is not listed in the DB table, this error is triggered.
VM templates are a typical cause. When a VM is converted to a template, an entry remains in the domain_object table that links the VM to the ESXi host it was running on at the time of the conversion. If that ESXi host is later removed from the inventory, the entry stays in the NSX Manager database and the problem occurs.
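To make the mechanism easier to picture, here is a minimal toy model in Python. It is not NSX Manager code: the DomainObject class, the parent_host_name function, and all IDs and names are made up for illustration. It only shows how a lookup of a parent host that no longer exists returns nothing, and how using that result blindly fails, which is roughly what a NullPointerException is on the Java side.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DomainObject:
    """Hypothetical, simplified row of the domain_object table."""
    objectid: str
    dtype: str
    host_id: Optional[str] = None
    name: str = ""


# Toy inventory: one host that still exists, one healthy VM,
# and one template whose parent host (host-999) was removed.
table: Dict[str, DomainObject] = {
    "host-100": DomainObject("host-100", "VimHostSystem", name="esx01"),
    "vm-1": DomainObject("vm-1", "VimVirtualMachine", host_id="host-100", name="app01"),
    "vm-2": DomainObject("vm-2", "VimVirtualMachine", host_id="host-999", name="template01"),
}


def parent_host_name(vm: DomainObject, objects: Dict[str, DomainObject]) -> str:
    """Look up the parent host without checking whether it still exists."""
    host = objects.get(vm.host_id)  # None when the host row is gone
    return host.name                # fails on None, like an NPE in Java


print(parent_host_name(table["vm-1"], table))
try:
    parent_host_name(table["vm-2"], table)
except AttributeError as exc:
    print("lookup failed for vm-2:", exc)
```

The healthy VM resolves to its host, while the template with the stale reference makes the lookup blow up.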
How could we resolve the Internal Server Error?
Okay. Here we go. Now that we know why the error occurs, we can find out which templates are responsible (we have a bunch of VM templates and I didn’t want to have to convert them all).
So we logged into the NSX Manager database:
psql -U secureall
And then we executed the following SQL query to display a list of all VMs with outdated entries in the NSX Manager database:
select objectid, host_id from domain_object where dtype='VimVirtualMachine' and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem');
This gives us a list with all stale entries:
objectid | host_id
vm-32044 | host-59994
vm-36338 | host-55344
vm-59817 | host-58241
vm-94263 | host-59994
vm-99187 | host-55344
vm-99400 | host-59819
vm-46179 | host-59819
vm-99257 | host-59819
vm-88091 | host-59819
vm-99682 | host-55344
vm-99725 | host-59819
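If you want to see what the query does before running it against the NSX Manager database, you can try it on a toy table first. The following sketch uses Python's built-in sqlite3 module with an in-memory database. The table only has the three columns the query touches, and the rows (host-100, host-999, vm-1, vm-2) are invented test data, not values from our environment.

```python
import sqlite3

# In-memory stand-in for the domain_object table (assumed, reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domain_object (objectid TEXT PRIMARY KEY, dtype TEXT, host_id TEXT)"
)
conn.executemany(
    "INSERT INTO domain_object VALUES (?, ?, ?)",
    [
        ("host-100", "VimHostSystem", None),        # host still in the inventory
        ("vm-1", "VimVirtualMachine", "host-100"),  # VM with a valid parent host
        ("vm-2", "VimVirtualMachine", "host-999"),  # template pointing at a removed host
    ],
)

# Same statement as used on the NSX Manager, only wrapped across lines.
STALE_QUERY = """
    select objectid, host_id from domain_object
    where dtype='VimVirtualMachine'
      and host_id NOT IN (select objectid from domain_object where dtype='VimHostSystem')
"""

stale_rows = conn.execute(STALE_QUERY).fetchall()
print(stale_rows)
```

Only vm-2 is returned, because host-999 does not exist as a VimHostSystem row. That is exactly the kind of entry the real query lists.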
Finally, we looked up the VM name for each VM MoRef ID in the Managed Object Browser (MOB) of the vCenter.
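With more than a handful of stale entries, typing every MoRef ID into the MOB by hand gets tedious. As a small sketch, the links can be generated in Python. This assumes the usual MOB URL layout of /mob/?moid=<MoRef ID>. The host name vcenter.example.com is a placeholder and mob_url is a helper made up for this example.

```python
from urllib.parse import urlencode


def mob_url(vcenter_host: str, moref_id: str) -> str:
    """Build the MOB link for one managed object reference (assumed URL layout)."""
    return f"https://{vcenter_host}/mob/?{urlencode({'moid': moref_id})}"


# A few of the stale VM MoRef IDs returned by the SQL query.
stale_vms = ["vm-32044", "vm-36338", "vm-59817"]

for moref in stale_vms:
    print(mob_url("vcenter.example.com", moref))
```

Open each link while logged in to the vCenter. The name property on the page shows which template hides behind the ID.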
Then we converted each affected template into a VM and back into a template, so that its entry references a host that still exists.
With that, the Internal Server Error was gone and we could access the complete list of Logical Switches again. Hooray!