Two of our vSAN clusters consist of VxRail S570 nodes with Intel X710 NICs. A few weeks ago, the ESXi 6.5 hosts started failing again and again: not all at once, but a different host every time. “Failed” means that the host was displayed as “not responding” in the vSphere client, and its VMs stopped running and were restarted by vSphere Availability on other hosts.
Of course, we suspected a network error at first, but the logs of the physical upstream switches showed no related events, and none of the other hosts were affected.
The problem was very sporadic and we couldn’t make any sense of it. In addition, a reboot of the host solved the issue for a short time.
So, we analyzed the logs and could see APD and PDL (“all paths down” and “permanent device loss”) errors at first. In fact, ESXi lost the connection to the other vSAN hosts but it still wasn’t clear why.
The problem in a nutshell
After analyzing and debugging deeper and deeper, we found “malicious driver event” messages in vmkernel.log just before an outage occurred:
2018-06-17T17:41:52.760Z cpu50:66340)i40en: i40en_HandleMddEvent:6829: Malicious Driver Detection event 0x02 on TX queue 0 PF number 0x01 VF number 0x00
2018-06-17T17:41:52.760Z cpu50:66340)i40en: i40en_HandleMddEvent:6855: TX driver issue detected, PF reset issued
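To check whether a host shows the same symptom, you can search its vmkernel.log for these messages. The following is only a minimal sketch: the helper names mdd_events and mdd_count are made up for this example, and it assumes the log lives at /var/log/vmkernel.log and that the message text matches the lines above.

```shell
# Hypothetical helper: print every MDD event line found in a log file.
# Example call on an ESXi host (path is an assumption):
#   mdd_events /var/log/vmkernel.log
mdd_events() {
    grep 'Malicious Driver Detection event' "$1"
}

# Hypothetical helper: print only the number of MDD event lines.
mdd_count() {
    grep -c 'Malicious Driver Detection event' "$1"
}
```

A count above zero shortly before a host dropped out of the cluster would point to the same problem we had. The message text may differ between driver versions, so treat an empty result with caution.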
I didn’t know this error message before, so I started a search on the net.
I noticed that these error messages are Intel-specific and mainly affect the X710 and XL710 chipsets (and some others). But since Intel X(L)710 NICs are very common, servers from all major manufacturers are affected, for example Dell, HPE, Lenovo and several others.
In addition, these error messages occur in both ESXi 6.0 and ESXi 6.5, and it seems that this “malicious driver detection” (MDD) often causes problems in many setups (not only in VMware infrastructures).
However, the most important question for me was:
What is this “malicious driver detection” and what exactly does it do?
I found a short description in an Intel datasheet for another NIC model:
“Malicious behavior exhibited by the driver can be a result of incorrect activation of the network controller or a virus on a certain VM. The 82580EB/DB contains internal circuitry to protect from an
attack on one virtual machine disrupting operation of other virtual machines. When malicious driver behavior is detected on a certain queue, the 82580EB/DB disables activity of the queue and sends
notification to the VMM.”
“18.104.22.168.2 Interrupt on Misbehavior of VM (Malicious Driver Detection).
The hardware can be programmed to take some action as a result of some misbehavior of a VM. These actions might hint to the fact that some VM is malicious and the VMM should remedy the situation.
On detection of a malicious driver event the 82580EB/DB stops activity of the offending queue, asserts relevant bit in the MDFB.Block Queue field and generates an interrupt by asserting the ICR.MDDET bit.”
As always, it’s all about security… Yay!
Well, the next question is: What can we do if we are faced with a malicious driver event?
Maybe a solution for malicious driver event messages
Unfortunately, there is no simple solution. Disabling TSO and LRO on the ESXi 6.x hosts may help, but not in all circumstances.
TSO stands for “TCP Segmentation Offload” and causes the NIC to take over TCP segmentation and divide larger amounts of data into TCP segments. Without this feature, the CPU has to segment TCP/IP packets. LRO stands for “Large Receive Offload” and this feature reassembles incoming network packets into larger buffers so that the CPU has to process fewer packets.
In any case, it is basically a good idea to start with it. TSO and LRO don’t make things easier and disabling these two features can improve stability. The disadvantage is that the CPU load on the ESXi host increases slightly.
To disable TSO and LRO perform the following procedure on each host (one at a time):
1.) Put the ESXi host into maintenance mode
2.) Log in to the ESXi host via SSH
3.) Execute the following commands in the ESXi CLI:
esxcli system settings advanced set -o /Net/UseHwTSO -i 0
esxcli system settings advanced set -o /Net/UseHwTSO6 -i 0
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
4.) Reboot the host
5.) Exit maintenance mode
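After the reboot, it is worth confirming that the three options really are set to 0. The sketch below is one possible way to do that. It assumes that `esxcli system settings advanced list -o <option>` prints a line of the form "Int Value: <n>"; int_value and show_offload_settings are hypothetical helper names, not part of esxcli.

```shell
# Hypothetical helper: read "esxcli system settings advanced list" output
# on stdin and print the current integer value of the option.
# Assumes the output contains a line such as "   Int Value: 0".
# The "Default Int Value:" line is skipped by the anchored pattern.
int_value() {
    awk -F': ' '/^ *Int Value:/ { print $2 }'
}

# Hypothetical helper: print the current value of each option changed above.
# Only works on an ESXi host, where esxcli is available.
show_offload_settings() {
    for opt in /Net/UseHwTSO /Net/UseHwTSO6 /Net/TcpipDefLROEnabled; do
        printf '%s = %s\n' "$opt" \
            "$(esxcli system settings advanced list -o "$opt" | int_value)"
    done
}
```

If the assumption about the output format holds, calling show_offload_settings on a host should print one line per option, each ending in 0 once TSO and LRO are disabled.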
For more information about TSO and LRO, just have a look at this VMware knowledge base article: https://kb.vmware.com/s/article/2055140
What else can you do?
If this procedure doesn’t help to eliminate the malicious driver event, the only way is a driver upgrade. And that’s sometimes a mess.
The good news: The official statement from Intel is that this problem was fixed in the i40en driver with version 1.7.1 in late May 2018. Sounds good, doesn’t it? The short answer is: Not necessarily.
To be more concrete: For supported VMware configurations you have to consult the Hardware Compatibility Guide (HCL) from VMware:
Or for vSAN setups, including our VxRails, the vSAN HCL is mandatory:
This list contains all supported hardware and software configurations and, among other things, it explicitly prescribes which driver versions you have to use.
In any case, you first need to check that the i40en driver in version 1.7.1 is supported for your hardware. And that’s only the case for ESXi 6.7. For ESXi 6.5 the highest supported version is 1.5.8.
Therefore, you must upgrade to ESXi 6.7, or you can wait until VMware updates its HCL and supports the 1.7.1 i40en for ESXi 6.0 and 6.5 in the future.
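Before planning an upgrade, it helps to know which i40en version a host is actually running. On an ESXi host, `esxcli software vib list | grep i40en` shows the installed driver package. The sketch below compares such a version string against 1.7.1, the release in which Intel says the problem was fixed. version_ge is a hypothetical helper, and it assumes the version string starts with a dotted number, optionally followed by a build suffix after a hyphen.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Anything after the first "-" (assumed to be a build suffix) is ignored,
# and the remaining dotted fields are compared numerically, left to right.
version_ge() {
    vg_a=${1%%-*}
    vg_b=${2%%-*}
    while [ -n "$vg_a" ] || [ -n "$vg_b" ]; do
        vg_x=${vg_a%%.*}
        vg_y=${vg_b%%.*}
        # A missing field counts as 0, so "1.7" is treated like "1.7.0".
        [ "${vg_x:-0}" -gt "${vg_y:-0}" ] && return 0
        [ "${vg_x:-0}" -lt "${vg_y:-0}" ] && return 1
        # Drop the field just compared, or empty the string if it was the last.
        case $vg_a in *.*) vg_a=${vg_a#*.} ;; *) vg_a= ;; esac
        case $vg_b in *.*) vg_b=${vg_b#*.} ;; *) vg_b= ;; esac
    done
    return 0
}
```

For example, `version_ge 1.5.8 1.7.1` returns a non-zero status, while `version_ge 1.7.11 1.7.1` succeeds. This says nothing about whether a version is supported for your hardware; for that, the HCL remains the only reference.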
For us, the situation is very clear. Our ESXi version is bound to the VxRail supported versions and ESXi 6.7 isn’t supported yet by Dell EMC. Hopefully in the next weeks there will be an upgrade path to VxRail 4.7.x which includes ESXi 6.7.
So, keep your fingers crossed in any case.
Another important note:
There is a newer version of this Intel driver: i40e (version 2.0.7). But with ESXi 6.7, this new driver may result in a PSOD. See: https://kb.vmware.com/s/article/2126909
Update – 7th October 2018:
In the meantime, VMware has released i40en driver version 1.7.11 for ESXi 6.0 and 6.5, and also for 6.7. This version addresses the MDD problem and contains more bug fixes. I therefore recommend using this driver for all Intel X710, XL710, XXV710 and X722 network cards.
- Malicious Driver Detection (MDD) event handling
- Unable to enable SR-IOV via Web Client
- PSOD when booting a Supermicro X710DAi server
- Link is not being detected while toggling Promiscuous Mode on a VF interface
There are some other known issues and workarounds described in the release notes. Please check this file for further information.