Understanding congestion in vSAN

If things go wrong in a VMware vSAN 6.x environment, you may see congestion warnings and errors, usually in hybrid setups. In this post I try to explain what congestion means in the world of vSAN, why it is a problem and how to deal with it.

What is congestion?

In short, congestion occurs when there is a bottleneck in the lower layers of your vSAN infrastructure. Or, more concretely: destaging the incoming data from the flash cache to the capacity disks generates so much load that one layer of vSAN can no longer handle the amount of data.

And the long answer is:

Congestion is a flow control mechanism of vSAN that reduces the rate of incoming I/O at the input layer. This is done by introducing a delay equal to the one that would occur at the bottleneck (for example at the physical disks). Thus it is an effective way to shift the latency back to the ingress.

The congestion propagates through the entire stack, from the bottleneck up to the ingress. In this way the upper layers can throttle incoming I/O, and the congestion is also perceived there. For a root cause analysis it is therefore important to find the lowest layer at which the congestion occurs.

The congestion values are integers ranging from 0 to 255 (0 = good, 255 = bad), and the resulting I/O delay is calculated from the measured congestion using a randomized exponential backoff method.
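
To make this more tangible, here is a minimal conceptual sketch in shell/awk of how a congestion value between 0 and 255 could be turned into a randomized ingress delay. This is not vSAN code: the exponential scaling and the 4000 microsecond ceiling are pure assumptions for illustration.

# Conceptual sketch only -- the scaling and the 4000 us ceiling are
# assumptions, not vSAN internals.
congestion=200
awk -v c="$congestion" -v maxdelay=4000 'BEGIN {
  srand();
  # delay grows roughly exponentially with the congestion value (0-255)
  base = maxdelay * (exp(log(2) * c / 32) - 1) / 255;
  # randomize the delay so throttled sources do not retry in lockstep
  delay = base * (0.5 + rand());
  printf "congestion %d -> delay the next I/O by ~%.0f us\n", c, delay;
}'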

Are there different types of congestion?

Yes. Depending on the layer at which the bottleneck occurs, a distinction is made between the following types:

  • SSD congestion
    If the write buffer of the caching disk exceeds a size threshold, SSD congestion occurs.
  • IOPS congestion
    This happens when the number of incoming I/Os is too large for the physical disks.
  • Log congestion
    If vSAN runs out of internal log space, log congestion occurs.
  • Memory congestion
    This occurs when a memory heap of a vSAN layer exceeds its threshold value.
  • Component congestion
    If the size of an internal component table of vSAN exceeds a threshold limit, this kind of congestion occurs.
  • Slab congestion
    This is the case if the number of operations exceeds the capacity of the operation slabs.

Each of these congestion types is associated with a specific resource that can be reclaimed. That means the congestion warnings and errors disappear once the underlying cause of the congestion is resolved.

Why is it a problem?

Having congestion somewhere in the storage infrastructure is not a good thing. Because it is an I/O throttling mechanism, the result is higher latency in your virtual machines and applications. You will also see higher CPU load (due to I/O wait), and your applications become less stable.

Usually congestion occurs when either something is not sized properly or there is an unexpected overload (for example a Denial-of-Service attack or a misbehaving application). Depending on the circumstances, even such an overload can point back to incorrect sizing.

In any case, it’s a problem. The worst case I’ve seen so far was a cluster where all ESXi hosts stopped responding and the virtual machines were marked as inaccessible. The result was a total failure of the entire cluster and all virtual machines.

So be careful with this kind of error!

How can you view the current congestion level and the congestion limits?

The easiest way is to look at the vSAN Health tab in the vSphere Web Client. Select the vSAN cluster on the left, go to the “Monitor” tab, select “vSAN” in the sub-tab and click on “Health”. In the “Physical disk” section you can see the checks for the different congestion types (SSD, slab, log, memory and component congestion).

[Screenshot: vSAN congestion health tab]
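
On newer 6.x builds the same health checks can also be queried directly on an ESXi host. Assuming the esxcli vsan health namespace is available on your release, the congestion-related checks can be filtered like this:

esxcli vsan health cluster list | grep -i congestion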

If you want to dig deeper on the CLI, try these commands on the ESXi hosts in the vSAN cluster:

for ssd in $(localcli vsan storage list|grep "Group UUID"|awk '{print $5}'|sort -u);do \
echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion; \
done

This shows the congestion metrics for each disk group (LSOM layer). Another interesting value is the current size of the LLOG and PLOG, which you can check like this:

for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do \
llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by LLOG"|awk -F \: '{print $2}'); \
plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by PLOG"|awk -F \: '{print $2}'); \
llogGib=$(echo $llogTotal |awk '{print $1 / 1073741824}'); \
plogGib=$(echo $plogTotal |awk '{print $1 / 1073741824}'); \
allGibTotal=$(expr $llogTotal \+ $plogTotal|awk '{print $1 / 1073741824}'); \
echo $ssd;echo " LLOG consumption: $llogGib"; \
echo " PLOG consumption: $plogGib"; \
echo " Total log consumption: $allGibTotal"; \
done

For the handling of I/O requests as well as for memory and log management, there are also some advanced configuration options in vSAN:

esxcfg-advcfg -g /LSOM/lsomSsdCongestionLowLimit
esxcfg-advcfg -g /LSOM/lsomSsdCongestionHighLimit

These two commands output the threshold values for SSD congestion. With the following commands you can see the thresholds for memory and log congestion:

esxcfg-advcfg -g /LSOM/lsomMemCongestionLowLimit
esxcfg-advcfg -g /LSOM/lsomMemCongestionHighLimit
esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB
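
If you want to read all six thresholds in one go, for example to compare them across hosts, a small loop over the same keys does the job:

# Print the current LSOM congestion thresholds on this host
for key in lsomSsdCongestionLowLimit lsomSsdCongestionHighLimit \
           lsomMemCongestionLowLimit lsomMemCongestionHighLimit \
           lsomLogCongestionLowLimitGB lsomLogCongestionHighLimitGB; do
  esxcfg-advcfg -g /LSOM/$key
done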

What can you do when you are faced with congestion?

This depends on the layer at which the congestion occurs and on what causes it. Maybe it is an unexpected flood of I/O, a Denial-of-Service situation caused by attackers or by application errors. In this case, find out what is responsible for the high amount of IOPS and fix it.

If the IOPS are legitimate and you are running a hybrid vSAN setup, the culprit is usually the destaging process, which moves data from the fast caching layer down to the slower capacity layer.

In this situation there is not much you can do, because it is a physical limitation of how magnetic disks work. The simplest, and most expensive, solution is to replace the capacity disks with flash devices so that you end up with an all-flash configuration. If the capacity tier is as fast as the caching tier, the destaging process is no longer a problem.

Another option in a hybrid setup is to buy better hard disks for the capacity tier. The higher the RPM of the magnetic disks, the more data can be written in the same amount of time (seek time plays a role as well). But there is a downside: the higher the RPM, the lower the capacity of the disk usually is.

And if you have free slots in your servers, you can also add more capacity disks. The vSAN objects, and with them the write load, are then spread across more disks in the disk group, which helps to prevent congestion.
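
If the disks in your cluster are claimed manually (not in automatic mode), a new capacity device can be added to an existing disk group straight from the host CLI. The device names below are placeholders, so replace them with the cache device of the target disk group and the new capacity disk:

# Placeholders: -s is the cache device of the existing disk group,
# -d is the new capacity disk to add to it
esxcli vsan storage add -s naa.5000000000000001 -d naa.5000000000000002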

Is it the same with All-Flash configurations?

With All-Flash configurations congestion is less common, and the solutions mentioned above are mainly intended for hybrid setups. All-Flash configurations do not have slower capacity disks and generally offer superior I/O performance.

If congestion occurs in an All-Flash configuration, it is either due to incorrect sizing or because buffers fill up faster than they can be drained. In these cases the sizing must be reconsidered or the cause of the high I/O load must be eliminated.

In vSAN environments with heavy write load it is not uncommon to use larger caching devices (more than 10% of the usable disk group capacity) and to add a second disk group with its own cache device.

Update from 27 October, 2018:
See comments

Temporarily changing the vSAN advanced configuration can also help to resolve congestion quickly. However, these parameters should only be changed after consulting VMware support. In the worst case, wrong values can lead to data loss!

– Commands deleted –

Adjusting the congestion thresholds for log, memory and SSD buffers is technically simple: the /LSOM advanced settings shown above can be raised with esxcfg-advcfg -s on every ESXi host in the vSAN cluster. The log limits are specified in GB, while the SSD and memory limits are plain integers, and congestion builds up between the low and the high limit. I have removed the concrete commands (see the update above), because wrong values can make the situation worse; only change these settings together with VMware support.

What else can I do?

Another approach is to limit the incoming IOPS. This can be done through a storage policy that you assign to some or all virtual machines, or by setting an individual IOPS limit on specific virtual disks in the VM settings dialog. The downside is that this intentionally slows down the heavy-hitter VMs.

Conceptually this is the same as the congestion mechanism: it creates a delay at the source and thus limits the total I/O throughput. But it happens in a controlled way and does not affect all virtual machines equally.

Sometimes only a few objects are responsible for the congestion. In that case, adding more disk capacity will not help, because each virtual disk (VMDK) maps to a single vSAN object. This object may be striped across a few disks (depending on the size of the VMDK and the vSAN stripe width), but only those few capacity disks are involved, and adding more disks will not necessarily distribute these objects more evenly.
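
To verify which capacity disks actually hold the components of a suspected heavy-hitter object, you can dump the object layout on a host. This assumes vSAN 6.6 or later, where the esxcli vsan debug namespace is available:

# Lists all objects with their component placement -- the output is long,
# so pipe it to a pager or grep for the VM or object UUID you care about
esxcli vsan debug object list --all | less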

A few final thoughts

If you are considering a hardware upgrade, please keep in mind the vSAN sizing recommendations from VMware:

  • Exactly one cache device in every disk group.
  • A maximum of 7 capacity disks per disk group.
  • And no more than 5 disk groups per ESXi host.
  • The size of the caching device should be 10% of the usable capacity. It is calculated as follows: (Raw Capacity / FTT) * 0.1 (see the quick calculation after this list).

(see https://storagehub.vmware.com/t/vmware-vsan/vmware-r-vsan-tm-design-and-sizing-guide/)
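
For a quick back-of-the-envelope check you can plug your own numbers into the formula from the list above, for example with awk. The 20 TB raw capacity and FTT=1 below are just sample values:

# Cache sizing per the formula above: (raw capacity / FTT) * 0.1
awk -v raw_tb=20 -v ftt=1 'BEGIN { printf "recommended cache size: %.1f TB\n", raw_tb / ftt * 0.1 }'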

If you cannot identify the source of the congestion or are unsure how to solve the issue, do not hesitate to contact VMware GSS. They have great vSAN specialists who know how to identify the root cause and what you can do.

Please do not change the advanced parameters in a production environment on your own without consulting VMware support. In the worst case you can lose all your data!

 


2 Replies to “Understanding congestion in vSAN”

  1. Hello,

    I would advise you remove the incorrect and potentially dangerous information you have placed here:
    “That way you can increase the thresholds for Log, SSD and Memory which are integer and in GB. The low limits are soft limits and at the high limits the throtteling is enforced.”
    This is incorrect and unfortunately indicates that you don’t understand what functions these commands/processes serve.

    “esxcfg-advcfg -s 125 /LSOM/lsomSsdCongestionLowLimitGB
    esxcfg-advcfg -s 150 /LSOM/lsomSsdCongestionHighLimitGB
    esxcfg-advcfg -s 125 /LSOM/lsomMemCongestionLowLimitGB
    esxcfg-advcfg -s 150 /LSOM/lsomMemCongestionHighLimitGB”
    The only saving grace here, preventing people who stumble upon the above from making a bad situation worse, is that none of these commands will change any parameters, as they are incorrect in two places.

    Thanks at least for your warning at the end of the article but both you and I know there are people out there that will set random parameters they see on the internet without reading to the bottom of the page.

    Bob

    1. Hello Bob,

      Thank you very much for your feedback. I really appreciate that. And you are absolutely right. The commands are actually wrong because the keys don’t exist (copy-paste error) and because the values are above the possible maximums (I wasn’t attentive enough here).

      I understand that the congestion mechanism is a flow control that is supposed to move the IOs back to the source when a congestion occurs at lower levels. To detect if a congestion occurs, there are thresholds for the usage of the SSD write buffer, LSOM memory and log size. If these thresholds are exceeded, the IOs are limited and the congestion shifts back to the source.

      Basically, I think this is a very intelligent way to deal with such an overload and I also think the thresholds are well chosen. However, if the infrastructure is severely affected due to congestion and you can hardly manage it anymore, you might want to resolve the congestion as quickly as possible and not particularly gently.

      My approach was therefore to raise the thresholds in order to be able to resolve a temporary congestion more quickly. Or in other words: As much as the hardware can handle.

      Be that as it may, I’ve changed the section now, because in such situations it might be better to consult the VMware support.

      Thank you again for bringing this to my attention.

      Best regards,
      Sebastian
