During my years as a DBA I’ve come across ephemeral port exhaustion just a handful of times. It is rare enough that it is not often considered, is not widely understood, and can cause extreme confusion when it does pop up. Understanding what port exhaustion is and why it occurs can make your job considerably easier should you run into it.
Ephemeral, or dynamic, ports are a range of ports used for short-lived communications. When we connect to a server, we typically know the port we're connecting to. However, the local machine we're connecting from uses an available port from its ephemeral port range to make the connection. When the communication is complete, the connection enters a TIME_WAIT state, and after a default delay of 4 minutes the port can be reused. But what happens when so many connections are opened that all of these ephemeral ports are used up? Well, that’s when the fun, games and ephemeral port exhaustion begin!
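To see an ephemeral port being handed out, here is a minimal Python sketch that opens a throwaway listener on the loopback address and connects to it. None of this comes from the environment described in this post; the function and variable names are purely illustrative. The point is that the client only names the destination, and the operating system picks the local port.

```python
import socket

def open_loopback_connection():
    """Connect to a throwaway local listener and return (server_port, client_port)."""
    # Port 0 asks the OS to choose a free port for the listener.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    server_port = server.getsockname()[1]

    # The client only names the destination; its own port is assigned by the OS.
    client = socket.create_connection(("127.0.0.1", server_port))
    client_port = client.getsockname()[1]

    client.close()
    server.close()
    return server_port, client_port

server_port, client_port = open_loopback_connection()
print(f"Connected to port {server_port} from local port {client_port}")
```

Run it a few times and the local port should change on each run, while always falling inside the machine's dynamic range.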
The difficulty with diagnosing port exhaustion is that it can have a variety of symptoms. Sometimes these symptoms are actually caused by a completely different problem, such as memory pressure or a handle leak. Regardless, if you experience one or more of the following symptoms you will want to investigate further.
- Network connectivity errors
- Inability to access fileshares
- Authentication issues
- High handle counts (a handle is needed for each port)
- Server appears unresponsive or unable to connect
- High numbers of connections in the TIME_WAIT state
- Memory errors, for example 10055: “An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.”
- Multiple SQL Server job failures, particularly jobs that run SSIS packages, which can use up a number of connections
Understanding and configuring for the ephemeral port range is vitally important. My first experience of this was when the team I was working with moved from Windows Server 2003 to 2008. They were especially diligent with their firewall, ensuring only the necessary ports were open. The trouble was that the default ephemeral port range changed with the move to 2008: instead of 3976 ports (1025 to 5000) there were now 16384 ports (49152 to 65535), and you can increase this further. This makes port exhaustion less likely, but because the range had completely changed, our firewall was no longer allowing connections from the ports we needed!
Although this wasn’t port exhaustion, the symptoms were the same. However, the ways I would detect port exhaustion now wouldn’t have been especially helpful in this case. If you’re unsure of the port range you are using, as it is possible to change it, you can run the following:
netsh int ipv4 show dynamicportrange tcp
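If you want to turn that output into an actual first and last port (handy when checking firewall rules), a small Python sketch like the one below can help. The sample text is an assumption based on what the command typically prints on an English-language install of Windows Server 2008 or later; the function name is my own.

```python
import re

# Assumed sample output; labels may differ on non-English installs.
SAMPLE_OUTPUT = """
Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 49152
Number of Ports : 16384
"""

def parse_dynamic_port_range(output):
    """Return (first_port, last_port, count) parsed from the netsh output text."""
    start = int(re.search(r"Start Port\s*:\s*(\d+)", output).group(1))
    count = int(re.search(r"Number of Ports\s*:\s*(\d+)", output).group(1))
    # The range is inclusive, so the last port is start + count - 1.
    return start, start + count - 1, count

first, last, count = parse_dynamic_port_range(SAMPLE_OUTPUT)
print(f"Dynamic range: {first}-{last} ({count} ports)")
```

Feeding it the Windows Server 2003 defaults (start port 1025, 3976 ports) gives a last port of 5000, which is the range our firewall had been built around.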
The next time I saw these symptoms was when I did encounter port exhaustion. The cause this time was a .NET application that was not using connection pooling correctly. It was spinning up far too many connections and not releasing them back to the connection pool. Effectively, this produced a denial of service, despite the server resources showing as healthy. This time the exhaustion was much more visible. Running netstat (which shows current active connections) made it clear that thousands of connections were used up and in a TIME_WAIT state. Note that the commands below don’t count only ephemeral ports, but they will give you an idea. The first is for PowerShell (measure is an alias for Measure-Object) and the second is for the command prompt:
netstat -ano | findstr TIME_WAIT | measure
netstat -an | find /c "TIME_WAIT"
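Since those counts include every port, a rough way to narrow things down to the dynamic range is to parse the netstat output yourself. The Python sketch below is one way to do that; the sample rows are made up, and the default range in the function signature is simply the Windows Server 2008 default, so adjust it if you have changed yours.

```python
from collections import Counter

# Illustrative netstat -ano rows; the addresses, ports and PIDs are made up.
SAMPLE_NETSTAT = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:1433           0.0.0.0:0              LISTENING       1812
  TCP    10.0.0.5:1433          10.0.0.9:50122         ESTABLISHED     1812
  TCP    10.0.0.5:49731         10.0.0.7:445           TIME_WAIT       0
  TCP    10.0.0.5:49732         10.0.0.7:445           TIME_WAIT       0
  TCP    10.0.0.5:52010         10.0.0.8:1433          ESTABLISHED     4420
"""

def ephemeral_states(netstat_text, first_port=49152, last_port=65535):
    """Count TCP connection states whose local port is in the dynamic range."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows look like: TCP <local> <foreign> <state> [<pid>]
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        # rsplit copes with IPv6 addresses such as [::1]:49700
        local_port = int(fields[1].rsplit(":", 1)[1])
        if first_port <= local_port <= last_port:
            states[fields[3]] += 1
    return states

print(ephemeral_states(SAMPLE_NETSTAT))
```

Comparing the total against the size of your dynamic range gives a much better feel for how close to the limit you really are.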
This state occurs after a connection has been closed but before its port is available for reuse. The connections are cleared up after a specified time, the default being 4 minutes. In systems where a high number of TCP connections are opened and closed, it could be beneficial to reduce this default so that ports become available sooner, and in Windows Server 2012 it is possible to configure Windows to reuse the connections without waiting for the delay at all. Amending this may improve connectivity in the short term, but ultimately the solution is to look at the application and ensure that connections are properly released after use.
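To illustrate what "properly released" means, here is a deliberately tiny connection pool in Python. It is not the .NET code from the application in question, and FakeConnection and ConnectionPool are hypothetical names standing in for a real driver. The idea it shows is general: borrow a connection, and guarantee it goes back to the pool even when something fails.

```python
import queue
from contextlib import contextmanager

class FakeConnection:
    """Hypothetical stand-in for a real database connection."""
    def __init__(self, ident):
        self.ident = ident

class ConnectionPool:
    def __init__(self, size):
        self._idle = queue.Queue()
        for ident in range(size):
            self._idle.put(FakeConnection(ident))

    @contextmanager
    def connection(self, timeout=5):
        # Borrow an idle connection, waiting rather than opening a new socket.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()

pool = ConnectionPool(size=2)
for _ in range(1000):
    with pool.connection() as conn:
        pass  # run a query with conn here
print(f"Idle connections after 1000 uses: {pool.idle_count()}")
```

A thousand units of work are served by two connections, so only two ephemeral ports are ever needed. In .NET the equivalent habit is wrapping the connection in a using block so that it is disposed and returned to the pool.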
Running out of ports has been harder since 2008, but an incorrectly coded application can still easily achieve it, and in our case the only true fix was to get the development team to modify the application code. We did temporarily change the TIME_WAIT delay (the TcpTimedWaitDelay registry value) from 4 minutes to 30 seconds. This was not a solution, but it meant that when the application was not under heavy load the website remained online.
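Some back-of-the-envelope arithmetic shows why shortening the delay buys time. The Python sketch below assumes the simplest possible model: every new outbound connection takes one ephemeral port and holds it for the full TIME_WAIT delay, and nothing else is using the range. Real servers are messier, so treat the numbers as a rough ceiling rather than a measurement.

```python
def max_new_connections_per_second(port_count, time_wait_seconds):
    """Rough ceiling on outbound connections per second before ports run out."""
    return port_count / time_wait_seconds

PORTS_2008 = 16384  # default dynamic range size from Windows Server 2008 onwards

for delay in (240, 30):
    rate = max_new_connections_per_second(PORTS_2008, delay)
    print(f"TcpTimedWaitDelay={delay}s -> about {rate:.0f} new connections/second")
```

Under those assumptions the ceiling moves from roughly 68 to roughly 546 new connections per second, which fits what we saw: the site coped at normal load but still fell over when busy.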
Since my first encounter with port exhaustion, I’ve only seen similar issues a handful of times. One of the more interesting occurrences was experienced by one of my colleagues, through no fault of the application or infrastructure. On this occasion it was a Windows bug that meant connections in a TIME_WAIT state were not closed down even after the 4 minutes. The interesting thing about this bug was that it occurred only after a server had been up for 497 days, which is a fair amount of time without any scheduled restarts and perhaps why I had not come across it sooner.
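The 497-day figure is suggestive. It matches the time a 32-bit counter takes to wrap around if it is incremented every 10 milliseconds. I haven't confirmed that this is the mechanism behind the bug, so the tick interval below is an assumption, but the arithmetic is easy to check:

```python
TICK_SECONDS = 0.01          # assumed 10 ms tick interval
COUNTER_VALUES = 2 ** 32     # distinct values of a 32-bit counter
SECONDS_PER_DAY = 24 * 60 * 60

days_until_wrap = COUNTER_VALUES * TICK_SECONDS / SECONDS_PER_DAY
print(f"A 32-bit counter of 10 ms ticks wraps after {days_until_wrap:.1f} days")
```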
It also seems fairly common for jobs that run SSIS packages to be the first to start failing. SSIS packages, depending on their complexity, can open a large number of connections, which means they can often tip a heavily loaded server over the edge. SSIS connection managers have a 'RetainSameConnection' property, which defaults to false. It may be worth setting it to true, but you need to be aware of the other impacts of this before implementing it.
If you do think you are experiencing port exhaustion, the first thing to do is run netstat to get an idea of the number of ports in use:
netstat -an | find /c /v ""
If it’s near the limit (assuming you know your dynamic port range), you may want to pipe the output to a file so you can investigate further:
netstat -ano > "FileLocation.txt"
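Once you have the output saved, one useful first question is which process owns the connections. The Python sketch below counts TCP rows per PID, relying on the PID column that the -o switch adds. The sample rows are made up; in practice you would read the text from the file you just created.

```python
from collections import Counter

def connections_per_pid(netstat_text):
    """Count TCP rows per owning PID from saved netstat -ano output."""
    per_pid = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # With -o, TCP rows end with the PID: TCP <local> <foreign> <state> <pid>
        if len(fields) == 5 and fields[0] == "TCP":
            per_pid[int(fields[4])] += 1
    return per_pid

# Made-up sample rows standing in for the contents of the saved file.
sample = """
  TCP    10.0.0.5:49731    10.0.0.7:1433    ESTABLISHED    4420
  TCP    10.0.0.5:49732    10.0.0.7:1433    ESTABLISHED    4420
  TCP    10.0.0.5:49733    10.0.0.7:1433    TIME_WAIT      0
  TCP    10.0.0.5:49734    10.0.0.8:445     ESTABLISHED    4
"""
for pid, total in connections_per_pid(sample).most_common():
    print(pid, total)
```

Connections in TIME_WAIT are typically listed against PID 0 because the owning process has already closed them, so a large count there tells you that something is churning connections without telling you what. Looking at the foreign addresses of those rows is usually the next step.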
A reboot of the server will instantly, but temporarily, fix the issue. Ensure you’ve gathered enough information before you do this. Increasing the port range or decreasing the default timeout can also help, but only do this if you understand the root cause or are using it as a temporary fix while you investigate.
The root cause of ephemeral port exhaustion can often be elusive. However, being aware of the symptoms, the methods of diagnosing it, the known bugs and the common causes goes a long way towards providing a solid action plan for resolution.
By ensuring that the latest server patches are applied, that our applications are pooling connections correctly, and that we have enough ports available for the connections required, port exhaustion can be a problem that we, hopefully, never need to face.
Please see the following articles for further reference:
- Change of dynamic port range from Windows 2003 - 2008
- How to change the port range or decrease the default time out
- Windows 2012 auto reuse port range
- Windows updates to look at applying