Identifying the cause of too many CLOSE_WAIT connections in IIS
I have a Windows server running a web API that serves an Android app, and today I started getting alarms saying that the server was timing out.
The server is running behind Cloudflare.
When I connected to the server via RDC, I noticed that it was using 0% CPU but had more than 3,200 open connections.
The "normal" number of connections is closer to 300, so this was roughly 10x the usual load.
I thought the server was under attack, so I enabled Cloudflare's "I'm Under Attack" mode, but it made no difference.
I restarted IIS by running iisreset; the server came back to normal for a few minutes, then the number of connections started climbing again!
I jumped into Cloudflare's support chat, and the agent said they weren't seeing anything out of the ordinary and that there was nothing they could do.
My server only allows connections from Cloudflare's servers.
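For anyone wondering, that restriction can be expressed in Windows Firewall roughly like this (a sketch, not my exact rule; the ranges below are only illustrative, and the full current list is published at https://www.cloudflare.com/ips/):

# Sketch: allow inbound HTTP only from Cloudflare's published ranges.
# Replace the example ranges with the full, current list from https://www.cloudflare.com/ips/
$cfRanges = @('173.245.48.0/20', '103.21.244.0/22')   # example entries only
New-NetFirewallRule -DisplayName 'HTTP from Cloudflare only' `
    -Direction Inbound -Action Allow -Protocol TCP -LocalPort 80 `
    -RemoteAddress $cfRanges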
I decided to check what those connections were and when I ran netstat, I got this:
Active Connections
Proto Local Address Foreign Address State
TCP xxx:80 CF_IP_ADDRESS.157:13824 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.157:17952 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.173:21754 ESTABLISHED
TCP xxx:80 CF_IP_ADDRESS.173:22890 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.173:24456 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.173:55678 ESTABLISHED
TCP xxx:80 CF_IP_ADDRESS.173:63352 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.195:31634 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.195:56504 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.195:62466 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.205:14264 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.205:37858 ESTABLISHED
TCP xxx:80 CF_IP_ADDRESS.205:47142 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.205:50318 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.205:57534 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.205:63570 ESTABLISHED
TCP xxx:80 CF_IP_ADDRESS.211:35054 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.217:26940 ESTABLISHED
TCP xxx:80 CF_IP_ADDRESS.217:29042 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.217:37898 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.217:39096 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.217:46002 CLOSE_WAIT
TCP xxx:80 CF_IP_ADDRESS.217:63860 CLOSE_WAIT
These are just a few lines out of 3,622.
The interesting part is that, of those 3,622 connections, 2,992 were in the CLOSE_WAIT state.
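If anyone wants the per-state breakdown without counting lines by hand, something like this in PowerShell (a quick helper, assuming the site listens on port 80) gives the totals:

# Count port-80 connections grouped by TCP state (CLOSE_WAIT, ESTABLISHED, ...)
Get-NetTCPConnection -LocalPort 80 |
    Group-Object -Property State |
    Sort-Object -Property Count -Descending |
    Format-Table Name, Count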
As I said, whenever I ran iisreset, everything worked normally for a few minutes before requests from genuine users of the app started timing out again.
Cloudflare support said they couldn't see anything out of the ordinary, so I'm not sure whether this was an attack or something else.
The server is running IIS; could this be a bug somehow? Is there any attack that follows this pattern and would leave a lot of connections stuck in CLOSE_WAIT?
Any help would be really appreciated.
The server is running Windows Server 2016 and IIS 10.
OK, I'll post my findings here, just in case anyone needs them.
Around 10 hours before this issue started, I had run Windows Update and KB5005698 was installed. The update was applied to both of the servers that support the Android app.
Oddly enough, the issue started at the same time on both servers, which is why I initially suspected an attack.
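If you want to check whether the same update is installed on your own server, this PowerShell one-liner is enough (it writes an error if the hotfix isn't present):

# Check whether KB5005698 is installed on this machine
Get-HotFix -Id KB5005698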
Once the server was no longer under high load, the issue stopped, and I decided to migrate the web API from .NET 5 to .NET 6: I installed the server hosting bundle and deployed the new build.
Since the issue had already stopped before the migration, nothing appeared to have changed, so I just left the .NET 6 build in place.
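If you go through the same migration, it's worth confirming that the .NET 6 runtime actually landed on the box after installing the bundle; a quick check from a command prompt (just a sanity check, not part of the fix):

# List the .NET runtimes installed on the server
dotnet --list-runtimes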
Around 4 hours ago, I started getting alarms again, but this time it was because the web API was returning an excessive number of HTTP 500s, while the number of connections stayed normal. So I decided to revert the app to the .NET 5 version.
As soon as I did that, the number of connections started to increase, shooting past 5k in just a minute, and the timeouts were running free! I kept running iisreset and the same pattern kept repeating.
So I swapped back to the .NET 6 build: the connections stopped piling up, but the HTTP 500s came back after a while.
It turned out the HTTP 500 was an easy code fix, so I fixed it and deployed again, targeting .NET 6.
Now there are no more runaway connections and everything seems to be working smoothly.
So I came to the conclusion that the issue lies with the combination of KB5005698 and .NET 5.
Deploying the same app targeting .NET 6 fixed the problem.
After thousands of bad reviews and lost revenue, everything is back to normal...
Lesson learned: I will never update the server again unless I really need to.
Hope it helps someone.