We had an interesting support issue within our own network on New Year’s Eve. The developer in charge tells his story below:
Right after 1:00am CET, I started to receive notifications that a Resource Limit had been matched on the network: average load >10. At first I thought it was a scheduled task, but I quickly realized that nothing that “heavy” would be scheduled at midnight (GMT). So I decided to stop watching my boring movie and look into what was going on.
I found a lot of failed calls, all with the same reason: CAC (call admission control) was exceeded. I also noticed that there were a lot of calls, even though it was New Year’s Eve and most of our customers’ businesses were closed for the holidays.
The top command showed that all FreeSWITCH processes and mysqld were using >100% CPU. It was a mystery: there were no heavy SQL queries, and even a FreeSWITCH with zero calls was using >100% CPU.
First, I restarted the FreeSWITCH instances one by one and tried different switches for the RT clock and clock sync. Nothing helped. FreeSWITCH had been compiled two years earlier and nothing had changed since. How had it made it through the previous New Year’s Eve?
FreeSWITCH was rejecting calls because of a constraint that at least 5% of the CPU must be idle to ensure transcoding quality. I decided to disable this constraint and left three FreeSWITCH instances running to keep the load down and allow calls to pass. They would occupy three cores and still have room for calls. I made several test calls and had no problems with quality. That seemed good enough while I tried to find and solve the underlying problem.
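The idle-CPU constraint described above can be illustrated with a minimal sketch. This is not FreeSWITCH's own implementation; the function names are hypothetical, and it assumes a Linux host where the aggregate "cpu" line of /proc/stat is sampled twice. Counting iowait as idle time is also an assumption of this sketch.

```python
# Hypothetical sketch of an idle-CPU admission check (not FreeSWITCH code).

def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(value) for value in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait (assumption)
    total = sum(ticks[:8])  # guest time is already included in user/nice
    return idle, total


def idle_percent(before, after):
    """Percentage of CPU time spent idle between two 'cpu' line samples."""
    idle_before, total_before = parse_cpu_line(before)
    idle_after, total_after = parse_cpu_line(after)
    delta_total = total_after - total_before
    if delta_total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (idle_after - idle_before) / delta_total


def should_accept_call(idle_pct, min_idle_pct=5.0):
    """Accept a new call only while at least min_idle_pct of the CPU is idle."""
    return idle_pct >= min_idle_pct


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

To use it, call read_cpu_line() twice about a second apart and pass both samples to idle_percent(). With every process spinning at >100% CPU, a check like this reports almost no idle time and rejects every call, which matches the failures seen that night.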
This is how it looked in the morning, when I finally had to get some sleep without having solved the problem.
Later that day I decided to try the newer FreeSWITCH 1.6 and 1.8. I compiled, prepared, and tried both, but it did not help: CPU usage was >100% again. Then I compiled a debug version and added it as a fourth FreeSWITCH in production to get a core dump and figure out why an idle FreeSWITCH was eating the CPU. I realized that FreeSWITCH had a problem with timers. I tried a different timer, but that did not work either. There had been problems with the nanosleep function in FreeSWITCH, but those had already been fixed. Somehow FreeSWITCH squeezed time and jumped into a loop.
Then I checked the time on the host; it was correct. Even /proc/interrupts looked correct.
My last option was to adjust the time on the host and sync with NTP. Restarting the ntp-client failed. Fortunately, that turned out to be only a DNS problem: the host was unable to reach our local DNS resolver. I temporarily switched to an external nameserver so that the NTP server name would resolve and the time could be corrected.
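For readers who want to check a host against an NTP server by hand, here is a minimal SNTP sketch. It is an illustration, not the tool used that night: the function names are hypothetical, the offset ignores network delay, and the server name must be supplied by the caller. It also decodes the leap indicator bits that an NTP server uses to announce an upcoming leap second.

```python
import socket
import struct

NTP_TO_UNIX = 2208988800  # seconds between 1900-01-01 and 1970-01-01

LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "clock not synchronized",
}


def parse_sntp_reply(packet):
    """Decode the header byte and transmit timestamp of a 48-byte (S)NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    first = packet[0]
    leap = first >> 6              # leap indicator, 2 bits
    version = (first >> 3) & 0x7   # protocol version, 3 bits
    mode = first & 0x7             # 4 means "server"
    seconds, fraction = struct.unpack("!II", packet[40:48])
    transmit_unix = seconds - NTP_TO_UNIX + fraction / 2**32
    return {"leap": leap, "version": version, "mode": mode,
            "transmit_unix": transmit_unix}


def clock_offset(packet, local_unix_time):
    """Rough offset in seconds (server minus local), ignoring network delay."""
    return parse_sntp_reply(packet)["transmit_unix"] - local_unix_time


def query_sntp(host, timeout=2.0):
    """Send one client request (version 4, mode 3) and return the raw reply.

    Raises socket.gaierror if the server name cannot be resolved,
    which is the kind of failure a broken DNS resolver would cause.
    """
    request = b"\x23" + 47 * b"\x00"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, 123))
        reply, _ = sock.recvfrom(512)
    return reply
```

A typical use would be to call query_sntp() with the name of an NTP server, pass the reply and time.time() to clock_offset(), and look up the "leap" field in LEAP_MEANINGS.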
The moment NTP finished syncing, all processes went back to <5% CPU usage.
I realized that it was the leap second bug in an old kernel. It had affected both FreeSWITCH and MySQL, and it was cleared when NTP synced the time.