Web Service performance testing, monitoring, and troubleshooting

The IA team is wrapping up our first round of performance and scalability testing for Intelligent Authentication 1.1 here at Corillian, and I’ve got to tell you, this thing performs! In the past I’d seen hints and rants on the Internet about .NET Web Services performance being slow, which made me wary of what I was going up against for performance testing. I’ve got to tell you that .NET Web Services flat out SCREAM! What is “flat out SCREAM”? I’m talking about response times of one-tenth of a second on a loaded web server (CPU at 70%), and a little over two-tenths of a second when the CPU averages 98% (about ready to tip over and catch fire). Getting there was a bit of a challenge, but we’re there. Whew! I think we all learned a lot; the two big lessons for me were threading and the Web Service performance counters. Here is a list of the BIG hurdles and how we got over them:

1st hurdle
At about 80 method requests per second the Web Server started returning 503 errors to SilkPerformer. Method requests were receiving the error:

“[HttpException (0x80004005): Server Too Busy]    System.Web.HttpRuntime.RejectRequestInternal(HttpWorkerRequest wr) +148”

1st hurdle fix
Tune the Machine.config file to Microsoft’s performance recommendations for Web Services, laid out in the KB article:
Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications
To understand what it all means, read:
Chapter 17 – Tuning .NET Application Performance
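
The KB article boils down to a handful of Machine.config settings. Here’s a rough sketch of the commonly cited starting values (each fragment goes in its respective section of Machine.config; the numbers scale with CPU count, and the values below assume a 2-CPU box, so tune for your own hardware):

```xml
<!-- system.net section: maxconnection = 12 * number of CPUs -->
<connectionManagement>
  <add address="*" maxconnection="24" />
</connectionManagement>

<!-- system.web section: minFreeThreads = 88 * N, minLocalRequestFreeThreads = 76 * N -->
<httpRuntime minFreeThreads="176" minLocalRequestFreeThreads="152" />

<!-- system.web section: thread limits are interpreted per CPU -->
<processModel maxWorkerThreads="100" maxIoThreads="100" />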

2nd hurdle
At about 115 method requests per second the Web Server started returning 503 errors to SilkPerformer. Method requests were AGAIN receiving the error:

“[HttpException (0x80004005): Server Too Busy]    System.Web.HttpRuntime.RejectRequestInternal(HttpWorkerRequest wr) +148”

But this time things were a bit different. The Machine.config settings were set to the recommended values, and the actual number of threads was maxed out too (maxWorkerThreads and maxIoThreads were both set to the limit of 100). I asked the Corillian Scalability team if they had ever seen such a thing, and lo and behold, they had. It turns out that when they ran the Voyager 70,000-concurrent-user test at the Microsoft Scalability Lab a couple of years ago, they hit the same issue.

2nd hurdle fix
According to our friends at Microsoft (an MS Engineer in the scalability lab) you need to change the Machine.config default value for appRequestQueueLimit from 100 to 5000. Bam! Issue fixed. We moved on. The setting is probably a little overkill, but the actual setting for you will vary depending on your hardware. Five thousand will nearly guarantee that this setting won’t be your bottleneck anymore.
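
For reference, the change is a single attribute on the httpRuntime element in Machine.config (the default of 100 is what the runtime was rejecting against with the 503s):

```xml
<!-- system.web section of Machine.config; default in .NET 1.x is 100 -->
<httpRuntime appRequestQueueLimit="5000" />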

3rd hurdle
The Web Servers were only processing 150 method requests per second no matter how much load we put on them. We had a bottleneck somewhere but couldn’t seem to find it. Adding various counters revealed that the ASP.NET request queue was pretty “spikey” and sometimes pegged right around 100. The more load we put on, the higher the queue and the higher the response times. In retrospect this counter was my obvious clue, but I just didn’t know enough at the time.

3rd hurdle fix
The fix ended up being a thread limit we had set in our own Web Service: the number of threads allowed to write to our Auditlog in SQL was capped at 5. Bumping this up solved the issue; twenty-five ended up being the perfect number for our hardware. Poring over Microsoft’s performance recommendations several times and trying different Machine.config settings to no avail left me staring at the following picture, walking through the application flow myself several times before concluding (guessing) that the bottleneck had to be inside the actual Web Service. Monitoring a custom counter in our Web Service revealed a huge pool of requests waiting to write to our log (due to the limited threads). Hitting that pooling threshold in our app caused requests to back up into the ASP.NET request queue. Makes sense now (hindsight is 20/20). Here is that helpful image:
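
For the curious, here’s a rough sketch of the kind of thread cap that bit us. The names and shape are hypothetical (this is not our actual Auditlog code, and it uses the Semaphore class for brevity), but it shows the mechanism: once all writer slots are busy, callers block here, and that back-pressure eventually spills into the ASP.NET request queue:

```csharp
using System;
using System.Threading;

public class AuditLogWriter
{
    private readonly Semaphore _writers;

    // 5 was our original (too low) limit; 25 worked for our hardware.
    public AuditLogWriter(int maxConcurrentWrites)
    {
        _writers = new Semaphore(maxConcurrentWrites, maxConcurrentWrites);
    }

    public void Write(string entry)
    {
        _writers.WaitOne(); // blocks when every writer slot is busy
        try
        {
            // INSERT INTO AuditLog ... (the SQL write goes here)
        }
        finally
        {
            _writers.Release();
        }
    }
}
```

The custom performance counter we monitored was essentially the number of callers waiting at that WaitOne.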

Counters I used the most for Web Service performance testing:
Microsoft’s performance recommendations point you to all the right counters. There are quite a few, but these are the ones I leaned on the most:

To monitor my SQL 2000 Database Server:

PhysicalDisk\Avg. Disk Queue Length
Processor\% Processor Time
Memory\Committed Bytes
Network\Bytes Received/sec
Network\Bytes Sent/sec

To monitor my Web Server that was hosting the Web Service:

ASP.NET\Requests Queued
Processor\% Processor Time
Web Service\Total Method Requests/sec
Memory\Committed Bytes
Network\Bytes Received/sec
Network\Bytes Sent/sec
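
If you’d rather watch a couple of these from code than from Perfmon, the standard System.Diagnostics API reads the same counters by category and name. A minimal sketch (run it on the web server itself; category names like “ASP.NET” can vary slightly by ASP.NET version):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class CounterWatch
{
    static void Main()
    {
        PerformanceCounter queued = new PerformanceCounter(
            "ASP.NET", "Requests Queued");
        PerformanceCounter cpu = new PerformanceCounter(
            "Processor", "% Processor Time", "_Total");

        while (true)
        {
            // NextValue on rate/percentage counters needs two samples
            // to be meaningful, hence the polling loop.
            Console.WriteLine("Queued: {0}  CPU: {1:F0}%",
                queued.NextValue(), cpu.NextValue());
            Thread.Sleep(1000);
        }
    }
}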

When the tests weren’t going so well I pretty much added all the counters you can find in the performance recommendation links I provided above, which obviously helped with troubleshooting. Also helpful to the IA team was the SQL performance tuning book: Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook by Ken England.

What’s next? Well, we’re out of hardware here at Corillian so we’ll be heading up to the Microsoft Scalability Labs in about a month to really push the limits for both IA and Voyager on their GIA-HUGEY hardware.
