Monday, January 31, 2011

Generating a memory dump for an IIS 6 hang

This post follows up on my sample demonstration and explains how to get a memory dump from a production web server crash.

Environment:
ASP .NET 3.5
IIS 6
Windows Server 2003 64bit
Windbg 6.11.0001.404 AMD64

w3wp was crashing on the server with this in the event log.

Event Type: Error
Event Source: .NET Runtime 2.0 Error Reporting
Event Category: None
Event ID: 1000
Date: 3/3/2010
Time: 1:22:18 PM
User: N/A
Computer: _____________
Description:
Faulting application w3wp.exe, version 6.0.3790.3959, stamp 45d691cc, faulting module kernel32.dll, version 5.2.3790.4480, stamp 49c51cdd, debug? 0, fault address 0x0000000000027ded.


For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

I set adplus to get a memory dump on hang.

After spelunking around I found out I was debugging the wrong dump file w3wp.exe-APPNAME

decided to debug a different dump file: w3wp.exe_-CRMAppPool
.sympath SRV*c:\debugsymbols*http://msdl.microsoft.com/download/symbol
.loadby sos mscorwks
!threads

ThreadCount: 11
UnstartedThread: 0
BackgroundThread: 11
PendingThread: 0
DeadThread: 0
Hosted Runtime: yes
                                              PreEmptive                                                Lock
       ID OSID        ThreadOBJ     State   GC     GC Alloc Context                  Domain           Count APT Exception
XXXX    1 1a14 00000000001aa520   1808220 Disabled 0000000000000000:0000000000000000 00000000055650d0     1 Ukn (Threadpool Worker) System.StackOverflowException (00000000100b10d0)
XXXX    2 2014 000000000019d5c0      b220 Enabled  0000000000000000:0000000000000000 0000000000122e20     0 Ukn (Finalizer)
XXXX    3 2298 0000000005722570      1220 Enabled  0000000000000000:0000000000000000 0000000000122e20     0 Ukn
XXXX    4 1524 000000000556b570    80a220 Enabled  0000000000000000:0000000000000000 0000000000122e20     0 Ukn (Threadpool Completion Port)
XXXX    5  904 000000000556bb40   200b220 Enabled  0000000000000000:0000000000000000 00000000055650d0     0 Ukn
XXXX    7 2228 000000000c7e17b0   880a220 Enabled  0000000000000000:0000000000000000 0000000000122e20     0 Ukn (Threadpool Completion Port)
XXXX    8 2324 000000000c6ee560   200b220 Enabled  0000000000000000:0000000000000000 00000000055650d0     0 Ukn
XXXX    9 19f8 00000000055c7480   200b220 Enabled  0000000000000000:0000000000000000 00000000055650d0     0 Ukn
XXXX    a 2104 0000000005b8b9e0   180b220 Enabled  0000000000000000:0000000000000000 00000000055650d0     1 Ukn (Threadpool Worker)
XXXX    b 213c 000000000d039d10   180b220 Enabled  0000000000000000:0000000000000000 0000000000122e20     0 Ukn (Threadpool Worker)
XXXX    6 2370 000000000de00b90   200b220 Enabled  0000000000000000:0000000000000000 00000000055650d0     1 Ukn

Aha!  three threads with locks and one with a StackOverflowException.
The column with the XXXX is supposed to list out the native thread so I can switch to it and see the problem.
0:000> ~
.  0  Id: 226c.2274 Suspend: -1 Teb: 000007ff`fffa2000 Unfrozen

When this memory dump was taken the native thread had already been cleaned up.

A KB article provided instructions on running an action when a process is orphaned.

Following those instructions I get a log file but no .dmp :(

I modify action.cmd changing %COMMAND% to %COMMAND% >> %LOG%

Now the log displays: 

Microsoft (R) Windows Debugger Version 6.11.0001.404 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Cannot debug pid 8220, Win32 error 0n87
    "The parameter is incorrect."
Debuggee initialization failed, Win32 error 0n87
    "The parameter is incorrect."

Hmmm... maybe I can manually take a dump.

Using iisapp I see:

W3WP.exe PID: 11072   AppPoolId: ClientProject
W3WP.exe PID: 10824   AppPoolId: 3rdPartyVendor
W3WP.exe PID: 4836   AppPoolId: AnotherClientProject
W3WP.exe PID: 8220   AppPoolId: CRMAppPool

create problem.  wait 30 seconds:

C:\>"C:\Program Files\Debugging Tools for Windows (x64)\cdb.exe" -c ".dump /o /m
a c:\crash_PID_11072_3_03_2010_10_01_16_54.dmp;q" -p 11072

success!  Using this dump we were able to identify the problem.

No comments: