Login timeouts in SQL Server due to delayed SSPI handshakes
By: Date: March 10, 2019 Categories: extended events,procmon,SQL Server,wait_info_external

Rarely occurring login issues in SQL Server are hard to explain when there is no significant load on the Instance or its host. A good idea for troubleshooting is using extended events, symbol files and the sysinternals tool, procmon.

For some applications it is crucial to have a login timeout of less than 5 seconds. Now if an active directory user cannot be validated within this time frame, then the login will fail and the application gets an error message like this one for example:

The security database on the server does not have a computer account for this workstation trust relationship.

When searching in the system_health extended event session we can see that there were wait_info_external events which lasted more than 5 seconds. For this wait event the callstacks are collected, so when the symbol files are installed for SQL Server then we can have a clue about where that wait time was spent. Here is a nice overview on how to download the symbol files for different SQL Server versions: https://github.com/arvindshmicrosoft/SQLCallStackResolver/wiki/Obtaining-symbol-files-(.PDB)-for-SQL-Server-Releases


To use the symbol files and look for the wait_info_external events the following query can be used:

DBCC TRACEON (3656, -1);
SELECT CONVERT(datetime2, SWITCHOFFSET(CONVERT(datetimeoffset, CONVERT(XML, event_data).value(N'(event/@timestamp)[1]',N'nvarchar(max)')), DATENAME(TzOffset, SYSDATETIMEOFFSET()))) as ts
, CONVERT(XML, event_data) as event_data_xml
, object_name
, file_name
, file_offset
FROM sys.fn_xe_file_target_read_file(N'System_health*.xel', NULL, NULL, NULL)
where object_name = 'wait_info_external'
order by ts desc
DBCC TRACEOFF (3656, -1);

Now we can see the following stack trace in the event_data column of the extended event:

 sqldk.dll!XeSosPkg::wait_info_external::Publish+0x1a9
sqldk.dll!SOS_Scheduler::UpdateWaitTimeStats+0x596
sqldk.dll!SOS_Task::PopWait+0xba
sqllang.dll!SOS_Task::AutoSwitchPreemptive::~AutoSwitchPreemptive+0x5d
sqllang.dll!Sspi::Handshake+0x12bd
sqllang.dll!CryptoBase::ProcessAndWriteSecurityToken+0x118
sqllang.dll!CryptoBase::ProcessAddProvider+0x19f
sqllang.dll!SNIProcessAddProviderOnWorker+0x39
sqldk.dll!SOS_Task::Param::Execute+0x231
sqldk.dll!SOS_Scheduler::RunTask+0xaa
sqldk.dll!SOS_Scheduler::ProcessTasks+0x3cd
sqldk.dll!SchedulerManager::WorkerEntryPoint+0x2a1
sqldk.dll!SystemThread::RunWorker+0x8f
sqldk.dll!SystemThreadDispatcher::ProcessWorker+0x2de
sqldk.dll!SchedulerManager::ThreadEntryPoint+0x1d8
KERNEL32.DLL!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34

Here is what we were looking for: sqllang.dll!Sspi::Handshake
This proves that it is actually an SSPI handshake the SQL Server is waiting for.

We can get more information on this handshake when we trace the lsass.exe process on the host of the SQL Server instance with procmon. We can see that the SQL Server is requesting data from the Domain Controller to validate the kerberos credentials. In most cases the domain controller gives an answer within some milliseconds so the Active Directory account connecting to the SQL server can be validated quickly and the connection can be established.

lsass.exe procmon trace no delay

In the picture above we can see that the domain controller dc1 responded in less than a millisecond to the request of the SQL Server instance.

As I wanted to test if the answer from the dc1 domain controller is always so quick I created the following python script to establish a kerberos connection to the SQL Server isntance every second:

import win32com.client
import time
import datetime
import os
import sys
from datetime import datetime

server = 'sqlvm1\SQLInstance1'
database = 'master'
port = 1433
driver= '{ODBC Driver 11 for SQL Server}'

now = datetime.now()
file_ts = now.strftime("%d%m%Y_%H%M%S")

orig_stdout = sys.stdout
f = open('c:/temp/sql_conn_py_test_' + file_ts + '.txt', 'w')
sys.stdout = f
connStr = 'provider=MSDASQL;DRIVER='+driver+';SERVER='+server+';PORT=' + str(port) + ';DATABASE='+database+';trusted_connection=yes;'

while True:
	os.system("klist purge >nul")
	cnxn = win32com.client.Dispatch('ADODB.Connection')
	start_time = time.time()
	conn_open = 0
	e=''
	try:
		cnxn.Open(connStr)
		conn_open = 1
	except:
		e = sys.exc_info()[0]
	
	elapsed_time = round(time.time() - start_time, 1)
	now = datetime.now()
	
	if conn_open == 1:
		cnxn.Close()
	
	if elapsed_time > 0.1:
		print now.strftime("%d%m%Y_%H%M%S;")+str(elapsed_time) + ' ' + str(e)
		sys.stdout.flush()
	time.sleep(1)

When running the script from a host different from the SQL Server instance host it will log the length of each connection which took longer than 0.1 Second in a logfile in the c:\temp directory. Please note that while running the script I was also tracing the activity of lsass.exe on the sqlvm1 host.

After some time, establishing the connection to the SQL Server instance took more than 9 seconds, so I checked procmon trace for the timestamp written in the logfile of my script and I could clearly see that it was the domain controller dc1 which needed more than 9 seconds to send an answer to my SQL Server instance for validating my Active Directory account:

lsass.exe procmon trace delayed handshake

This proves that it was not the SQL Server or the host which was taking a long time with validation. So the root cause of the problem must be in active directory.

Leave a Reply

Your email address will not be published. Required fields are marked *