Friday, November 15, 2013

Use application lock to synchronize T-SQL code

Application lock can be used to synchronize T-SQL code, you can control if the session is able to run the t-sql code at same time, it is like the lock keyword in C#.

1. Get lock
you can get application lock with store procedure sp_getapplock, which must be run from within a transaction.

EXEC @res = sp_getapplock @Resource = 'mylock', @LockMode = 'shared';
select @res

2. Release lock
EXEC sp_releaseapplock @Resource = 'mylock';

Here is example
1. Start a session and run the code below
--spid 52
EXEC @res = sp_getapplock @Resource = 'mylock', @LockMode = 'exclusive';
select @res

2. Open another session and run the code below
--spid 54
EXEC @res = sp_getapplock @Resource = 'mylock', @LockMode = 'shared';
select @res

the session spid 54 will be blocked, you can check the blocking status with "sp_lock" or DMV "sys.dm_tran_locks".

here, spid54 is blocked by spid52 and waiting for application lock "mylock". 
if you release lock in spid 52, spid54 can get the application share lock then. by this way you can synchronize the code execution between sp_getapplock and sp_releaseapplock among multi-thread.

like all other lock, application lock will cause deadlock issue as well, please note, if the sp_getapplock is failed because of deadlock victim, the transaction will not be rollback, sp_getapplock will return -3 instead. it will cause orphan transaction sometimes, so you need to rollback the transaction explicitly. 

Saturday, November 9, 2013

log shipping restore job failed on secondary server

Several Situation will cause the log shipping restore job failed on secondary server

1. Restore job is not able to access the transaction log backup file.

    a) Other process is accessing the backup file, and lock the file, so restore job can not access it at the same time. 
        you can use "Process Explorer" to check which process locks the transaction log backup file
    b)permission issue. 
        check the secondary server,  make sure the SQL Agent service startup account has been granted correct permission.

2. Transaction log backup file is corrupted
    Then have to rebuild the log shipping.....

3. Some log backup file is missing
    All the log backup files should be consecutive. the restore job restores the log backup files based on the time stamp in backup file name. on secondary, System table msdb..log_shipping_secondary_database has a column saving the file name of last restored backup file,  any log backup file with bigger time stamp in its file name will be restored, if there is log backup file missing, the lsn chain will be broken, then the restore job will be failed. In order to find out the missing file, you can:
    a) on primary server, run the query below to list all log backup file history
select distinct s.first_lsn, s.last_lsn, s.backup_finish_date,y.physical_device_name
from msdb..backupset as s inner join
msdb..backupfile as f on f.backup_set_id = s.backup_set_id inner join
msdb..backupmediaset as m on s.media_set_id = m.media_set_id inner join
msdb..backupmediafamily as y on m.media_set_id=y.media_set_id
where (s.database_Name='xxx') order by s.backup_finish_date desc

   b) on Secondary, 
       run "Restore headeronly" to check the log backup file first_lsn and Last_lsn

   c) check the database last lsn on secondary server with "DBCC DBTABLE"
       --Get the database last lsn
declare @str char(64),
@file bigint,
@file_size int,
@array bigint,
@array_size int,
@slot int,
@command nvarchar(400),
@end_of_log_lsn     numeric(25,0),
@database_name varchar(100)
        --Set the database name here
Set @database_name='mytest'

set @command = N'dbcc dbtable(' + CONVERT (nvarchar, DB_ID(@database_name)) + N') with tableresults, no_infomsgs'

declare @temp table(parentObject nvarchar(255),Object nvarchar(255),field nvarchar(255),value nvarchar(255))
insert into @temp exec( @command )
select @str = value from @temp where field like '%m_flushLSN%';
set @file_size = charindex(N':', @str)
set @file = cast( LEFT( @str, @file_size - 1) as bigint)
set @array_size = charindex(N':', @str, @file_size + 1)
set @array = cast( substring( @str, @file_size + 1, @array_size - @file_size - 1) as bigint)
set @slot = cast( substring( @str, @array_size + 1, charindex(N' ', @str, @array_size + 1) - @array_size ) as int)
set @end_of_log_lsn = @file * 1000000000000000 + @array * 100000 + @slot
-- make sure @slot < MAX_SHORT (1024 * 64)
-- check to make sure tht file and array are < 4 Gig.

if (@end_of_log_lsn < 1000000000000000 or @slot >= 65536 or @file >= 4294967296 or @array >= 4294967296)
set @end_of_log_lsn = null

select @end_of_log_lsn as 'Last LSN'
After you get the last lsn of the database on secondary, comparing it with the log backup file first_lsn and Last_lsn to see if it is out of the backup file lsn range. if you can not find the db last lsn fit into any log backup files, then it mean there is file missing, then use the query in step a) to search which file is missing. 

and bear in mind, the first thing is checking restore job history to get the detail error message, then use the steps upper to narrow down the issue. 

Friday, November 8, 2013

"Failover Partner" keyword in connection string - Q&A

1. How to connect client to a Database Mirroring Session with failover aware

A: add FailoverPartner keyword in the connection string, for instance:
"Server=sq01;Failover Partner=sq02; Database=mydb; Network=dbmssocn;Trusted_Connection=yes"

2. why I get timeout error more frequently when I connect to mirrored database with keywords "Failover Partner"?

A: You probably encounter a .Net bug which happens on mirrored database, please refer to the link below
the workaround would be 
•Set the time-out connection string to the 150 

3. If the "failover partner" parameter is incorrectly setup in the connection string, is the client still aware of the failover, and connect to the mirror(new primary) after failover?

A: The Answer is YES.
when client connect to the primary server first time, it will download the mirror server name, and cache it in client process as failover partner name, it means, even if you set the wrong "Failover Partner" value in the connection string, client still can connect to the correct mirror server when failover happen. for instance, 

we have database mirror setup on SQ01 and SQ02, SQ01 is primary, in your connections string, we use SQ03 as failover partner as below, actually sq03 is not existed. 

"Server=sq01;Failover Partner=sq03; Database=mydb; Network=dbmssocn;Trusted_Connection=yes"

when you failover db from sq01 to sq02, your connection will be failed, but after you reconnect it without changing any connection string,  you can still get the data which is from sq02. 

so even without specifying the failover partner name in the connection string, client can still access primary or mirror when failover, just like below
"Server=sq01;Failover Partner=; Database=mydb; Network=dbmssocn;Trusted_Connection=yes"

Now you know, the words "Failover Partner" is the key in the connection string

4. how about the "Server" keywords in the connection string? if it is wrong, can client still support failover?

A: "Server" is most important in the connection string, here has several situation. let's use the previous sample
primary is sq01, mirror is sq02, sq03 is not existed.

1. "Server=sq03;Failover Partner=sq02; Database=mydb; Network=dbmssocn;Trusted_Connection=yes" 
Failed, client can not connect to the server since 02 is mirror and not online.

2. "Server=sq03;Failover Partner=sq01; Database=mydb; Network=dbmssocn;Trusted_Connection=yes" 
successful, client connected 03 failed, then tried to connect failover partner which is sq01
but it doesn't support failover database from sq01 to sq02.  in other words, after client get data from sq01 successfully, if we failover db from sq01 to sq02, client will lost connection, and can not failover to sq02 automatically.

5. last question, just for interesting, if 2 db have same name on sq01 and sq02, both online, does it supports failover

A: No, it will not happen. let's say you have database mydb on sq01 and sq02, both are online without mirroring, you use the connection string below
"Server=sq01;Failover Partner=sq02; Database=mydb; Network=dbmssocn;Trusted_Connection=yes" 

then , if the db on sq01 is online, your client can get data from sq01, but if the db on sq01 is not accessible, you will get error below:
Exception calling "Open" with "0" argument(s): "Server sq02, database mydb is not configured for database mirroring."

so it tells us the client provider is aware of mirror :) don't want to cheat it. 

database mirror is phasing out from new sql server version, alwayson is new technology of HA now.

Wednesday, November 6, 2013

One Database Mirror disconnected case

Today I got a database mirror case, I was told the all databases were in disconnected\in recovery status on both primary and mirror server, it is a dev environment.

Here just share my troubleshooting steps:

1. First, I login primary and mirror server. I got the same database status on both 2 servers

so it showed to me:
       a) since all databases had problem, it should be a system level error, not single database issue
       b) database in "disconnected/In Recovery" status, it indicts the server not only lost communication with the partner server(primary or mirror), but also lost communication with witness. so sql server can not bring database online on either servers, or else it will cause split-brain problem.

the question now is : why the 3 servers lost communication with each other?

2. Check connection among 3 servers with telnet command
telnet [sqlservername] [port]

all 3 servers can connect each other, so the network should be fine, and endpoing port is opened, firewall should be fine.

3. Check SQL Server Errorlog
I checked the sql server errorlog on all 3 servers(primary, mirror, and witness), I found same error message below:
2013-11-06 23:07:47.70 Logon       Database Mirroring login attempt by user 'xxx\xxxxx-db.' failed with error: 'Connection handshake failed. The login 'xxx\xxxxx-db' does not have CONNECT permission on the endpoint. State 84.'.  [CLIENT: xx.xx.xx.xx]

so it should be permission account of account 'xxx\xxxxx-db'

4. Check Account 'xxx\xxxxx-db' in sql server
By checking account permission, it shows the account is sys admin on every sql servers. so there should be no permission issue.

in your case, if you find account is not sys admin, make sure you grant connect endpoint permission to that account.

so what's wrong here? 
Although the account is sys admin in sql server, but it doesn't mean the account is 100% ok, because it is window domain account, when using window authentication mode, sql server will ask window to authenticate the user account first(by NTLM or Kerberos), if windows authentication is failed, sql server will reject that login request.

so I tried to login sql server with the account 'xxx\xxxxx-db', this time I got error message below:

Good. we are closed to the answer. normally, you can find the detail information of login failed error in errorlog just like below:
2006-02-27 00:02:00.34 Logon     Error: 18456, Severity: 14, State: 8.
2006-02-27 00:02:00.34 Logon     Login failed for user '<user name>'. [CLIENT: <ip address>]

the most important information is "State" code:

2 and 5
Invalid userid
Attempt to use a Windows login name with SQL Authentication
Login disabled and password mismatch
Password mismatch
Invalid password
11 and 12
Valid login but server access failure
SQL Server service paused
Change password required

But unfortunately, this time I didn't find anything from errorlog! then how to capture the detail information? you are right, Profiler trace!

I run the Profiler trace on sql server, and enabled the "Audit login failed" event only. finally I got the detail error message below

by searching the error message, I found the link below

It prove my guess that the window authentication has problem, in other words, the Account should have problem. 

what is the next step?

5. Check if there was any changes made before the error occurred
I am not a AD expert, troubleshooting the window account authentication is little bit hard for me, I can do it, but it might take several hours or days to troubleshooting it. so the easiest way is checking with the guy who report this issue to me, or ask AD expert to look into it. 

I asked the engineer who report this issue : is there any change they made on the account 'xxx\xxxxx-db' before the error occurred? suddenly they seems to remember something, and told me wait a minute, then they come back to me and said the issue has been solved, the mirror is recovered  now. I went back to the server, it is true, all database are in mirror synchronized status.

I asked them what happen, they told they made some changes on that account(like pull the account out of some domain groups), if I didn't mention the account to them, they would never think about the problem was caused by their change. so problem is solved. 

Despite no root cause here, we have some steps to narrow down the problem, and find a fastest way to solve it.