Tuesday 18 February 2014

DFSR - picking up where someone else left off

Wow, it's been a while since I last posted, but with four kids, a full-time job and trying to increase my knowledge through self-study, it's hard to keep this active.

Straight to the point... We took over the IT for a two-branch estate agent down the road from us. They are all using fairly old computers (mostly Vista! - cringe with me) and an SBS 2008 on some crappy hardware, with a massive lack of updates and no backup to speak of: the offsite online backup wasn't working and hadn't been for months. The previous company was a large national IT support company, which is supposedly excellent and had local IT staff dedicated to each area ...

One of the jobs I've had was to make DFSR work between their head office and the other office on the other side of the city, which had a local Windows 2008 R2 server, over a slow DSL connection.

After days and days of hitting DFSR with everything I could throw at it, I have now got perfect replication going. This thing replicates changes over a slow-ass link in no time at all.

I'll assume you know a thing or two about Windows Server and administering it.
I'll assume you've used something like "dfsrdiag backlog /rgname:server.local\company\share /rfname:share /sendingmember:Svr01 /receivingmember:Svr02 >>Backlog.txt" to figure out how large your backlog is and that you do indeed have a problem.
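
If you run that command repeatedly, Backlog.txt grows with every run (">>" appends). Below is a minimal Python sketch for pulling the backlog counts back out of the file so you can see whether the number is going down. The function names are made up for illustration, and the two line formats it looks for ("Backlog File Count: N" and "No Backlog ...") are my assumption about what dfsrdiag prints, so check them against your own Backlog.txt first.

```python
import re

# Assumed line format, e.g. "Member <SVR02> Backlog File Count: 1234".
# Check it against your own Backlog.txt before relying on this.
COUNT_RE = re.compile(r"Backlog File Count:\s*(\d+)", re.IGNORECASE)
# Assumed wording for a run where the members are already in sync.
NO_BACKLOG_RE = re.compile(r"No Backlog", re.IGNORECASE)


def backlog_counts(text):
    """Return one backlog count per dfsrdiag run found in the text."""
    counts = []
    for line in text.splitlines():
        match = COUNT_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
        elif NO_BACKLOG_RE.search(line):
            counts.append(0)
    return counts


def read_backlog_file(path="Backlog.txt"):
    # errors="replace" keeps the parser going if file names in the
    # output contain bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return backlog_counts(handle.read())
```

Calling read_backlog_file() in the folder where you ran dfsrdiag gives a list with one number per run, oldest first.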

I'll list below the things which I found to aid in solving the replication issues.

  • Hotfixes, hotfixes, hotfixes! Install all those you can. There's a nice long list here; try to install them in chronological order, from oldest to newest (that might not be necessary, but I did it that way as I wanted to be thorough):
    http://support.microsoft.com/kb/968429 
  • Check the size of the Staging Area. If it's ridiculously large, decrease it. This one was set to 70 GB and the staging folder was using 60 GB in actual size!!! Do this using the DFS Snap-in.
  • Exclude the hidden "dfsrPRIVATE" folder from AV scans, and maybe have a look around and exclude the files themselves in case your AV still tries to scan them. Apparently some AV programs don't always abide by these exclusions, so it might be worth disabling the AV temporarily as a test. Do be careful and enable it again as soon as possible, or switch to an AV product which does play well with DFSR.
  • Clear out the Conflicts folder manually, using the info from this blog:
    http://blogs.technet.com/b/askds/archive/2008/10/06/manually-clearing-the-conflictanddeleted-folder-in-dfsr.aspx
  • Exclude Thumbs.db from being replicated, using the DFS Snap-in. We had well over 3000, some of which were sitting in the backlog.
  • Enable RDC if it is disabled and you're replicating over a slow link. If you're on a fast link, disable it as this can increase replication speed of smaller files. If you're not sure, leave it as is and continue reading up on DFSR and maybe toggle it every few hours/days to see which works best for you in your environment.
  • General server health, check everything. If the server is running like a dog and doing too much other stuff, then you're going to have a bad time. Move the other roles over to a different server, to free up your DFSR server.
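
For the Thumbs.db point in the list above, it can help to know how many of them you actually have before you add the exclusion. This is a rough Python sketch that walks a folder tree and lists every Thumbs.db with its size. The function name is made up, and it assumes you point it at the local path of the replicated folder on the server.

```python
import os


def find_thumbs(root):
    """Return (path, size_in_bytes) for every Thumbs.db under root."""
    hits = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            # Windows file names are case-insensitive, so compare lowercased.
            if name.lower() == "thumbs.db":
                full_path = os.path.join(folder, name)
                hits.append((full_path, os.path.getsize(full_path)))
    return hits
```

len(find_thumbs(path)) gives the file count, and summing the sizes shows how much pointless data DFSR would otherwise be trying to push over the link.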

I also made a bunch of registry tweaks to both servers, following the guide from the link below:
http://blogs.technet.com/b/askds/archive/2010/03/31/tuning-replication-performance-in-dfsr-especially-on-win2008-r2.aspx

As well as enabling automatic resuming after an "unexpected shutdown" as per:

Also worth a read is the top 10 common causes of slow replication with DFSR:
http://blogs.technet.com/b/askds/archive/2007/10/05/top-10-common-causes-of-slow-replication-with-dfsr.aspx

and common DFSR configuration mistakes and oversights:

There's a nice 120-page document on performance monitoring DFSR:
http://download.microsoft.com/download/3/2/A/32A70368-1457-4972-8CDD-08A496198361/Perf-tun-srv-R2.docx

You might also want to review the "File Services" hotfix list, which includes NTFS and SMB hotfixes:
http://support.microsoft.com/kb/2473205/en-us


Some of the errors I've seen in the DFS Replication event logs are listed below:


Log Name:      DFS Replication
Source:        DFSR
Date:          23/01/2014 19:01:06
Event ID:      5002
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SVR02.company.local
Description:
The DFS Replication service encountered an error communicating with partner SVR01 for replication group company.local\company\docs.

Partner DNS address: SVR01.company.local

Optional data if available:
Partner WINS Address: SVR01
Partner IP Address: 192.168.1.1

The service will retry the connection periodically.

Additional Information:
Error: 9036 (Paused for backup or restore)
Connection ID: ABB8F2AF-****-4A7E-A7D6-9BD3D37B7777
Replication Group ID: 5BFF9BB5-****-4B52-98B3-F87429A3AAAA




Log Name:      DFS Replication
Source:        DFSR
Date:          23/01/2014 19:00:50
Event ID:      5014
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      SVR02.company.local
Description:
The DFS Replication service is stopping communication with partner SVR01 for replication group company.local\company\docs due to an error. The service will retry the connection periodically.

Additional Information:
Error: 9033 (The request was cancelled by a shutdown)
Connection ID: ABB8F2AF-****-4A7E-A7D6-9BD3D37B7777
Replication Group ID: 5BFF9BB5-****-4B52-98B3-F87429A3AAAA




Lastly, if you're on the Windows Server 2003 platform, upgrade. Do it now. Move on. Microsoft have made massive changes in each DFSR iteration; even from 2008 to 2008 R2 they made huge changes to the performance. I trust the advancements in 2012/2012 R2 have been excellent too. Here's a blog post from Ned about the changes from 2008 R2 to 2012:

and the changes in 2012 R2 DFSR:

