Why Are You Talking To THAT Domain Controller!?

I was in Salt Lake City most of this week. Being surrounded by stark snow-covered mountains made for some wonderful scenery... it could not be more different than it is here in Texas. Plus I got to meet and greet with a bunch of Novell and NetIQ people. And eat an enormous bone-in ribeye that no human being has any business eating in one sitting.

But anyway, here's a little AD mystery I ran into a couple of weeks ago, and it may not be as simple as you first think.

As Active Directory admins, something we're probably all familiar with is member servers authenticating with the "wrong" domain controller. By wrong, I mean a DC that is in a different site than the member server, when there's a perfectly fine DC right there in the same site as the member server, and so the member server is incurring cross-site communication when it doesn't need to be. Everything might still function well enough as long as the communication between DC and member server is successful, but now you're saturating your slower inter-site WAN links with AD traffic when you don't need to be. You should want your AD replication, group policy application, DFS referrals, etc., to run like a well-oiled machine.

I often work in a huge environment with AD sites in many countries and on multiple continents, and thousands of little /26 subnets that can't always be easily grouped into a predictable supernet for the purposes of linking subnets to sites in AD Sites & Services. So I'm always alert to the fact that if I log on to a server, and I notice that logon takes an abnormally long time, I very well could be logging on to the wrong DC. First, I run set log to see which DC I have logged on to:

set log
*DC01 is in Amsterdam*

So in this case, I noticed that while I had logged on to a member server in Dallas, that server's logon server was a DC in Europe. :(

You immediately think "The server's IP subnet isn't defined in AD Sites & Services or is associated with the wrong site," don't you? Yeah, me too. So I went and checked. Lo and behold, the server's IP subnet was properly defined and associated with the correct site in AD.

Now we have a puzzle. Back on the member server, I run nltest /dsgetsite to verify that the domain member does know to which site it belongs. (Which the domain member's NetLogon service stores in the registry in the DynamicSiteName value once it's discovered.)

I also ran nltest /dsgetdc:domain.com /Account:server01$ to essentially emulate the DC locator and selection process for that server, which basically just confirmed what we already knew:

C:\Users\Administrator>nltest /dsgetdc:domain.com /Account:server01$ 
           DC: \\DC01.DOMAIN.COM (In Amsterdam) 
      Address: \\ 
     Dom Guid: blah-blah-blah 
     Dom Name: DOMAIN.COM 
  Forest Name: DOMAIN.COM 
 Dc Site Name: Amsterdam 
Our Site Name: Arlington 
The command completed successfully

So where do we look next if there's no problem with the IP subnets in AD Sites & Services?  I'm going with DNS. We know that domain controllers register site-specific SRV records so that clients who know to which site they belong will know what DNS query to make to find domain controllers specific to their own site.  So what DNS records did we find for the Arlington site?

Forward Lookup Zones
                    _kerberos SRV NewYorkDC
                    _kerberos SRV SanDiegoDC
                    _kerberos SRV MadridDC
                    _kerberos SRV ArlingtonDC
                    _ldap     SRV NewYorkDC
                    _ldap     SRV SanDiegoDC
                    _ldap     SRV MadridDC
                    _ldap     SRV ArlingtonDC

OK, now things are getting weird.  All of these other domain controllers that are not part of the Arlington site have registered their SRV records in the Arlington site.  The only way I can imagine that happening is because of Automatic Site Coverage, whereby domain controllers will register their own SRV records into sites where it is detected that the site has no domain controllers of its own... combined with the fact that scavenging is turned off for the DNS server, including the _msdcs zone.  So someone, once upon a time, must have created the Arlington site in AD before the actual domain controllers for Arlington were ready.  What's more is that Automatic Site Coverage is supposed to intelligently use site link costing so that only the domain controllers in the next closest site provide "coverage" for the site with no DCs, not every DC in the domain. Turns out the domain did not have a site link strategy either - it used DEFAULTIPSITELINK for everything - the entire global infrastructure. So even after Arlington did get some domain controllers, the SRV records from all the other DCs stayed there because of no scavenging.
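Leftover registrations like these are easy to miss by eye in a big zone. Here is a minimal Python sketch of the comparison, assuming you have already exported two plain lists by hand: the SRV targets registered under the site, and the DCs that AD Sites & Services says actually live there. The function name and the list-based workflow are hypothetical, made up for illustration.

```python
def find_stale_site_records(srv_targets, dcs_in_site):
    """Return SRV targets registered under a site that are not DCs in that site.

    srv_targets: host names found in the site's _ldap/_kerberos SRV records.
    dcs_in_site: host names of the DCs that belong to the site.
    Both inputs are plain lists gathered by hand (hypothetical workflow).
    """
    def normalize(name):
        # DNS names are case-insensitive and may carry a trailing dot.
        return name.strip().rstrip(".").lower()

    known = {normalize(name) for name in dcs_in_site}
    seen = set()
    stale = []
    for target in srv_targets:
        key = normalize(target)
        if key in known or key in seen:
            continue  # legitimate DC, or a duplicate we already reported
        seen.add(key)
        stale.append(target.strip())
    return stale


if __name__ == "__main__":
    # Record targets as listed for the Arlington site above.
    arlington_srv = ["NewYorkDC", "SanDiegoDC", "MadridDC", "ArlingtonDC"]
    for name in find_stale_site_records(arlington_srv, ["ArlingtonDC"]):
        print("Registered in Arlington but not an Arlington DC:", name)
```

Anything this reports is a candidate for cleanup, not proof of a problem: a DC that legitimately covers a site with no DCs of its own would show up here too.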

Here's the thing though - did you notice that almost every other domain controller in the domain had SRV records registered in the Arlington site, except for the domain controller in Amsterdam that our member server actually authenticated to!?

This is getting kinda' nuts.  So what else, besides the DNS query, does a member server perform in order to locate a suitable domain controller?

So after a client does a DNS query for _ldap._tcp.SITENAME._sites.dc._msdcs.domain.com, and gets a response, the client then begins to do LDAP queries against the DCs given in the DNS response to make sure that the DCs are alive and servicing requests. If you want to see this for yourself, I recommend starting Wireshark, and then restarting the NetLogon service while the capture is running. If it turns out that none of the DCs in the list that was returned by the site-specific DNS query is responding to your LDAP queries, then the client has to back up and try again with a domain-wide query.

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. So after that failed, the server was forced to try again with a domain-wide query, where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.
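The fallback described above can be modeled in a few lines. This is a simplified Python sketch, not the real locator: it ignores SRV priority and weight, caching, and randomization, and the resolve_srv and ldap_ping callables are hypothetical stand-ins for the real DNS query and the port 389 check. The SRV names follow the _ldap._tcp.<site>._sites.dc._msdcs.<domain> pattern.

```python
def site_srv_name(site, domain):
    """SRV name a client queries for DCs in its own site."""
    return f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}"


def domain_srv_name(domain):
    """SRV name for the domain-wide list of DCs."""
    return f"_ldap._tcp.dc._msdcs.{domain}"


def locate_dc(site, domain, resolve_srv, ldap_ping):
    """Return (dc, scope) for the first responsive DC, or (None, None).

    resolve_srv(name) -> list of DC host names registered under an SRV name.
    ldap_ping(host)   -> True if the DC answers on port 389.
    Both callables are hypothetical stand-ins for real network calls.
    """
    attempts = (
        ("site", site_srv_name(site, domain)),
        ("domain", domain_srv_name(domain)),
    )
    for scope, name in attempts:
        for dc in resolve_srv(name):
            if ldap_ping(dc):
                return dc, scope
    return None, None


if __name__ == "__main__":
    # Toy data mirroring the story: every DC listed for the site is unreachable.
    records = {
        site_srv_name("Arlington", "domain.com"):
            ["NewYorkDC", "SanDiegoDC", "MadridDC", "ArlingtonDC"],
        domain_srv_name("domain.com"):
            ["NewYorkDC", "SanDiegoDC", "MadridDC", "ArlingtonDC", "DC01"],
    }
    reachable = {"DC01"}
    print(locate_dc("Arlington", "domain.com",
                    lambda name: records.get(name, []),
                    lambda host: host in reachable))
```

With the toy data, the site-specific pass finds nothing that answers, so the domain-wide pass settles on DC01, which is the same shape as what server01 was doing.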

Moral of the story: Always blame the network guys.


An Interesting DFSR Change (that probably everyone knew about but me)

First, wooo I hit 5K on ServerFault today.

I'm embarrassed to say that something I read about recently but didn't pay enough attention to at the time officially just bit me in the butt.  

A significant change occurred in January 2012 in the way that DFS Replication behaves.  Windows Server 2008 R2 SP1 post KB2663685 and Windows Server 2012 have changed the default behavior of DFSR.  Auto-recovery of DFSR replicated folders after unexpected shutdown is now disabled.  In other words, if a computer that hosts a DFSR replicated folder experiences an unexpected shutdown, DFSR will not automatically resume replication upon reboot.  (This includes Sysvol!)

On older versions of Windows, DFSR auto-recovery was enabled.  I'm sure the reason for this change involves auto-recovery leading to unexpected rollbacks and non-authoritative conflict resolutions between replication partners, especially in widespread domains with high end-to-end replication latency and frequent changes... but even though the news was published, I for one didn’t pay enough attention to it, and it has a very real effect on the way we manage our Windows systems that utilize DFSR going forward.

So what if a domain controller or a file server with a DFSR share on it running 2008R2 or 2012 crashes unexpectedly, leaving the DFS database and the NTFS USN journal out of sync?  Then Sysvol no longer receives updates on that DC.  The DFSR file share no longer receives updates on that file server.  It's up to you to manually restart replication, and to resolve any conflicts with replication partners if changes took place during the time that the crashed server wasn't replicating. 

Luckily that is easy to do, and it's also possible to set the behavior back to auto-recovery if that is what you wish. 

How will I know if this affects my server?

While this first example is just a symptom of the problem, here is how it first came to my attention, triggering the investigation:


[Screenshot: errors in my event log]

Application of Group Policy was failing, but only on DC02 and on servers that were using DC02 as their domain controller.  Not on DC01 or on any server authenticated by DC01.  As it turns out, the GPO referenced by that error event, a new GPO that I had just created on DC01, didn’t exist on DC02, hence the errors.  Sysvol did not seem to be replicating anymore.

Here is the actual event log event to let you know that DFS Replication has stopped on one or more volumes:

[Screenshot: DFSR error event]

Luckily, starting replication back up again is easy, and the command to do it, complete with your actual volume GUID, is right there in the event:

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="12345678-ABCD-1234-EFGH-1A2B3C4E5F" call ResumeReplication
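If you have more than one stopped volume to deal with, it can help to generate that command line rather than retype it. Below is a small Python sketch that only builds the string for a given volume GUID and refuses malformed input; it does not run anything. The function name is hypothetical, and the command text is simply the one shown above.

```python
import uuid


def resume_replication_command(volume_guid):
    """Build the wmic ResumeReplication command line for one volume GUID.

    Raises ValueError if volume_guid is not a well-formed GUID.
    """
    # uuid.UUID validates the text; str() gives the canonical hyphenated form,
    # which is upper-cased here purely for readability.
    guid = str(uuid.UUID(volume_guid.strip().strip("{}"))).upper()
    return (
        r"wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig "
        f'where volumeGuid="{guid}" call ResumeReplication'
    )


if __name__ == "__main__":
    # Made-up GUID for illustration; use the one from your own event.
    print(resume_replication_command("{12345678-abcd-1234-ef01-1a2b3c4d5e6f}"))
```

Note that the sample GUID printed in the command above is only a placeholder (it contains non-hex characters), so this function would reject it with a ValueError. Paste in the GUID from your own event instead.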

You can also turn auto-recovery back on, either with wmic or by modifying the registry, if you don’t have the time to be bothered by this:

HKLM\System\CurrentControlSet\Services\DFSR\Parameters\StopReplicationOnAutoRecovery = 0

Just be aware that auto-recovery can lead to unwanted rollbacks of DFSR data in some circumstances.

GPO Application Precedence - "Just Because You Can" Edition

This one really gets back to my roots as a big fan of everything related to Active Directory and Group Policies. Someone had a question yesterday about GPO application that, I admit, gave me pause. It would have been an excellent question for an MCITP exam or a sysadmin interview.

It's also an example of a GPO strategy that might be too complicated for its own good.

The basic behavior of Group Policy application order is well-known by almost every Windows admin. Let's review:

  • Local policies are applied first.
  • Then policies linked at the site level.
  • Then policies linked at the domain level.
  • Then GPOs linked to OUs in order such that policies linked to "higher" OUs apply first, and the policies linked "closest" to the object go last.
  • If multiple GPOs are linked at the same level, they are processed from the bottom of the list up, so the GPO with Link Order 1 is applied last and wins.
  • Last writer wins, i.e., each subsequent GPO overwrites any conflicting settings defined in earlier GPOs. Settings that do not conflict are merged.
  • Enforce (formerly known as No Override), Block Inheritance and Loopback Processing can be used at various levels of the aforementioned hierarchy in various combinations to alter the behavior of GPO application.

So that seems like a pretty simple system, but it's just flexible enough that you can get into some confusing situations with it. For instance, take the following OU structure:

(OU)All Servers
       +--(OU)Terminal Servers

The Terminal Servers OU is a Sub-OU of the All Servers OU. Now, let's link two different policy objects to each of the OUs:

(OU)All Servers [Servers_GPO]
       +--(OU)Terminal Servers [TS_GPO]

So using what we know, we assume that a computer object in the Terminal Servers OU will get all the settings from Servers_GPO, and then it will receive settings from TS_GPO, which will overwrite any conflicting settings from Servers_GPO.

Now let's put the Enforced flag on Servers_GPO:

(OU)All Servers [Servers_GPO-ENFORCED]
       +--(OU)Terminal Servers [TS_GPO]

Now the settings in Servers_GPO will win, even if they conflict with settings in TS_GPO. But let's go one step further. What happens if you also Enforce TS_GPO?

(OU)All Servers [Servers_GPO-ENFORCED]
       +--(OU)Terminal Servers [TS_GPO-ENFORCED]

Which GPO will win?  Had I been taking a Microsoft exam, I might have had to flip a coin. I have to admit, I had never considered this scenario. If neither policy was enforced, we know TS_GPO would win. If Servers_GPO was enforced and TS_GPO was not enforced, then we know Servers_GPO would win. But what about now?

And furthermore, why would anyone want to do that? I can't explain what goes on in some administrators' heads when they're planning these things out, but luckily I did have TechNet at my disposal:

You can specify that the settings in a GPO link should take precedence over the settings of any child object by setting that link to Enforced. GPO-links that are enforced cannot be blocked from the parent container. Without enforcement from above, the settings of the GPO links at the higher level (parent) are overwritten by settings in GPOs linked to child organizational units, if the GPOs contain conflicting settings. With enforcement, the parent GPO link always has precedence. By default, GPO links are not enforced.

So with that, we should be able to surmise that the parent GPO - Servers_GPO - will win. A little testing confirmed it - the higher-level GPO takes precedence over a lower-level GPO even when they're both enforced.
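To keep the rules straight, here is a toy Python model of the ordering, covering only what is discussed in this post: hierarchy level, Link Order, and Enforced. It deliberately leaves out Block Inheritance, loopback, and security or WMI filtering. The GpoLink class and its depth field are made up for illustration, and the tie-break among enforced links in the same container (Link Order 1 still wins) is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpoLink:
    name: str
    depth: int            # larger = linked closer to the computer/user object
    link_order: int = 1   # 1 = highest precedence within one container
    enforced: bool = False


def processing_order(links):
    """Return GPO names from first applied to last applied (last writer wins)."""
    # Normal links: top of the hierarchy first, closest OU last.
    normal = sorted(
        (link for link in links if not link.enforced),
        key=lambda link: (link.depth, -link.link_order),
    )
    # Enforced links are applied after everything else, in the opposite
    # direction, so the highest-level enforced link is written last.
    enforced = sorted(
        (link for link in links if link.enforced),
        key=lambda link: (-link.depth, -link.link_order),
    )
    return [link.name for link in normal + enforced]


def winner(links):
    """Name of the GPO whose value survives for a conflicting setting."""
    return processing_order(links)[-1]


if __name__ == "__main__":
    both_enforced = [
        GpoLink("Servers_GPO", depth=0, enforced=True),
        GpoLink("TS_GPO", depth=1, enforced=True),
    ]
    print(processing_order(both_enforced), "->", winner(both_enforced))
```

Run against the four combinations in this post, the model agrees with the testing: TS_GPO wins only when Servers_GPO is not enforced.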

I might call this one of those "just because you can, doesn't mean you should" sort of administrative practices.

SQL Server - Unable to Generate SSPI Context

The different sorts of authentication mechanisms in play in a Windows network can be pretty complex.  So when someone asked me, "Why do I get a 'Could not generate SSPI context' error when I try to log in to a SQL server?" I knew that there could be several answers to that question.  Go ahead and Google it yourself -- you won't get a definite *This is absolutely your problem* sort of answer.

First I remembered that there was a situation where RSA SecurID tokens (essentially certificates for our purposes) were used for various authentication tasks in the domain, and if one tried to authenticate to a SQL Server with Windows authentication without having one's RSA token plugged in, the "Could not generate SSPI context" error would be generated. Plug the SecurID device into a USB slot, and you'd log in to the SQL server just fine. But I knew that policy was not an issue in this situation...

Then I thought about how services not having their SPNs registered with Active Directory can cause authentication problems.  Specifically, if a SQL Server doesn't have its SPN registered properly in Active Directory, Kerberos authentication cannot be used.  But that still shouldn't prevent you from authenticating whatsoever... it'll just drop you down to NTLM instead of Kerberos.

Also, I was able to perform a logon with Windows auth to the same SQL Server at the same privilege and security level as the user, so I knew it had to be something at their end.

The only other thing I could think of was that something was just wrong with their security token that was confusing their SSPI.  Maybe it was corrupt somehow?  I'm not sure.  So, I recommended that the user run "klist purge" to purge all of their cached Kerberos tickets, knowing that the domain controller would issue fresh ones as soon as they requested access to a domain resource...

Bingo.  Problem solved.

Group Policy Preferences Passwords Continued

For the original post, see here.

So in yesterday's post, I mentioned that this guy wrote a neat tutorial and PowerShell script called Get-GPPPassword.ps1 that will decipher the passwords in a valid Groups.xml file.  You can find his scripts here. (The PowerSploit repository on GitHub.)  I wrote an additional function to go inside of Get-GPPPassword.ps1 this morning.  The purpose of the new function is to automatically search your own domain for Groups.xml files, and use Get-GPPPassword on them.  This can be handy for finding all the Groups.xml files as quickly as possible, especially in a domain with lots of policies.  And especially if you're pressed for time.  It's very simple:

function Find-GPPPasswords
{
<#
.SYNOPSIS
Scan your own domain in search of valid Groups.xml files in SYSVOL. If found, use Get-GPPPassword on them.
Author: Ryan Ries (www.myotherpcisacloud.com)

.EXAMPLE
PS C:\> . .\Get-GPPPassword.ps1
PS C:\> Find-GPPPasswords
#>
	Write-Host "Now searching $Env:UserDNSDomain for Group Policy Preferences passwords..."
	# Recurse through the domain's SYSVOL share looking for Groups.xml files.
	$GroupsFiles = Get-ChildItem -Path "\\$Env:UserDNSDomain\SYSVOL" -Recurse -Include Groups.xml
	foreach ($File in $GroupsFiles)
	{
		# Hand each file's full path to Get-GPPPassword (dot-sourced beforehand).
		Get-GPPPassword -Path $File.FullName
	}
}
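If you only want to know which policies are affected, without deciphering anything, the same search can be sketched outside of PowerShell. The Python below is a rough, hypothetical equivalent: it walks a directory tree, opens every Groups.xml it finds, and reports the ones that carry a non-empty cpassword attribute. The function name is made up, and pointing it at your SYSVOL share is left to you.

```python
import os
import xml.etree.ElementTree as ET


def find_groups_xml_with_cpassword(root_dir):
    """Return sorted paths of Groups.xml files that carry a non-empty cpassword."""
    hits = []
    for folder, _subdirs, files in os.walk(root_dir):
        for file_name in files:
            if file_name.lower() != "groups.xml":
                continue
            path = os.path.join(folder, file_name)
            try:
                tree = ET.parse(path)
            except (ET.ParseError, OSError):
                continue  # skip unreadable or malformed files
            # Any element with a non-empty cpassword attribute counts as a hit.
            if any(element.get("cpassword") for element in tree.iter()):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    import sys

    # Usage: python find_gpp.py <path to a SYSVOL share or a local copy of it>
    for hit in find_groups_xml_with_cpassword(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(hit)
```

On a domain-joined Windows machine the argument would be the UNC path to SYSVOL, the same path the PowerShell function builds from $Env:UserDNSDomain.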