Why Are You Talking To THAT Domain Controller!?

I was in Salt Lake City most of this week. Being surrounded by stark snow-covered mountains made for some wonderful scenery... it could not be more different than it is here in Texas. Plus I got to meet and greet with a bunch of Novell and NetIQ people. And eat an enormous bone-in ribeye that no human being has any business eating in one sitting.

But anyway, here's a little AD mystery I ran in to a couple weeks ago, and it may not be as simple as you first think.

As Active Directory admins, something we're probably all familiar with is member servers authenticating with the "wrong" domain controller. By wrong, I mean a DC that is in a different site than the member server, when there's a perfectly fine DC right there in the same site as the member server, and so the member server is incurring cross-site communication when it doesn't need to be. Everything might still function well enough as long as the communication between DC and member server is successful, but now you're saturating your slower inter-site WAN links with AD traffic when you don't need to be. You should want your AD replication, group policy application, DFS referrals, etc., to run like a well-oiled machine.

I often work in a huge environment with AD sites in many countries and on multiple continents, and thousands of little /26 subnets that can't always be easily grouped into a predictable supernet for the purposes of linking subnets to sites in AD Sites & Subnets. So I'm always alert to the fact that if I log on to a server, and I notice that logon takes an abnormally long time, I very well could be logging on to the wrong DC. First, I run set log to see which DC I have logged on to:

set log*DC01 is in Amsterdam*

So in this case, I noticed that while I had logged on to a member server in Dallas, that server's logon server was a DC in Europe. :(

You immediately think "The server's IP subnet isn't defined in AD Sites & Services or is associated to the wrong site," don't you?  Yeah, me too. So I went and checked. Lo and behold, the server's IP subnet was properly defined and associated to the correct site in AD.

Now we have a puzzle. Back on the member server, I run nltest /dsgetsite to verify that the domain member does know to which site it belongs. (Which the domain member's NetLogon service stores in the registry in the DynamicSiteName value once it's discovered.)

I also ran nltest /dsgetdc:domain.com /Account:server01$ to essentially emulate the DC locator and selection process for that server, which basically just confirmed what we already knew:

C:\Users\Administrator>nltest /dsgetdc:domain.com /Account:server01$ 
           DC: \\DC01.DOMAIN.COM (In Amsterdam) 
      Address: \\ 
     Dom Guid: blah-blah-blah 
     Dom Name: DOMAIN.COM 
  Forest Name: DOMAIN.COM 
 Dc Site Name: Amsterdam 
Our Site Name: Arlington 
The command completed successfully

So where do we look next if there's no problem with the IP subnets in AD Sites & Services?  I'm going with DNS. We know that domain controllers register site-specific SRV records so that clients who know to which site they belong will know what DNS query to make to find domain controllers specific to their own site.  So what DNS records did we find for the Arlington site?

Forward Lookup Zones
                    _kerberos SRV NewYorkDC
                    _kerberos SRV SanDiegoDC
                    _kerberos SRV MadridDC
                    _kerberos SRV ArlingtonDC
                    _ldap     SRV NewYorkDC
                    _ldap     SRV SanDiegoDC
                    _ldap     SRV MadridDC
                    _ldap     SRV ArlingtonDC

OK, now things are getting weird.  All of these other domain controllers that are not part of the Arlington site have registered their SRV records in the Arlington site.  The only way I can imagine that happening is because of Automatic Site Coverage, whereby domain controllers will register their own SRV records into sites where it is detected that the site has no domain controllers of its own... combined with the fact that scavenging is turned off for the DNS server, including the _msdcs zone.  So someone, once upon a time, must have created the Arlington site in AD before the actual domain controllers for Arlington were ready.  What's more is that Automatic Site Coverage is supposed to intelligently use site link costing so that only the domain controllers in the next closest site provide "coverage" for the site with no DCs, not every DC in the domain. Turns out the domain did not have a site link strategy either - it used DEFAULTIPSITELINK for everything - the entire global infrastructure. So even after Arlington did get some domain controllers, the SRV records from all the other DCs stayed there because of no scavenging.

Here's the thing though - did you notice that almost every other domain controller in the domain had SRV records registered in the Arlington site, except for the domain controller in Amsterdam that our member server actually authenticated to!?

This is getting kinda' nuts.  So what else, besides the DNS query, does a member server perform in order to locate a suitable domain controller?

So after a client does a DNS query for _ldap._tcp.SITENAME._sites.ForestDnsZones.domain.com, and gets a response, the client then begins to do LDAP queries against the DCs given in the DNS response to make sure that the DCs are alive and servicing requests. If you want to see this for yourself, I recommend starting Wireshark, and then restarting the NetLogon service while the capture is running. If it turns out that none of the DCs in the list that was returned by the site-specific DNS query is responding to your LDAP queries, then the client has to back up and try again with a domain-wide query.

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. So after that failed, the server was forced to try again with a domain-wide query, where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.

Moral of the story: Always blame the network guys.


Comments (10) -

Hi !

We've exactly the same problem, what was the solution to fix that ?

Thank you !

In this case, even though the sites and subnets were configured correctly in AD, the client did not have network connectivity to the DC in its own site because of a poorly configured firewall, so the client was doing what it was designed to do and going outside of its site to find a DC.

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. "So after that failed, the server was forced to try again with a domain-wide query", where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.

When you say, "So after that failed, the server was forced to try a domain-wide query". What did it look up specifically for a "domain-wide" query?

The client queries for the name _LDAP._TCP.dc._msdcs.domainName and receives a list of all domain controllers in the domain, essentially in random order.  It then walks the list, attempting LDAP binds with each server in the list until it finds one it can talk to.

I love the moral of the story Smile

Nice article

I love that you are the only blog/forum post on the web who has defined the exact issue I'm having in my organization. However, mine only affects one or two clients in a site each time and it's random as to which machines have this issue. It does not affect every client, which makes this difficult to pin on the network guys. If I ever find the solution to my problem I will hopefully remember to post back here. It's definitely not a DNS issue and the server in that site never has issues, it's just the clients. And we've combed through AD S+S and GPO as deep as possible, unable to find any hints.

This has been very helpful on where to look on troubleshooting this issue.
In our case I'm seeing several references in the Forward and Reverse Lookup zones to domain controllers that were taken offline when we upgraded our domain from 2003 to 2008R2.  I attempted to scavenge them but it did not work. Could I manually delete those records from our DNS servers safely?


Yes, you can and should delete DNS resource records for domain controllers that no longer exist.  Scavenging will not take care of all of these records because of the records are static and scavenging won't touch static records.

Do you have an idea how to prevent member servers' or clients' ldap connection requests to a momentarily unavailable dc and instead redirect them to "only" available dcs. One possibility would be putting those unavailable dcs to their own /32 subnets creating according sitelinks and increasing the costs of those site links so that they won't be preferred by the clients. This, however, could be a tedious task for the admins and by an emergency case not the best alternative to proceed on.

Comments are closed