Monitoring with Windows Remote Management (WinRM) and Powershell Part I

Hey guys. I should have called this post "Monitoring with Windows Remote Management (WinRM), and Powershell, and maybe a Certificate Services tutorial too," but then the title would have definitely been too long. In any case, I poured many hours of effort and research into this one. Lots of trial and error. And whether it helps anyone else or not, I definitely bettered myself through the creation of this post.

I'm pretty excited about this topic. This foray into WinRM and Powershell Remoting was sparked by a conversation I had with a coworker the other day. He's a senior Unix engineer, so he obviously enjoys *nix and when presented with a problem, naturally he approaches it with the mindset of someone very familiar with and ready to use Unix/Linux tools.

I'm the opposite of that - I feel like Microsoft is the rightful king of the enterprise and usually approach problems with Windows-based solutions already in mind. But what's important is that we're both geeks and we'll both still happily delve into either realm when it presents an interesting problem that needs solving. There's a mutual respect there, even though we don't play with the same toys.

The Unix engineer wants to monitor all the systems using SNMP because it's tried and true and it's been around forever, and it doesn't require an agent or expensive third-party software. SNMP wasn't very secure or feature-rich at first so now they're on SNMPv3. Then there's WBEM. Certain vendors like HP have their own implementations of WBEM. I guess Microsoft wasn't in love with either and so decided to go their own way, as Microsoft is wont to do, hence why you won't find an out of the box implementation of SNMPv3 from Microsoft.

One nice thing about SNMP though, is that it uses one static, predictable port.

In large enterprise IT infrastructures, you're likely to see dozens of sites, hundreds (if not thousands) of subnets, sprinklings of Windows and Unix devices all commingled together... and you can't swing a dead cat without hitting a firewall which may or may not have some draconian port restrictions on it. Furthermore, in a big enterprise you're likely to see the kind of bureaucracy and separation of internal organizations such that server infrastructure guys can't just go and reconfigure firewalls on their own, network guys can't just make changes without running them by a "change advisory board" first, and it all basically just makes you want to pull your hair out while you wait... and wait, and wait some more. You just want to be able to communicate with your other systems, wherever they are.

Which brings us to WinRM and Powershell Remoting. WinRM, a component of Windows Hardware Management, is Microsoft's implementation of the multi-platform, industry-standard WS-Management protocol. (Like WMI is Microsoft's implementation of WBEM. Getting tired of the acronym soup yet? We're just getting started. You might also want to review WMI Architecture.) I used WinRM in a previous post, but only used the "quickconfig" option. Seems like most people rarely go any deeper than the quickconfig parameter.

Here's an excerpt from a Technet doc:

"WinRM is Microsoft's implementation of the WS-Management protocol, a standard Simple Object Access Protocol (SOAP)-based, firewall-friendly protocol that enables hardware and operating systems from different vendors to interoperate. You can think of WinRM as the server side and WinRS the client side of WS-Management."

The phrase that especially made my ears perk up is "firewall-friendly." You see, Windows has a long history with things like RPC and DCOM. Those protocols have been instrumental in many awesome distributed systems and tool sets throughout Microsoft's history. But it just so happens that these protocols are also probably the most complex, and most firewall-unfriendly, protocols around. It's extremely fortuitous then that Ned over at AskDS just happened to write up a magnificent explication of Microsoft RPC. (Open that link in a background tab and read it after you're done here.)

Here's the thing - what if I want to remotely monitor or interact with a machine in another country, or create a distributed system that spans continents? There are dozens of patchwork networks between the systems. Each packet between the systems traverses firewall after firewall. Suddenly, protocols such as RPC are out the window. How am I supposed to get every firewall owner from here to Timbuktu to let my RPC and/or DCOM traffic through?

That's why monitoring applications like SCOM or NetIQ AppManager require the installation of agents on the machines. They collect the data locally and then ship it to a central management server using just one or two static ports. Well, they do other more complex stuff too that requires software be installed on the machine, but that's beside the point.

Alright, enough talk. Let's get to work on gathering performance metrics remotely from a Windows server. There are a few scenarios to test here. One is communications within the boundaries of an Active Directory domain, and the other is communications with an external, non-domain machine. After that, we'll explore SSL authentication and encryption.

The first thing you need to do is set up and configure the WinRM service. One important thing to remember is that just starting the WinRM service isn't enough - you still have to explicitly create a listener. In addition, like most things SSL, it requires a certificate to properly authenticate and encrypt data. Run: 

winrm get winrm/config

to see the existing default WinRM configuration:

WinRM originally used port 80 for HTTP and port 443 for HTTPS. With Win7 and 2k8R2, it has changed to use ports 5985 and 5986 respectively. But those are just defaults and you can change the listener(s) back to the old ports if you want. Or any port for that matter. Run:

winrm enumerate winrm/config/listener

to list the WinRM listeners that are running. You should get nothing, because we haven't configured any listeners yet. WinRM over SSL will not work with a self-signed certificate. It has to be legit. To quote:

"WinRM HTTPS requires a local computer "Server Authentication" certificate with a CN matching the hostname, that is not expired, revoked, or self-signed to be installed."

To set up a WinRM listener on your machine, you can run

winrm quickconfig

or

winrm quickconfig -transport:HTTPS

or even

winrm create winrm/config/listener?Address=*+Transport=HTTPS @{Port="443"}

Use "set" instead of "create" if you want to modify an existing listener. The @{} bit at the end is called a hash table and can be used to pass multiple values. The WinRM.cmd command line tool is actually just a wrapper for winrm.vbs, a VBScript file. The quickconfig option just runs a script that configures and starts the listener, starts the WinRM service and sets it to automatic, and creates some Windows Firewall exceptions. What's more, Powershell has many cmdlets that use WinRM, and the entire concept of Powershell Remoting is built on WinRM. So now that you know the fundamentals of WinRM and what's going on in the background, let's move ahead into using Powershell. In fact, you can emulate all of the same behavior of "winrm quickconfig" by instead running 


from within Powershell to set up the WinRM service. Now from another machine, fire up Powershell and try to use the WinRM service you just set up:

$dc01 = New-PSSession -ComputerName DC01
Invoke-Command -Session $dc01 -ScriptBlock { gwmi win32_computersystem }


You just pulled some data remotely using WinRM! The difference between using a "session" in Powershell, and simply executing cmdlets using the -ComputerName parameter, is that a session persists such that you can run multiple different sets of commands that all share the same data. If you try to run New-PSSession to connect to a computer on which you have not configured the WinRM service, you will get a nasty red error. You can also run a command on many machines simultaneously, etc. Hell, it's Powershell. You can do anything.
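To avoid that red error, it helps to confirm that a listener is actually answering before opening a session. The lines below are a minimal sketch, not necessarily the exact commands used for this post: Enable-PSRemoting is assumed here as the Powershell-side way to do what "winrm quickconfig" does, and DC01 is the server from the example above.

```powershell
# On the server (elevated prompt). Assumption: Enable-PSRemoting is the cmdlet
# used to start the service, create a listener and open the firewall.
Enable-PSRemoting -Force

# Still on the server: list the listeners that now exist.
Get-ChildItem WSMan:\localhost\Listener

# On the client: Test-WSMan only succeeds if a WinRM listener answers on DC01.
Test-WSMan -ComputerName DC01
```

If Test-WSMan fails, New-PSSession will fail too, so this is a cheap first check when troubleshooting.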

Alright so that was simple, but that's because we were operating within the safe boundaries of our Active Directory domain and all the authentication was done in the background. What about monitoring a standalone machine, such as SERVER1?

My first test machine:

  • Hostname: SERVER1 
  • IP: 
  • OS: Windows 2008 R2 SP1, fully patched, Windows Firewall is on
  • It's not a member of any domain

First things first: Launch Powershell on SERVER1. Run:

Set-ExecutionPolicy Unrestricted

Then set up your WinRM service and listener by running


and following the prompts. If the WinRM server (SERVER1) is not in your forest (it's not) or otherwise can't use Kerberos, then HTTPS/SSL must be used, or the destination machine must be added to the TrustedHosts configuration setting. Let's try the latter first. On your client, add the WinRM server to the "Trusted Hosts" list:
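A minimal sketch of this step, assuming the built-in WSMan: drive on the client and a local account on SERVER1 (the account name below is hypothetical):

```powershell
# On the client, from an elevated prompt.
# Note: -Value replaces whatever is already in the TrustedHosts list.
Set-Item WSMan:\localhost\Client\TrustedHosts -Value 'SERVER1' -Force
Get-Item WSMan:\localhost\Client\TrustedHosts

# Prompt for a local account on SERVER1 (hypothetical account name).
$cred = Get-Credential 'SERVER1\Administrator'

# Kerberos is not available for a non-domain machine, so Negotiate falls back to NTLM.
$server1 = New-PSSession -ComputerName SERVER1 -Credential $cred -Authentication Negotiate
Invoke-Command -Session $server1 -ScriptBlock { Get-WmiObject Win32_OperatingSystem }
```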

We just authenticated and successfully created a remote session to SERVER1 using the Negotiate protocol! Negotiate is basically "use Kerberos if possible, fall back to NTLM if not." So the credentials are passed via NTLM, which is not clear text, but it's not awesome either. You can find a description of the rest of the authentication methods here, about halfway down the page, if you need a refresher.

Edit 1/29/2012: It should be noted that even within a domain, for Kerberos authentication to work when using WinRM, an SPN for the service must be registered in AD. As an example, you can find all of the "WSMAN" SPNs currently registered in your forest with this command:

setspn -T yourForest -F -Q WSMAN/*

SPN creation for this should have been taken care of automatically, but you know something is wrong (and Kerberos will not be used) if there is no WSMAN SPN for the device that is hosting the WinRM service.

OK, I am pooped. Time to take a break. Next time in Part II, we're going to focus on setting up SSL certificates to implement some real security to wrap up this experiment!

BlogEngine.NET, SimpleCaptcha, and Spam

I use BlogEngine.NET for this blog. I've loved it so far. It suits me perfectly because I also love .NET and C#.

BlogEngine.NET comes with a few "extensions" out of the box, and one of those extensions is called SimpleCaptcha. You simply configure it with a question and an answer. Visitors who supply the correct answer get to post comments. This wards off most of the spammers. But from what I'm seeing, whatever spammers use to automatically crawl the web, leaving little spam-filled coprolites in their wake, seems to be able to solve simple mathematical equations like 5+5, 3+7, and even (5+2)-1. I changed my captcha challenge to that latter equation and received a spam comment not five seconds later.

Maybe this will stop them...

So I figured the next best thing to do, without annoying and frustrating my visitors too much with those really bizarre graphical captchas that you can't even read half the time, was to change my SimpleCaptcha to something that was still simple, but required slightly more human-like thinking than what I suspect most spambots are capable of. Questions such as "what is the opposite of cold" or "a shape with four equal sides." These sorts of questions have brought my comment spam to a screeching halt. But there's one last problem: SimpleCaptcha is case sensitive and there's no immediately apparent way to turn it off. I don't want a visitor to type "Square" and not get their comment posted because they needed to have typed "square" instead.

So, to remedy this problem, simply access your web server and browse to wherever you have IIS/BlogEngine.NET installed. Then drill down to where SimpleCaptcha is. For me, it's C:\inetpub\wwwroot\App_Code\Extensions\SimpleCaptcha\. Open up the file SimpleCaptchaControl.cs in a text editor (or Visual Studio if you'd rather), and find this method:

public void Validate(string simpleCaptchaChallenge)
{
   this.valid = this.skipSimpleCaptcha || this.simpleCaptchaAnswer.Equals(simpleCaptchaChallenge);
}

Simply change that one line to this:

public void Validate(string simpleCaptchaChallenge)
{
   this.valid = this.skipSimpleCaptcha || this.simpleCaptchaAnswer.Equals(simpleCaptchaChallenge, StringComparison.OrdinalIgnoreCase);
}

And you've just made your SimpleCaptcha not case-sensitive. The change takes effect as soon as you save the file; no restarts of anything are required.
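If you want to see the difference between the two comparisons without touching the blog, the same .NET String.Equals overloads can be called from Powershell. This is only an illustration; the Test-CaptchaAnswer function and its parameters are made up for the demo and are not part of BlogEngine.NET.

```powershell
# Hypothetical helper that mirrors the two versions of the Validate line above.
function Test-CaptchaAnswer {
    param(
        [string]$Expected,
        [string]$Supplied,
        [switch]$IgnoreCase
    )
    if ($IgnoreCase) {
        # Same overload as the modified C# line.
        return $Expected.Equals($Supplied, [System.StringComparison]::OrdinalIgnoreCase)
    }
    # Same overload as the original C# line: ordinal and case sensitive.
    return $Expected.Equals($Supplied)
}

Test-CaptchaAnswer -Expected 'square' -Supplied 'Square'               # False
Test-CaptchaAnswer -Expected 'square' -Supplied 'Square' -IgnoreCase   # True
```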

Auditing Active Directory Inactive Users with Powershell and Other Cool Stuff

Hello again, fellow wanderers.

I was having a hell of a comment spam problem here for a couple days... hope I didn't accidentally delete any legitimate comments in the chaos. (Read this excellent comment left on my last DNS post.) Then I realized that I might ought to change the challenge question and response for my simple captcha from its default... I guess the spammers have the old "5+5=" question figured out. :P

A few years ago, I made my own simple captcha for another blog that was along the lines of x + y = ? using PHP, but x and y were randomly generated at each page load. Worked really well. The simple captcha that comes boxed with BlogEngine.NET here is static. Being able to load a random question and answer pair from a pool of questions would be a definite enhancement.

Anyway, since we're still on the topic of auditing Active Directory, I've got another one for you: Auditing "inactive" user accounts.

I had a persnickety customer that wanted to be kept abreast of all AD user accounts that had not logged on in exactly 25 days or more. As soon as one delves into this problem, one might realize that a command-line command such as dsquery user -inactive x will display users that are considered inactive for x number of weeks, but not days. I immediately suspected that there must be a reason for that lack of precision, as I knew that any sort of computer geek/engineer that wrote the dsquery utility would not have purposely left out that measure of granularity unless there was a good reason for it.

So what defines an "inactive" user? A user that has not logged on to his or her user account in a period of time. There is an AD attribute on each user called LastLogonTimeStamp. After a little research, I stumbled across this post, where it is explained that the LastLogonTimeStamp attribute is not terribly accurate - i.e., off by more than a week. Now that dsquery switch makes a lot more sense. I conjecture that the LastLogonTimeStamp attribute is inaccurate because Microsoft had to make a choice when designing Active Directory - either have that attribute updated every single time a user account is logged on to and thus amplify domain replication traffic and work for the DCs, or have it only updated periodically and save the replication load.

To further complicate matters, there is an Active Directory Powershell cmdlet called Search-ADAccount that reports a LastLogonDate attribute on the users it returns. As it turns out, LastLogonDate is not even a real attribute, but rather that particular Powershell cmdlet's mechanism for translating LastLogonTimeStamp into a more human-readable form (a .NET DateTime object).
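The raw LastLogonTimeStamp value is a Windows FILETIME: a 64-bit count of 100-nanosecond intervals since January 1, 1601 (UTC). The sketch below shows the conversion in both directions using only .NET DateTime methods; the 25-day sample value is fabricated for the demo and does not come from a directory.

```powershell
# A FILETIME of 0 is the start of the epoch.
$epochUtc   = [DateTime]::FromFileTimeUtc(0)
$epochLocal = [DateTime]::FromFileTime(0)    # same instant, shifted to the local time zone

"Epoch year (UTC): {0}" -f $epochUtc.Year    # 1601
"Epoch (local):    {0}" -f $epochLocal       # may land in late 1600, depending on time zone

# Made-up sample: build a FILETIME for a logon 25 days ago, then convert it back.
$now          = (Get-Date).ToUniversalTime()
$stamp        = $now.AddDays(-25).ToFileTimeUtc()
$lastLogon    = [DateTime]::FromFileTimeUtc($stamp)
$daysInactive = [int]($now - $lastLogon).TotalDays

"Days since last logon: {0}" -f $daysInactive
```

This is also why an empty value converted with FromFileTime shows up as a date around the year 1600 in time zones west of UTC.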

Next, there is another AD attribute - msDS-LogonTimeSyncInterval - that you can dial down to a minimum of 1 day. Doing so causes the users' LastLogonTimeStamp attribute to be updated (and replicated) much more frequently, which makes it more accurate. Of course, this comes at the expense of additional load on the DCs and replication traffic. This may be negligible in a small domain, but may have a significant impact on a large domain.

*ADSI Edit*
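If you'd rather not click through ADSI Edit, the same attribute can be read and set with the AD Powershell module. This is a sketch under assumptions: the attribute is set on the domain object itself, the distinguished name is the example one used in the script below, and you have rights to modify the domain object.

```powershell
Import-Module ActiveDirectory

# Example domain DN; substitute your own.
$domainDN = "dc=corpdom,dc=local"

# Read the current value. An empty result usually means the attribute is unset
# and the default interval is in effect.
Get-ADObject -Identity $domainDN -Properties "msDS-LogonTimeSyncInterval" |
    Select-Object DistinguishedName, "msDS-LogonTimeSyncInterval"

# Dial the interval down to 1 day.
Set-ADObject -Identity $domainDN -Replace @{ "msDS-LogonTimeSyncInterval" = 1 }
```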

Lastly, your other options for being able to accurately track the last logon time of users as close to "real-time" as possible involve scanning the security logs or attributes on all of your domain controllers and doing some heavy parsing. This is where event forwarding and subscriptions would really shine. See my previous post for details. I don't know about you guys, but all that sounds like a nightmare to me. Being able to track inactive user accounts to within 1 day is just going to have to suffice for now.

So we made the decision to decrease the msDS-LogonTimeSyncInterval, and I wrote this nifty Powershell script to give us the good stuff. Each major chunk of code is almost identical but with a minor tweak that represents the different use cases if given different parameters. Reading the comments toward the top on the five parameters will give you a clear picture of how the script works:

# ADUserAccountAudit.ps1
# Written by Ryan Ries on Jan 19 2012
# Requires the AD Powershell Module which is on 2k8R2 DCs and systems with RSAT installed.
# Locates "inactive" AD user accounts. Note that LastLogonTimeStamp is not terribly accurate.
# Accounts that have never been logged into will show up as having a LastLogonTimeStamp of some time
# around 1600 AD - 81 years after the death of Leonardo da Vinci.
# This is because even though their LastLogonTimeStamp attribute is null, we cast it to a DateTime object
# regardless, which converts null inputs into a minimum date, apparently.
# For specific use with NetIQ AppManager, put this script on the agent machine at 
# C:\Program Files (x86)\NetIQ\AppManager\bin\Powershell (for 64 bit Windows. Just "Program Files" if 32 bit Windows.)

Param([string]$DN = "dc=corpdom,dc=local",         # LDAP distinguished name for domain
      [string]$domainName = "Corpdom",             # This can be whatever you want it to be
      [int]$inactiveDays = 25,                     # Users that have not logged on in this number of days will appear on this report
      [bool]$includeDisabledAccounts = $false,     # Setting this to true will include accounts that are already disabled in the report as well
      [bool]$includeNoLastLogonAccounts = $false)  # Setting this to true will include accounts that have never been logged into and thus have no LastLogonTimeStamp attribute.

# First, load the Active Directory module if it is not already loaded
$ADmodule = Get-Module | Where-Object { $_.Name -eq "activedirectory" } | Foreach { $_.Name }
if($ADmodule -ne "activedirectory")
{
   Import-Module ActiveDirectory
}

if($includeDisabledAccounts -eq $false)
{
   if($includeNoLastLogonAccounts -eq $false)
   {
      # Enabled accounts only, skipping accounts that have never logged on
      Write-Host "Enabled users that have not logged into $domainName in $inactiveDays days`r`nExcluding accounts that have never been logged into`r`nAccounts younger than $inactiveDays days not shown.`r`n-------------------------------------------------------"
      Search-ADAccount -UsersOnly -SearchBase "$DN" -AccountInactive -TimeSpan $inactiveDays`.00:00:00 |
      Where-Object {$_.Enabled -eq $true -And $_.LastLogonDate -ne $null } |
      Get-ADUser -Properties Name, sAMAccountName, givenName, sn, lastLogonTimestamp, Enabled, WhenCreated |
      Where-Object {$_.WhenCreated -lt (Get-Date).AddDays(-$($inactiveDays)) } |
      Select sAMAccountName, givenName, sn, @{n="LastLogonTimeStamp";e={[DateTime]::FromFileTime($_.LastLogonTimestamp)}}, Enabled, WhenCreated |
      Sort-Object LastLogonTimeStamp
   }
   else
   {
      # Enabled accounts only, including accounts that have never logged on
      Write-Host "Enabled users that have not logged into $domainName in $inactiveDays days`r`nIncluding accounts that have never been logged into`r`nAccounts younger than $inactiveDays days not shown.`r`n-------------------------------------------------------"
      Search-ADAccount -UsersOnly -SearchBase "$DN" -AccountInactive -TimeSpan $inactiveDays`.00:00:00 |
      Where-Object {$_.Enabled -eq $true } |
      Get-ADUser -Properties Name, sAMAccountName, givenName, sn, lastLogonTimestamp, Enabled, WhenCreated |
      Where-Object {$_.WhenCreated -lt (Get-Date).AddDays(-$($inactiveDays)) } |
      Select sAMAccountName, givenName, sn, @{n="LastLogonTimeStamp";e={[DateTime]::FromFileTime($_.LastLogonTimestamp)}}, Enabled, WhenCreated |
      Sort-Object LastLogonTimeStamp
   }
}
else
{
   if($includeNoLastLogonAccounts -eq $false)
   {
      # Enabled and disabled accounts, skipping accounts that have never logged on
      Write-Host "All users that have not logged into $domainName in $inactiveDays days`r`nExcluding accounts that have never been logged into`r`nAccounts younger than $inactiveDays days not shown.`r`n------------------------------------------------------"
      Search-ADAccount -UsersOnly -SearchBase "$DN" -AccountInactive -TimeSpan $inactiveDays`.00:00:00 |
      Where-Object { $_.LastLogonDate -ne $null } |
      Get-ADUser -Properties Name, sAMAccountName, givenName, sn, lastLogonTimestamp, Enabled, WhenCreated |
      Where-Object { $_.WhenCreated -lt (Get-Date).AddDays(-$($inactiveDays)) } |
      Select sAMAccountName, givenName, sn, @{n="LastLogonTimeStamp";e={[DateTime]::FromFileTime($_.lastlogontimestamp)}}, Enabled, WhenCreated |
      Sort-Object LastLogonTimeStamp
   }
   else
   {
      # Enabled and disabled accounts, including accounts that have never logged on
      Write-Host "All users that have not logged into $domainName in $inactiveDays days`r`nIncluding accounts that have never been logged into`r`nAccounts younger than $inactiveDays days not shown.`r`n------------------------------------------------------"
      Search-ADAccount -UsersOnly -SearchBase "$DN" -AccountInactive -TimeSpan $inactiveDays`.00:00:00 |
      Get-ADUser -Properties Name, sAMAccountName, givenName, sn, lastLogonTimestamp, Enabled, WhenCreated |
      Where-Object {$_.WhenCreated -lt (Get-Date).AddDays(-$($inactiveDays)) } |
      Select sAMAccountName, givenName, sn, @{n="LastLogonTimeStamp";e={[DateTime]::FromFileTime($_.lastlogontimestamp)}}, Enabled, WhenCreated |
      Sort-Object LastLogonTimeStamp
   }
}

So there you have it, a quick and dirty report to locate users that have been inactive for over x days. Accounts that were just created and not logged on to yet would have a LastLogonTimeStamp of null and would therefore show up in this report, so I threw the Where-Object {$_.WhenCreated -lt (Get-Date).AddDays(-$($inactiveDays)) } bit in there to always exclude user accounts that are younger than the number of days required to consider an account "inactive." Furthermore, you might want to resist the urge just now to go a step further and programmatically disable inactive user accounts. Most organizations use service accounts and other special accounts that may not get logged into very often, and yet all hell would break loose if you disabled them. I'm considering a system that disables the accounts, but also reads in a list of accounts which are "immune" and would therefore be ignored by the program. For a future post, I guess.
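As a rough sketch of that "immune list" idea (not a finished tool): the filtering part is just a case-insensitive exclusion. The function name and the account names below are hypothetical, and nothing here touches Active Directory or disables anything.

```powershell
# Hypothetical helper: returns the inactive accounts that are NOT on the immune list.
function Select-DisableCandidate {
    param(
        [string[]]$InactiveAccounts,
        [string[]]$ImmuneAccounts
    )
    # -notcontains compares case-insensitively, which suits sAMAccountNames.
    $InactiveAccounts | Where-Object { $ImmuneAccounts -notcontains $_ }
}

# Made-up sample data standing in for the report output and an immune list.
$inactive = 'jsmith', 'svc-backup', 'mjones', 'SVC-SQL'
$immune   = 'svc-backup', 'svc-sql'

$candidates = @(Select-DisableCandidate -InactiveAccounts $inactive -ImmuneAccounts $immune)
$candidates    # jsmith, mjones
```

In a real run, the candidates could be piped to Disable-ADAccount with -WhatIf first, so you can see what would happen before anything actually changes.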

Lastly, I want to thank Ned of the AskDS blog, without whom this post would not have been possible. (Now it sounds like a Grammy speech...) But seriously, I asked him about this stuff and he knew all the answers right away. Helped me out immeasurably on this.

Auditing Active Directory User Creation: A Simple Approach

Hello again. Since websites like reddit, Wikipedia and plenty others are blacked out today in protest of the Internet censorship bills SOPA and PIPA, it gives me plenty of time that I would have otherwise wasted surfing the web to contact my representatives and tell them that I, as a constituent, strongly urge them to reconsider their support of these bills... and then to write a blog post about Active Directory change auditing.

Recently, someone explained to me how in their company, they had some third-party software foisted upon them that automatically generated new user accounts. I don't know what the software was for, but understandably, this made him feel a little uncomfortable. We administrators don't particularly enjoy giving HAL-9000 the keys to manipulate our Active Directories with little insight into what it's actually doing.

So with that in mind, he asked me if there was a way to audit new user account creation, and then to go a step further and actually perform some action whenever a new user account was created.

There are lots of third-party Active Directory auditing tools that companies would love to sell you, but let's put on our engineer hats and bang something out using only built-in Windows tools. Let's pretend that our boss just told us there's no budget for buying new software and this task must be completed by lunch, or else you're fired. There are undoubtedly many different ways of going about auditing Active Directory changes, and this is but one way. It may or may not be the best way, but perhaps it will give you some ideas. This information is written specifically using Windows 2008 R2.

When a new user account is created, a slew of events are recorded in the Security event log on the domain controller on which the user account was created. In order of occurrence:

  • 4720 - A user account was created.
  • 4724 - An attempt was made to reset an account's password.
  • 4738 - A user account was changed. (Repeated 4x)
  • 4722 - A user account was enabled.

If you only have one domain controller in your domain, you can pretty much stop right here - your work is done.  Simply right-click the event in Event Viewer, select "Attach Task To This Event," and insert the name of your Powershell script or executable or email address you want to send notification to, etc.

But most of us have more than one domain controller, and those aforementioned Security events are not logged on every domain controller - only the DC on which the user was initially created, and there's no practical way to ensure that user accounts are only created on one DC. I was hoping that, since the PDC Emulator is involved in every password reset, I would at least get an event on my PDCe implying that user account creation had taken place on another DC, but I found no such events on the PDCe. There was only a generic Logon event originating from the auxiliary DC at the exact moment that the user account was created. Furthermore, even if I had found an event 4724 on the PDCe, there probably would have been no way to distinguish between that event and one that accompanied an existing user's routine password change anyway.

So to solve for this, let's set up event subscriptions! (I suppose you could just go around and set up identical tasks on each DC... but I want to do event subscriptions!) On the server that you want to collect events from other sources, just click "Subscriptions" in the left pane of Event Viewer:

*Do it!*

I just happened to choose my main DC as the event subscriber for this test. It should also be noted that at the command line, you can use wecutil.exe and its brother wevtutil.exe to accomplish these same goals, but we're going to use the GUI.

Now right-click on Subscriptions and Create Subscription:

Fill out the information. You're going to want your subscriber to go get events from your other DC. When you select the computers from which you want to collect events, you can test them before you commit the changes, which is nice. You're going to want to make sure that the Windows Remote Management (WS-Management) service, also known as WinRM, is running... and also that it is configured. To do this, simply run winrm quickconfig on all the machines involved. This can also be done via GPO so that your new machines will be configured automatically as they're deployed.

Now the connectivity test from your subscriber should succeed, and you'll be ready to subscribe to events from the other machine. If the test is still failing, double check Windows Firewall, any other firewalls in the way, that the WinRM service is running and configured on the remote machine, and name resolution. Now back on our event collector machine, make sure and set up your filter to only get Security event 4720's.
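Behind the GUI, an event filter like this is an XML query wrapped around an XPath expression. The sketch below builds that query text for one or more event IDs; New-EventIdQuery is a made-up helper name, and the query layout is the standard QueryList format that Event Viewer shows on the XML tab of its filter dialog.

```powershell
# Hypothetical helper: builds a QueryList document selecting the given event IDs from one log.
function New-EventIdQuery {
    param(
        [string]$LogName = 'Security',
        [int[]]$EventId = @(4720)
    )
    # e.g. "EventID=4720 or EventID=4722"
    $conditions = ($EventId | ForEach-Object { "EventID=$_" }) -join ' or '
    $xpath = "*[System[($conditions)]]"
@"
<QueryList>
  <Query Id="0" Path="$LogName">
    <Select Path="$LogName">$xpath</Select>
  </Query>
</QueryList>
"@
}

$query = New-EventIdQuery -LogName 'Security' -EventId 4720
$query
```

The same text can be handed to Get-WinEvent on the collector, for example Get-WinEvent -FilterXml ([xml](New-EventIdQuery -LogName 'ForwardedEvents')), to check what has arrived so far.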

Alright you're done! Now at this point, events from DC02 will pop up in the "Forwarded Events" log on DC01. If you have any problems with your forwarded events not showing up, right-click on the subscription and choose "Runtime Status". This will alert you to any additional problems. In my case, I was still getting an "Access Denied" when trying to read the logs on DC02. The reason was that the subscription was configured to run under the Machine Account. I switched it to a user account that had the correct permissions to read the logs on DC02, and it worked just fine. If you get just an EventID 111 in the Forwarded Events log on your collector, remember that you need to run winrm quickconfig on both machines - the forwarder and the forwardee.

You can now attach a custom task to either these forwarded events, or the entire Forwarded Events log as a whole.

DNS 101: Round Robin (Or Back When I was Young And Foolish Part II)

I learned something today. It's something that made me feel stupid for not knowing. Something that seemed elemental and trivial - yet, I did not know it. So please, allow me to relay my somewhat embarrassing learning experience in the hopes that it will save someone else from the same embarrassment.

I did know what DNS round robin was. Or at least, I would have said that I did.

Imagine you configure DNS1, as a DNS server, to use round robin. Then, you create 3 host (A or AAAA) records for the same host name, using different IPs. Let's say we create the following A records on DNS1:

server01 - A
server01 - A
server01 - A

Then on a workstation which is configured to use DNS1 as a DNS server, you ping server01. You receive as a reply. You ping server01 again. With no hesitation, you get a reply from again. We assume that your local workstation has cached locally and will reuse that IP for server01 until the entry either expires, or we flush the DNS cache on the workstation with a command like ipconfig/flushdns.

I run ipconfig/flushdns. Then I ping server01 again.

This time I receive a response from Now I assume DNS round robin is working perfectly. I go home for the day feeling like I know everything there is to know about DNS.

But is the DNS server really responding to each DNS query with the single next A/AAAA record that it has on file, in a sequential, round-robin fashion? That is what I assumed.

But the fact of the matter is that DNS servers, when queried for a host name, actually return a list of all A/AAAA records associated with that host name, every time that host name is queried for. (To a point - the list must fit within a UDP packet, and some firewalls/filters don't let UDP packets longer than 512 bytes through. That's changing though. Our idea of how big data is and should be allowed to be is always growing.)

I assume that, being one of the busiest websites in the world, has not only some global load balancing and other advanced load balancing techniques employed, but probably also has more than one host record associated with it. To test my theory, I fire up Wireshark and start a packet capture. I then flush my local DNS cache with ipconfig/flushdns and then ping

Notice how I pinged it, got one IP address in response (.148), then flushed my DNS cache, pinged it again and got another different IP address (.144)? But despite what it may look like, that name server is not returning just one A/AAAA record each time I query it:


My workstation is ::9. My workstation's DNS server is ::1. The DNS server is configured to forward DNS requests for zones for which it is not authoritative on to yet another DNS server. So I ask for, my DNS server doesn't know, so it forwards the request. The forwardee finally finds out and reports back to my DNS server, which in turn relays back to me a list of all the A records for I get a long list containing not only a mess of A records, but a CNAME thrown in there too, all from a single DNS query! (We're not worried about the subsequent query made for an AAAA record right now. Another post perhaps.)

I was able to replicate this same behavior in a sanitary lab environment running a Windows DNS server, using the server01 example I mentioned earlier.

Where round robin comes in is that it rotates the order of the list given to each subsequent client who requests it. Keep in mind that while round robin-ing the A records in your DNS replies does supply a primitive form of load distribution, it's a pretty poor substitute for real load balancing, since if one of the nodes in the list goes down, the DNS server will be none the wiser and will continue handing out the list with the downed node's IP address on it.
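To make the rotation concrete, here is a toy simulation. It is not how any particular DNS server implements round robin; it just shows the idea that every answer carries the full list and only the starting point moves. The function name is made up, and the addresses come from the 192.0.2.0/24 documentation range.

```powershell
# Hypothetical helper: returns the full record list, rotated by the query number.
function Get-RotatedRecords {
    param(
        [string[]]$Records,
        [int]$QueryNumber
    )
    $offset = $QueryNumber % $Records.Count
    for ($i = 0; $i -lt $Records.Count; $i++) {
        # Walk the list starting at the offset, wrapping around at the end.
        $Records[($offset + $i) % $Records.Count]
    }
}

# Placeholder A records for server01 (documentation addresses, not real hosts).
$records = '192.0.2.11', '192.0.2.12', '192.0.2.13'

foreach ($query in 0..3) {
    $answer = @(Get-RotatedRecords -Records $records -QueryNumber $query)
    "Query {0}: {1}" -f $query, ($answer -join ', ')
}
```

A client that only ever looks at the first entry (like ping) sees a different IP each time, which is exactly what fooled me.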

Lastly, since we know that our client is receiving an entire list of A records for host names which have many IP addresses, what does it actually do with the list? Well, the ping utility doesn't do much. If the first IP address on the list is down, you get a destination unreachable message and that's it. (Leading to a lot of people not realizing they have a whole list of IPs they could try.) Web browsers, however, have a nifty feature known as "browser retry" or "client retry," where they will continue trying the other IPs in the list until they find a working one. Then they will cache the working IP address so that the user does not continue to experience the same delay in web page loading as they did the first time. Yes, there are exploits concerning this feature, and yes, it's probably a bad idea to rely on it, since browser retry is implemented differently across browsers and operating systems. It's a relatively new mechanism actually, and people may not believe you if you tell them. To prove it to them, find (or create) a host name which has several bad IPs and one or two good ones. Now telnet to that hostname. Even telnet (a modern version from a modern operating system) will use getaddrinfo() instead of gethostbyname(), and if it fails to connect to the first IP, you can watch it continue trying the next IPs in the list.
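The same "try the next address" behavior is easy to sketch in Powershell with the .NET Dns and TcpClient classes. This is an illustration of the retry idea only, not how telnet or any browser is implemented; the function name, parameters and the 2-second timeout are assumptions made for the demo.

```powershell
# Hypothetical helper: resolves a name, then tries each returned address in order
# until a TCP connection succeeds. Returns the address that worked, or $null.
function Connect-FirstReachable {
    param(
        [Parameter(Mandatory = $true)][string]$HostName,
        [Parameter(Mandatory = $true)][int]$Port,
        [int]$TimeoutMs = 2000
    )
    # One lookup hands back every address on file for the name.
    $addresses = [System.Net.Dns]::GetHostAddresses($HostName)
    foreach ($address in $addresses) {
        # Match the socket family (IPv4 or IPv6) to the address being tried.
        $client = New-Object System.Net.Sockets.TcpClient($address.AddressFamily)
        try {
            if ($client.ConnectAsync($address, $Port).Wait($TimeoutMs) -and $client.Connected) {
                return $address.ToString()
            }
        }
        catch {
            # Refused or unreachable: fall through and try the next address.
        }
        finally {
            $client.Close()
        }
    }
    return $null
}
```

Point it at a host name with a mix of dead and live A records, for example Connect-FirstReachable -HostName server01 -Port 80, and it should come back with the first IP that actually answered.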

More info here, here and here. That last link is an MSDN doc on getaddrinfo(). Notice that it does talk about different implementations on different operating systems, and that ppResult is "a pointer to a linked list of one or more addrinfo structures that contains response information about the host."