Building a Windows Server That Boots With No Errors (Pt. 1 of NaN)

One of my favorite pastimes (hey don't judge me) is properly configuring my Windows Servers so that they complete a cold boot and log on of a user without a single error in the Application or System event logs.  Even if the errors that are logged don't seem to have any impact on the system, I still don't want to see them.  Maybe some people don't care that errors are being generated by Windows during bootup, as long as the server still "works fine," but I do.  This might sound silly to you at first, but take Windows 2008 R2 SP1 as an example.   Right off the shelf, freshly installed with no modification or installed applications whatsoever, it is unable to boot and log on a user without logging an error in the event viewer.  Windows usually requires at least a little bit of jiggering out of the box before it boots with no errors in the event logs.  And that's before you even begin installing applications and changing up configurations into a virtually infinite number of permutations that could cause errors during system boot.

And that brings me to #1:

WMI Error 10*WMI Error 10*

Microsoft left this little bit of errant code buried deep within the bowels of their service pack 1 update for Windows 2008 R2 and Windows 7.  You'll see it once every time you boot the machine.  Luckily, it's a snap to fix.  Just save this VBscript as a *.vbs file and run it.  All it does is remove the vestigial reference that causes the error:

strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" _
& strComputer & "\root\subscription")

Set obj1 = objWMIService.ExecQuery("select * from __eventfilter where name='BVTFilter' and query='SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA ""Win32_Processor"" AND TargetInstance.LoadPercentage > 99'")

For Each obj1elem in obj1
  set obj2set = obj1elem.Associators_("__FilterToConsumerBinding")
  set obj3set = obj1elem.References_("__FilterToConsumerBinding")

  For each obj2 in obj2set
    WScript.echo "Deleting the object"
    WScript.echo obj2.GetObjectText_
    obj2.Delete_
  next

  For each obj3 in obj3set
    WScript.echo "Deleting the object"
    WScript.echo obj3.GetObjectText_
    obj3.Delete_
  next

  WScript.echo "Deleting the object"
  WScript.echo obj1elem.GetObjectText_
  obj1elem.Delete_
Next

Here's the MS KB.

Here's another easy one... #2:

Error 7030*Service Control Manager Error 7030*

I've seen the SNMP service (which Microsoft is deprecating,) trigger this one on a Windows 2008 R2 SP1 server, as well as numerous third-party services.  The bottom line is that services aren't allowed to interact with the desktop anymore since 2008/Vista.  It's not supported by Microsoft any more, it's never a good idea, and if you are thinking about writing a Windows service that is meant to interact with the desktop of a logged-on user, then you should rethink it because your idea is wrong and stupid.  Most of all on a server.  I can think of few things that annoy me more than crusty old line-of-business applications that run on servers, but weren't actually designed for servers. (i.e., with GUI interaction required, stupid stuff like requiring that a person be logged on interactively to launch the app and then stay logged on indenfinitely, etc.)  Ugh.

The fix is to simply uncheck the "Allow service to interact with the desktop" checkbox in the properties of the service.  It is not supported any more and is only still there for compatibility with legacy code.  If a service (which runs in Session 0) tries to interact with the user's desktop, the user will get a popup message like this:

Interactive service*Session 0 Knocking!*

If you view the message, your own desktop will be suspended and you will be transferred to the twilight zone of Session 0's desktop, until you choose to return. The bottom line is if you have a Windows service that does this on modern versions of Windows, then that service is not compatible with Server 2008 and above, no matter what the vendor says.  As a developer, if you really need this ability to interact with a user's desktop, which you don't, you might consider doing something like using the Terminal Services API (wtsapi32.dll) to identify the active user sessions and starting a process in that session.

If you want to see whether a Windows service is configured to interact with the desktop using Powershell:

$svc = $(Get-WmiObject -Query "select * from win32_service where Name = 'MyService").Properties
$svc["DesktopInteract"].Value

False

Microsoft schools us on thie issue here and here.

And finally, #3:

DCOM Error 10016
*DCOM Error 10016*

This is the most interesting error of the three in my opinion.  It typically only happens after you start installing applications.  The event tells you that you should go have a look at the Component Services admin tool (mmc snapin,) but frankly a lot of admins don't know much about what the Component Services snapin does or how it works.  It can be somewhat daunting:

DCOM MMC

Well we know the error is telling us that a particular account (such as SYSTEM S-1-15-18) doesn't have some sort of access to some particular doodad in here.  But how do we find that AppID GUID that it mentions?  We could be idiots and right-click on every single node in this entire snapin one at a time... or we could be smart and take our search to the registry.  Look for HKEY_CLASSES_ROOT\AppID\{APPID-GUID}. That should tell you the name of the offending COM component.  All you have to do now is go back to the Component Services snapin, find the name of that component, go the security properties of it, and edit the security ACL of that component such that what ever account the event log was bitching about is given whatever access it wanted.  If you find that that the security properties of the component are greyed out so that you can't edit it, that's probably because TrustedInstaller has that on lockdown.  Go back to the registry, find the corresponding reg key, take ownership/give yourself permissions to it as necessary, restart the service (or reboot the OS,) and then you will be able to modify the security settings on that COM component.

I saw this myself just yesterday with the "SMS Agent" DCOM application.  The SMS (or SCCM) agent came preinstalled on the standard OS image that was being deployed to the machines I was working on.

So this has been the first in a series of me sharing one of my personal hobbies - making sure there are no application or system errors when Windows boots up.  If you have any boot or logon errors in the same spirit as those I've discussed here, feel free to drop me a line and I will feature your errors in a future post!

Azure Outage, The File Cabinet Blog, Etc.

Got a mixed bag this weekend... I'm still busier than usual at work, so my thoughts have been more scattered lately. I'll just start typing and we'll see where it takes us.

First, Windows Azure. I just put this very blog on Azure not two days ago, and then yesterday they suffered a massive, embarrassing, world-wide secure storage outage. I say embarrassing because it was caused by an expired SSL certificate. That's right - a world-wide outage lasting hours that could have been prevented by someone taking 5 minutes to renew a certificate.

But let's get our facts straight here - Windows Azure didn't completely go down. It was specifically their secure storage services that went down, which I heard also affected provisioning of new resources, as well as a lot of unhappy customers who were running large storage and SQL operations over SSL that relied on that certificate. The outage didn't affect any HTTP or non-SSL traffic, so this blog just sat here relatively unscathed. Of course, I feel for those who put enterprise workloads up on Azure and did get hurt by the outage, but Azure is certainly not the first public cloud service to suffer a large-scale outage, and it won't be the last. But what makes this outage so particularly poignant is that it was so easily preventable with even the most basic of maintenance plans. Such as, I dunno, maybe a sticky note on someone's monitor reminding them to renew this immensely important SSL cert.

Here are a couple of screenshots, for the Schadenfreude:

 

Technet Forums

 

Azure Dashboard

 

Next thing on the list is pretty old news, but I never mentioned anything about it here. Ned Pyle, formerly of the Ask the Directory Services Team blog, has moved to Microsoft's home base and is now on the storage product team. He's still blogging though, on the File Cabinet blog. I don't know him well, but we have exchanged a few emails and blog comments about AD stuff and such. Up there with Russinovich in my opinion, he is one of my favorite tech people on the internet though, because his blogging is both entertaining and very educational. Lots of respect. Plus, check out the comments on his latest post here, where I correctly answer some AD arcana and get some validation from one of my heroes.

Now if I can just get the PFEs to make another one of those Microsoft Certified Master Active Directory posts, I'd be as happy as a civet eatin' coffee berries.

Now Powered By Windows Azure

I transferred this blog to Windows Azure today. Up until today, I've been hosting this blog from inside my home. While I can boast almost zero unplanned downtime even from my mostly consumer-grade hardware and residential internet connection, I felt it was time to hoist this blog up into a slightly more professional and resilient environment. That way I don't have to worry about backups and hardware failures on my own gear taking down this blog. And it gives me more flexibility in terms of being able to tear my home lab apart and rearrange it without having to affect this blog.

It was very easy to move the blog to the new VM on Azure, which is good, because I've been so busy at work lately that I've had time for little else... such as blogging. I had to sign up especially for the "preview" of VM hosting from Azure, as it is apparently still in the preview phase. You get about a 50% discount until it goes General Availability. Anyway, both the virtual machine and the portal have both worked perfectly so far and I would not be surprised if it was really close to GA. The portal looks nice, polished, and works well. Comes with a nice, basic resource monitor so you can see your VM's CPU, memory, network usage, etc. over time from the web portal. The price is pretty low. Definitely lower than some other providers. My external IP address won't change. Plus they can host Server 2012 VMs, which some other providers are still catching up to. I chose the absolute slowest, lowest-spec VM that they would give me, because of the price. So the machine is a little slower than it was running on my own gear, but it's still enough for this measly blog. After the VM was done being imaged, I loaded IIS and SMTP on it, simply dumped the whole inetpub directory straight from my home machine into the VM, configured SMTP (so comment emails can be sent from this blog to my Gmail account, etc.,) and then turned the GUI off on the server with Powershell and logged out. Piece of cake.

Azure actually offers a lot of different hosting capabilities - not just virtual machines. In fact I don't think Azure even realized that they would offer IaaS when they first set out. But I chose the VM option because I'm most comfortable managing the OS myself... learning how to set up Visual Studio to publish websites from my desktop straight into an Azure service is totally new to me and I haven't even begun learning how to do that yet.

Why Are You Talking To THAT Domain Controller!?

I was in Salt Lake City most of this week. Being surrounded by stark snow-covered mountains made for some wonderful scenery... it could not be more different than it is here in Texas. Plus I got to meet and greet with a bunch of Novell and NetIQ people. And eat an enormous bone-in ribeye that no human being has any business eating in one sitting.

But anyway, here's a little AD mystery I ran in to a couple weeks ago, and it may not be as simple as you first think.

As Active Directory admins, something we're probably all familiar with is member servers authenticating with the "wrong" domain controller. By wrong, I mean a DC that is in a different site than the member server, when there's a perfectly fine DC right there in the same site as the member server, and so the member server is incurring cross-site communication when it doesn't need to be. Everything might still function well enough as long as the communication between DC and member server is successful, but now you're saturating your slower inter-site WAN links with AD traffic when you don't need to be. You should want your AD replication, group policy application, DFS referrals, etc., to run like a well-oiled machine.

I often work in a huge environment with AD sites in many countries and on multiple continents, and thousands of little /26 subnets that can't always be easily grouped into a predictable supernet for the purposes of linking subnets to sites in AD Sites & Subnets. So I'm always alert to the fact that if I log on to a server, and I notice that logon takes an abnormally long time, I very well could be logging on to the wrong DC. First, I run set log to see which DC I have logged on to:

set log*DC01 is in Amsterdam*

So in this case, I noticed that while I had logged on to a member server in Dallas, that server's logon server was a DC in Europe. :(

You immediately think "The server's IP subnet isn't defined in AD Sites & Services or is associated to the wrong site," don't you?  Yeah, me too. So I went and checked. Lo and behold, the server's IP subnet was properly defined and associated to the correct site in AD.

Now we have a puzzle. Back on the member server, I run nltest /dsgetsite to verify that the domain member does know to which site it belongs. (Which the domain member's NetLogon service stores in the registry in the DynamicSiteName value once it's discovered.)

I also ran nltest /dsgetdc:domain.com /Account:server01$ to essentially emulate the DC locator and selection process for that server, which basically just confirmed what we already knew:

C:\Users\Administrator>nltest /dsgetdc:domain.com /Account:server01$ 
           DC: \\DC01.DOMAIN.COM (In Amsterdam) 
      Address: \\10.0.2.55 
     Dom Guid: blah-blah-blah 
     Dom Name: DOMAIN.COM 
  Forest Name: DOMAIN.COM 
 Dc Site Name: Amsterdam 
Our Site Name: Arlington 
        Flags: GC DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST FUL
L_SECRET WS 
The command completed successfully

So where do we look next if there's no problem with the IP subnets in AD Sites & Services?  I'm going with DNS. We know that domain controllers register site-specific SRV records so that clients who know to which site they belong will know what DNS query to make to find domain controllers specific to their own site.  So what DNS records did we find for the Arlington site?

Forward Lookup Zones
    _msdcs.domain.com
        dc
            _sites
                Arlington
                    _kerberos SRV NewYorkDC
                    _kerberos SRV SanDiegoDC
                    _kerberos SRV MadridDC
                    _kerberos SRV ArlingtonDC
                    _ldap     SRV NewYorkDC
                    _ldap     SRV SanDiegoDC
                    _ldap     SRV MadridDC
                    _ldap     SRV ArlingtonDC

OK, now things are getting weird.  All of these other domain controllers that are not part of the Arlington site have registered their SRV records in the Arlington site.  The only way I can imagine that happening is because of Automatic Site Coverage, whereby domain controllers will register their own SRV records into sites where it is detected that the site has no domain controllers of its own... combined with the fact that scavenging is turned off for the DNS server, including the _msdcs zone.  So someone, once upon a time, must have created the Arlington site in AD before the actual domain controllers for Arlington were ready.  What's more is that Automatic Site Coverage is supposed to intelligently use site link costing so that only the domain controllers in the next closest site provide "coverage" for the site with no DCs, not every DC in the domain. Turns out the domain did not have a site link strategy either - it used DEFAULTIPSITELINK for everything - the entire global infrastructure. So even after Arlington did get some domain controllers, the SRV records from all the other DCs stayed there because of no scavenging.

Here's the thing though - did you notice that almost every other domain controller in the domain had SRV records registered in the Arlington site, except for the domain controller in Amsterdam that our member server actually authenticated to!?

This is getting kinda' nuts.  So what else, besides the DNS query, does a member server perform in order to locate a suitable domain controller?

So after a client does a DNS query for _ldap._tcp.SITENAME._sites.ForestDnsZones.domain.com, and gets a response, the client then begins to do LDAP queries against the DCs given in the DNS response to make sure that the DCs are alive and servicing requests. If you want to see this for yourself, I recommend starting Wireshark, and then restarting the NetLogon service while the capture is running. If it turns out that none of the DCs in the list that was returned by the site-specific DNS query is responding to your LDAP queries, then the client has to back up and try again with a domain-wide query.

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. So after that failed, the server was forced to try again with a domain-wide query, where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.

Moral of the story: Always blame the network guys.