Building a Windows Server That Boots With No Errors (Pt. 1 of NaN)

One of my favorite pastimes (hey, don't judge me) is properly configuring my Windows Servers so that they complete a cold boot and user logon without a single error in the Application or System event logs. Even if the errors that are logged don't seem to have any impact on the system, I still don't want to see them. Maybe some people don't care that Windows generates errors during bootup, as long as the server still "works fine," but I do. This might sound silly to you at first, but take Windows Server 2008 R2 SP1 as an example. Right off the shelf, freshly installed with no modifications or installed applications whatsoever, it is unable to boot and log on a user without logging an error in the event viewer. Windows usually requires at least a little bit of jiggering out of the box before it boots with no errors in the event logs. And that's before you even begin installing applications and changing configurations into a virtually infinite number of permutations that could cause errors during system boot.

And that brings me to #1:

*WMI Error 10*

Microsoft left this little bit of errant code buried deep within the bowels of their Service Pack 1 update for Windows 2008 R2 and Windows 7. You'll see it once every time you boot the machine. Luckily, it's a snap to fix. Just save this VBScript as a .vbs file and run it. All it does is remove the vestigial reference that causes the error:

' Connect to the WMI event subscription namespace on the local machine.
strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" _
& strComputer & "\root\subscription")

' Find the leftover BVTFilter event filter that triggers the error.
Set obj1 = objWMIService.ExecQuery("select * from __eventfilter where name='BVTFilter' and query='SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA ""Win32_Processor"" AND TargetInstance.LoadPercentage > 99'")

For Each obj1elem in obj1
  ' obj2set: objects associated with the filter through a binding (the consumers).
  ' obj3set: the __FilterToConsumerBinding instances themselves.
  set obj2set = obj1elem.Associators_("__FilterToConsumerBinding")
  set obj3set = obj1elem.References_("__FilterToConsumerBinding")

  ' Delete the associated objects first...
  For each obj2 in obj2set
    WScript.echo "Deleting the object"
    WScript.echo obj2.GetObjectText_
    obj2.Delete_
  next

  ' ...then the bindings...
  For each obj3 in obj3set
    WScript.echo "Deleting the object"
    WScript.echo obj3.GetObjectText_
    obj3.Delete_
  next

  ' ...and finally the filter itself.
  WScript.echo "Deleting the object"
  WScript.echo obj1elem.GetObjectText_
  obj1elem.Delete_
Next

Here's the MS KB.

Here's another easy one... #2:

*Service Control Manager Error 7030*

I've seen the SNMP service (which Microsoft is deprecating) trigger this one on a Windows 2008 R2 SP1 server, as well as numerous third-party services. The bottom line is that services haven't been allowed to interact with the desktop since Vista/2008. It's not supported by Microsoft anymore, it's never a good idea, and if you are thinking about writing a Windows service that is meant to interact with the desktop of a logged-on user, then you should rethink it because your idea is wrong and stupid. Most of all on a server. I can think of few things that annoy me more than crusty old line-of-business applications that run on servers but weren't actually designed for servers (i.e., with GUI interaction required, stupid stuff like requiring that a person be logged on interactively to launch the app and then stay logged on indefinitely, etc.). Ugh.

The fix is to simply uncheck the "Allow service to interact with the desktop" checkbox in the properties of the service.  It is not supported any more and is only still there for compatibility with legacy code.  If a service (which runs in Session 0) tries to interact with the user's desktop, the user will get a popup message like this:

*Session 0 Knocking!*

If you view the message, your own desktop will be suspended and you will be transferred to the twilight zone of Session 0's desktop until you choose to return. The bottom line is that if you have a Windows service that does this on modern versions of Windows, then that service is not compatible with Server 2008 and above, no matter what the vendor says. As a developer, if you really need this ability to interact with a user's desktop, which you don't, you might consider using the Terminal Services API (wtsapi32.dll) to identify the active user sessions and then starting a process in the session you want.

To check with PowerShell whether a Windows service is configured to interact with the desktop:

$svc = $(Get-WmiObject -Query "select * from win32_service where Name = 'MyService'").Properties
$svc["DesktopInteract"].Value

False
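
For what it's worth, that checkbox corresponds to a bit in the service's type value: SERVICE_INTERACTIVE_PROCESS (0x100), one of the documented Win32 service type flags. Here is a rough Python sketch of testing and clearing that bit. The helper names are my own, and the sample value 0x110 is only illustrative. On a real machine the number would come from the service's configuration (for example the Type value under its Services registry key), which this sketch does not read or write.

```python
# Win32 service type flags, as defined in the Windows SDK headers.
SERVICE_WIN32_OWN_PROCESS = 0x10
SERVICE_WIN32_SHARE_PROCESS = 0x20
SERVICE_INTERACTIVE_PROCESS = 0x100


def is_interactive(service_type: int) -> bool:
    """Return True if the 'interact with the desktop' bit is set."""
    return bool(service_type & SERVICE_INTERACTIVE_PROCESS)


def clear_interactive(service_type: int) -> int:
    """Return the same service type with the interactive bit removed."""
    return service_type & ~SERVICE_INTERACTIVE_PROCESS


if __name__ == "__main__":
    # Illustrative value: an own-process service with the interactive bit set.
    sample = SERVICE_WIN32_OWN_PROCESS | SERVICE_INTERACTIVE_PROCESS
    print(is_interactive(sample))          # True
    print(hex(clear_interactive(sample)))  # 0x10
```

Clearing the bit is the programmatic equivalent of unchecking the box. The rest of the type value is left alone.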

Microsoft schools us on this issue here and here.

And finally, #3:

*DCOM Error 10016*

This is the most interesting error of the three, in my opinion. It typically only happens after you start installing applications. The event tells you that you should go have a look at the Component Services admin tool (an MMC snap-in), but frankly a lot of admins don't know much about what the Component Services snap-in does or how it works. It can be somewhat daunting:

DCOM MMC

Well, we know the error is telling us that a particular account (such as SYSTEM, S-1-5-18) doesn't have some sort of access to some particular doodad in here. But how do we find that AppID GUID that it mentions? We could be idiots and right-click on every single node in this entire snap-in one at a time... or we could be smart and take our search to the registry. Look for HKEY_CLASSES_ROOT\AppID\{APPID-GUID}. That should tell you the name of the offending COM component. All you have to do now is go back to the Component Services snap-in, find that component by name, go to its security properties, and edit its security ACL so that whatever account the event log was bitching about is given whatever access it wanted. If you find that the security properties of the component are greyed out so that you can't edit them, that's probably because TrustedInstaller has that on lockdown. Go back to the registry, find the corresponding reg key, take ownership/give yourself permissions to it as necessary, restart the service (or reboot the OS), and then you will be able to modify the security settings on that COM component.
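
If you'd rather not copy GUIDs by hand, here is a minimal Python sketch that pulls the APPID out of the event text and builds the registry path to look up. The function name is hypothetical, the sample message is paraphrased from memory with made-up GUIDs, and I'm assuming the message contains the word APPID followed by a brace-wrapped GUID. It only builds the path; it doesn't touch the registry.

```python
import re

# Matches a brace-wrapped GUID such as {01234567-89AB-CDEF-0123-456789ABCDEF}.
GUID = r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}"


def appid_registry_key(event_message: str) -> str:
    """Return the HKEY_CLASSES_ROOT AppID key named in a DCOM 10016 message."""
    # Assumption: the message contains "APPID" followed by whitespace and the GUID.
    match = re.search(r"APPID\s+(" + GUID + ")", event_message, re.IGNORECASE)
    if match is None:
        raise ValueError("no APPID GUID found in event message")
    return "HKEY_CLASSES_ROOT\\AppID\\" + match.group(1).upper()


if __name__ == "__main__":
    # Illustrative message with made-up GUIDs, not copied from a real log.
    sample = (
        "The application-specific permission settings do not grant Local "
        "Activation permission for the COM Server application with CLSID "
        "{11111111-2222-3333-4444-555555555555} and APPID "
        "{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE} to the user "
        "NT AUTHORITY\\SYSTEM SID (S-1-5-18)."
    )
    print(appid_registry_key(sample))
```

Open the resulting key in regedit and its default value should give you the component's friendly name.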

I saw this myself just yesterday with the "SMS Agent" DCOM application.  The SMS (or SCCM) agent came preinstalled on the standard OS image that was being deployed to the machines I was working on.

So this has been the first in a series of me sharing one of my personal hobbies - making sure there are no application or system errors when Windows boots up.  If you have any boot or logon errors in the same spirit as those I've discussed here, feel free to drop me a line and I will feature your errors in a future post!

Azure Outage, The File Cabinet Blog, Etc.

Got a mixed bag this weekend... I'm still busier than usual at work, so my thoughts have been more scattered lately. I'll just start typing and we'll see where it takes us.

First, Windows Azure. I just put this very blog on Azure not two days ago, and then yesterday they suffered a massive, embarrassing, world-wide secure storage outage. I say embarrassing because it was caused by an expired SSL certificate. That's right - a world-wide outage lasting hours that could have been prevented by someone taking 5 minutes to renew a certificate.

But let's get our facts straight here - Windows Azure didn't completely go down. It was specifically their secure storage services that went down, which I heard also affected provisioning of new resources and left a lot of customers unhappy, namely those running large storage and SQL operations over SSL that relied on that certificate. The outage didn't affect any HTTP or non-SSL traffic, so this blog just sat here relatively unscathed. Of course, I feel for those who put enterprise workloads up on Azure and did get hurt by the outage, but Azure is certainly not the first public cloud service to suffer a large-scale outage, and it won't be the last. What makes this outage so particularly poignant is that it was so easily preventable with even the most basic of maintenance plans. Such as, I dunno, maybe a sticky note on someone's monitor reminding them to renew this immensely important SSL cert.

Here are a couple of screenshots, for the Schadenfreude:

Technet Forums

Azure Dashboard

Next thing on the list is pretty old news, but I never mentioned anything about it here. Ned Pyle, formerly of the Ask the Directory Services Team blog, has moved to Microsoft's home base and is now on the storage product team. He's still blogging though, on the File Cabinet blog. I don't know him well, but we have exchanged a few emails and blog comments about AD stuff and such. In my opinion he is up there with Russinovich as one of my favorite tech people on the internet, because his blogging is both entertaining and very educational. Lots of respect. Plus, check out the comments on his latest post here, where I correctly answer some AD arcana and get some validation from one of my heroes.

Now if I can just get the PFEs to make another one of those Microsoft Certified Master Active Directory posts, I'd be as happy as a civet eatin' coffee berries.

Now Powered By Windows Azure

I transferred this blog to Windows Azure today. Up until today, I've been hosting this blog from inside my home. While I can boast almost zero unplanned downtime even from my mostly consumer-grade hardware and residential internet connection, I felt it was time to hoist this blog up into a slightly more professional and resilient environment. That way I don't have to worry about backups and hardware failures on my own gear taking down this blog. And it gives me more flexibility in terms of being able to tear my home lab apart and rearrange it without having to affect this blog.

It was very easy to move the blog to the new VM on Azure, which is good, because I've been so busy at work lately that I've had time for little else... such as blogging. I had to sign up specially for the "preview" of VM hosting from Azure, as it is apparently still in the preview phase. You get about a 50% discount until it reaches General Availability. Anyway, the virtual machine and the portal have both worked perfectly so far, and I would not be surprised if it were really close to GA. The portal looks nice and polished, and it works well. It comes with a nice, basic resource monitor so you can see your VM's CPU, memory, network usage, etc. over time from the web portal. The price is pretty low. Definitely lower than some other providers. My external IP address won't change. Plus they can host Server 2012 VMs, which some other providers are still catching up to. I chose the absolute slowest, lowest-spec VM that they would give me, because of the price. So the machine is a little slower than it was running on my own gear, but it's still enough for this measly blog. After the VM was done being imaged, I loaded IIS and SMTP on it, simply dumped the whole inetpub directory straight from my home machine into the VM, configured SMTP (so comment emails can be sent from this blog to my Gmail account, etc.), and then turned the GUI off on the server with PowerShell and logged out. Piece of cake.

Azure actually offers a lot of different hosting capabilities - not just virtual machines. In fact I don't think Azure even realized that they would offer IaaS when they first set out. But I chose the VM option because I'm most comfortable managing the OS myself... learning how to set up Visual Studio to publish websites from my desktop straight into an Azure service is totally new to me and I haven't even begun learning how to do that yet.

Users.exe v1.0.0.3

In my last post, I showed how to get RDP/TS session information from a local or remote computer, including ClientName, on the command line using PowerShell. (But it's not really PowerShell. It's a PS wrapper around some C# that P/Invokes a native API. Which is pretty nuts, but it works.) Since I can't think of any other way to get this information on the command line, I thought it would be useful to convert the thing to native code. I named the program users.exe. It's written in C. Here is what it looks like:

C:\Users\joe\Desktop>users /?

USAGE: users.exe [hostname]

Not specifying a hostname implies localhost.
This command will return information about users currently logged onto
a local or remote computer, including the client's hostname and IP.
Users.exe was written by Ryan Ries.

C:\Users\joe\>users SERVER01

Session ID  : 0
Domain\User : System
Client Name : Local
Net Address : n/a
Conn. State : Disconnected

Session ID  : 1
Domain\User : System
Client Name : Local
Net Address : n/a
Conn. State : Connected

Session ID  : 25
Domain\User : DOMAIN\joe
Client Name : JOESLAPTOP
Net Address : 10.122.124.21 (AF_INET)
Conn. State : Active

I had a pretty rough time shaking the rust off of my native/unmanaged code skills, but I pulled it off.  That said, if you wanna give it a try, I would very much appreciate any bug reports/feature requests.

users.exe (63.50 kb)

An Interesting DFSR Change (that probably everyone knew about but me)

First, wooo I hit 5K on ServerFault today.

I'm embarrassed to say that something I read about recently but didn't pay enough attention to at the time officially just bit me in the butt.  

A significant change occurred in January 2012 in the way that DFS Replication behaves. Windows Server 2008 R2 SP1 post KB2663685 and Windows Server 2012 have changed the default behavior of DFSR. Auto-recovery of DFSR replicated folders after unexpected shutdown is now disabled. In other words, if a computer that hosts a DFSR replicated folder experiences an unexpected shutdown, DFSR will not automatically resume upon reboot. (This includes Sysvol!)

On older versions of Windows, DFSR auto-recovery was enabled. I'm sure the reason for this change involves auto-recovery leading to unexpected rollbacks and non-authoritative conflict resolutions between replication partners, especially in widespread domains with high end-to-end replication latency and frequent changes... but even though the news was published, I for one didn’t pay enough attention to it, and it has a very real effect on the way we manage our Windows systems that utilize DFSR going forward.

So what if a domain controller or a file server with a DFSR share on it running 2008R2 or 2012 crashes unexpectedly, leaving the DFS database and the NTFS USN journal out of sync?  Then Sysvol no longer receives updates on that DC.  The DFSR file share no longer receives updates on that file server.  It's up to you to manually restart replication, and to resolve any conflicts with replication partners if changes took place during the time that the crashed server wasn't replicating. 

Luckily that is easy to do, and it's also possible to set the behavior back to auto-recovery if that is what you wish. 

How will I know if this affects my server?

While this first example is just a symptom of the problem, here is how it first came to my attention, triggering the investigation:

(Click on images for a better view.)

Errors in my event log

Application of Group Policy was failing, but only on DC02 and on servers that were using DC02 as their domain controller, not on DC01 or any server that had logged on through DC01. As it turns out, the GPO referenced by that error event, a new GPO that I had just created on DC01, didn’t exist on DC02, hence the errors. Sysvol did not seem to be replicating anymore.

Here is the actual event log event to let you know that DFS Replication has stopped on one or more volumes:

DFSR Error

Luckily, starting replication back up again is easy, and the command to do it, complete with your actual GUID, is right there in the event:

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="12345678-ABCD-1234-EFGH-1A2B3C4E5F" call ResumeReplication
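
The GUID shown above is just a placeholder. If you have to do this for more than one volume, a small helper can stamp out the command for each GUID copied from the events. This is only a sketch in Python: the function name is mine, it builds the command string without running it, and it relies on uuid.UUID to reject anything that isn't a well-formed GUID.

```python
import uuid


def resume_replication_command(volume_guid: str) -> str:
    """Build the wmic command that resumes DFSR on one volume."""
    # uuid.UUID raises ValueError on a malformed GUID and accepts optional
    # braces; the result is upper-cased only to match how the event shows it.
    guid = str(uuid.UUID(volume_guid)).upper()
    return (
        r"wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig "
        f'where volumeGuid="{guid}" call ResumeReplication'
    )


if __name__ == "__main__":
    # Made-up GUID for illustration; use the one from your own event log.
    print(resume_replication_command("{01234567-89ab-cdef-0123-456789abcdef}"))
```

Paste the printed line into an elevated command prompt on the affected server.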

You can also turn auto-recovery back on with wmic or by modifying the registry if you don’t have the time to be bothered by this:

HKLM\System\CurrentControlSet\Services\DFSR\Parameters\StopReplicationOnAutoRecovery = 0

Just be aware that auto-recovery can lead to unwanted rollbacks of DFSR data in some circumstances.