Mystery of the Performance Counters with Negative Denominators!

So I finally solved a bit of a mystery, after wondering about it off and on for about six months.

A lot of us out there use some sort of monitoring program like SCOM that gathers performance counter data from all of the servers in our environment and uses that data to produce messages such as:

WARNING: Hey man, the CPU usage on SERVER01 is really high and has been for a while, so, you might wanna' go check that out.

But every once in a while, if for some reason the performance counters needed are unavailable, your monitoring application might say something like this:

ERROR: Unable to get performance counter data for SERVER01.

Well, that happens to me somewhat frequently. Sometimes it's an error in the monitoring application, and sometimes the agent machine needs a lodctr /R or winmgmt /resyncperf or something like that. But every once in a while it's more difficult to solve than that. I logged on to SERVER01 to see if I could simply add the needed performance counters to Perfmon. The objects and instances were there, but one of them looked odd. Then I decided to use typeperf.exe to get some data from the counters:

 

[Screenshot: Perfmon error]

[Screenshot: typeperf error]

A counter with a negative denominator value was detected? Contrary to the screenshot above, the -1s were intermittent: they would change to normal values for a few seconds and then go back to -1. It was as if the scale of that counter was offset. Whenever real CPU load occurred, the counter would rise to positive values, but as CPU usage dropped back toward zero, the counter would fall below zero again.

Hmm... if you search for something like "missing performance counters" or "a counter with a negative denominator value," you'll get plenty of good links, since missing or corrupt performance counters are a fairly common issue. Such as this, or this. Notice that the last link makes a vague reference to the problem being caused by "intermittent timing issues in the performance data handlers for many counters." The article, however, offers no advice or anything more concrete than that. lodctr /R, winmgmt /resyncperf, and all the usual stuff did nothing to help.
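Because the bad samples come and go, it helps to capture typeperf output to a file for a while and then scan it, rather than watching the console. Here is a minimal Python sketch of that idea. It assumes the capture is in typeperf's default CSV layout (a timestamp column followed by one value column); the function name and the sample rows are made up for illustration and are not real data from SERVER01.

```python
import csv
import io


def find_negative_samples(csv_text):
    """Return (timestamp, value) pairs whose counter value is below zero.

    Assumes each data row looks like: "timestamp","value".
    Header rows, blank lines and status messages are skipped.
    """
    bad = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # blank line or a one-column status message
        try:
            value = float(row[1])
        except ValueError:
            continue  # header row or an empty sample
        if value < 0:
            bad.append((row[0], value))
    return bad


# Hypothetical capture, for illustration only.
SAMPLE = r'''"(PDH-CSV 4.0)","\\SERVER01\Processor(_Total)\% Processor Time"
"01/01/2012 10:00:00.000","-1.000000"
"01/01/2012 10:00:01.000","12.500000"
"01/01/2012 10:00:02.000","-1.000000"
"01/01/2012 10:00:03.000","3.250000"
'''

if __name__ == "__main__":
    for timestamp, value in find_negative_samples(SAMPLE):
        print(timestamp, value)
```

Counting how many samples come back negative over a few minutes is a quick way to tell a counter that is flapping from one that is simply missing.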

So I sat back and thought some more about the scope of the problem. I have only seen this issue two or three times in my career. The two cases I could remember were both on VMware virtual machines. One was running Windows Server 2003, and a couple of months later I saw it again on a Windows Server 2008 R2 VM. That phrase from the KB article stuck with me... "intermittent timing issues," and that made me think of hardware timing issues. But since these were VMs, they didn't exactly have hardware per se... so maybe it had something to do with the way the virtual machines communicate with their physical host. I was so close to figuring out the answer, but at the time I shrugged it off and shelved the issue in my mind for a while.

Then about a week ago the issue popped up again. Long story short, it turns out it was this all along:

[Screenshot: VMware Tools]

Unchecking the "Time synchronization between the virtual machine and the ESX server" option in the VMware Tools app on the VM made the problem go away. Ugh. The answer had been so simple all along, yet it eluded me for months.
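One plausible reading of why time synchronization matters, offered as a guess rather than anything the KB article or VMware confirms: a counter like % Processor Time is calculated from two samples as 100 * (1 - idle_delta / time_delta). If the guest clock gets stepped backward between the two samples, time_delta goes negative and the calculation has a negative denominator. The Python sketch below just walks through that arithmetic. The function name and the tick values are hypothetical, and it is not how PDH is actually implemented.

```python
def percent_busy(idle0, idle1, t0, t1):
    """Percent busy from two samples of an inverse timer counter.

    idle0, idle1: idle time at each sample, in 100 ns ticks.
    t0, t1: timestamp of each sample, in 100 ns ticks.
    """
    time_delta = t1 - t0
    if time_delta <= 0:
        # The second timestamp is not later than the first, so the
        # denominator is unusable (e.g. the clock was stepped back).
        raise ValueError("non-positive denominator: timestamp did not advance")
    return 100.0 * (1.0 - (idle1 - idle0) / time_delta)


if __name__ == "__main__":
    # Normal case: one second elapsed, CPU idle for 0.75 s of it.
    print(percent_busy(0, 7_500_000, 10_000_000, 20_000_000))

    # Hypothetical bad case: the clock moved backward between samples.
    try:
        percent_busy(0, 7_500_000, 10_000_000, 9_000_000)
    except ValueError as err:
        print("bad sample:", err)
```

That would also fit the intermittent behavior: only the samples that straddle a clock adjustment would be affected, and the ones in between would look normal.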

I feel better now that I at least have a solution for that particular scenario.  Now... back to Hyper-V.

Comments (4) -

We ran into this exact same problem.  I would have never come to this conclusion without this article.   Thanks so much

Awesome, I'm glad to help! And thank you for commenting.

Please, how do I access this properties dialog box? Thanks

First thank you, since 85% or more of our environment is virtual this will no doubt be helpful in the future, however I'm seeing the same error on a physical machine with no VM Interaction whatsoever. Any clues?
