Metal Whiskers? In *My* Datacenter?

Metal Whiskers *This image copyright NASA*

Metal whiskers, also referred to as tin whiskers or zinc whiskers, are something that I've read about with some curiosity before. Science does not currently fully understand why many metals and alloys form these tiny whiskers over time. The phenomenon has been known since the early 20th century, but we still don't know much about why it happens or how to effectively stop it.  It is still being studied today, and you can find metal whiskers in the news being blamed for things like fires aboard aircraft.  Obviously, these whiskers can wreak havoc in an electronic system whose components are packed tightly together. These whiskers can grow out of the solder used to manufacture electronic equipment, and they can also grow out of other non-electrified pieces of metal like server rack rails and the metal parts of datacenter raised flooring tiles. These tiny little metal whiskers can then be shaken loose or scraped off by such actions as lifting the floor tile and sliding it across the surface of an adjacent tile, then blown into the air by the datacenter ventilation system, and then subsequently sucked into the power supplies of the computers housed within the datacenter.  Resulting short-circuits can cause electronic component failure, and even fire.

This question was asked today by someone on ServerFault, which rekindled my interest in the subject.  I also recommend reading the Wikipedia article on it.  And I highly recommend visiting this NASA page - an entire page devoted to the phenomenon of metal whiskers.

If you watch or read just one thing from that page, I recommend this video, which is specifically about the damage metal whiskers can do in a datacenter environment.

The "CPU Steal Time" Metric in Unix/Linux Virtual Machines and a Windows Counterpart

I haven't posted in a while; I've been busy both studying for Windows Server 2012 stuff and preparing for a possible slight career shift.  But I do want to put this up here, because it's one of my answers to a ServerFault question that I'm a little proud of.  Nevertheless, it's a deep enough topic that I expect someone who knows more about it than I do to come along and correct me.  Which I welcome.  That's how science works.  I'm not learning if I'm not wrong.

Here was the question:

In order to assess performance monitoring accuracy on virtualization platforms, the CPU steal time has become an increasingly relevant metric - see EC2 monitoring: the case of stolen CPU for an instructive summary in the context of Amazon EC2 and IBM's paper on CPU time accounting for a more in-depth technical explanation (including illustrations) of the concept:

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.

Accordingly, it is exposed in most related Unix/Linux monitoring tools nowadays - see e.g. columns %steal or st in sar or top:

st -- Steal Time
The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

I've been unable to figure out how to capture the same metric on Windows though, is this possible already? (Ideally for the Windows 2008 Server R2 AMIs on EC2 and via a respective Windows Performance Counters of course.)

And here was my answer:

Let me preface by saying that I am coming from the point of view of Hyper-V as a virtualization platform because that is where I have the most experience. Even though there may be certain tenets of virtualization, as we know it, that cannot be deviated from, Microsoft and VMware and Xen all have different strategies for how they design their hypervisors.

That's the first thing that makes your question challenging. You pose your question as if it were hypervisor-agnostic, when in truth it is not. Amazon EC2, for example, uses the Xen hypervisor, and the "CPU Steal Time" metric that you see in the output of a top command issued from within a Linux VM running on that hypervisor is a result of the integration services installed on that guest OS (or virtualization-aware tools on the guest) in conjunction with data provided by that specific hypervisor.
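To make the Linux side of this concrete, here is a minimal sketch of how a steal percentage could be derived inside a Linux guest. It assumes the aggregate `cpu` line of `/proc/stat` carries the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, then the guest fields), and the function names are my own, made up for illustration. It is not how top or sar are actually implemented, just the same basic idea: sample twice and look at what share of the elapsed ticks landed in the steal column.

```python
import time


def parse_cpu_line(line):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks reported as steal between two samples.

    Assumed field order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is already included in user/nice, so only the first eight
    fields are summed for the total.
    """
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(old, new)]
    if len(deltas) < 8:
        return 0.0  # this kernel does not report a steal column
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux guests only)."""
    with open(path) as handle:
        return handle.readline()


def sample_steal_percent(interval=1.0):
    """Take two samples `interval` seconds apart and return the steal percentage."""
    before = read_cpu_line()
    time.sleep(interval)
    return steal_percent(before, read_cpu_line())
```

On a Linux guest, calling `sample_steal_percent()` returns a number comparable to the st column. The point for this discussion is that the steal field only holds something meaningful because the hypervisor chooses to report it to the guest kernel.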

First off let me just answer your question straight up: There is no way to see from inside a virtual machine running Windows how much time the processors belonging to the physical machine on which the hypervisor runs spend doing other things, unless the particular virtual tools/services or virtualization-aware tools for your particular hypervisor are installed in the guest VM and the particular hypervisor on which the guest is running exposes that data. Even a Windows guest running on a Hyper-V hypervisor will not have immediate access to information about how much time the physical processors on the host spent doing other things. (To quote voretaq7, something that "breaks the fourth wall.") Even though Windows client and server operating systems running as virtualized guests in Hyper-V with the correct integration services/tools installed make use of "enlightenments" (which are literally kernel code alterations made especially for VMs) that significantly increase their performance in using the resources of a physical host, the bottom line is that the hypervisor does not have to give any more information to the guest OS than it wants to. That means the hypervisor does not have to tell a guest VM what else it is doing besides servicing that VM... unless it wants to. And that information about what else the physical processors are doing is necessary for deriving a metric from the perspective of the VM such as "CPU Steal Time: the percentage of time the vCPU waits for a physical CPU."

How could the guest OS know that, if it didn't even realize that it was actually virtualized? It's like The Truman Show... for computers.

In other words, without the right integration tools installed on the guest, the guest OS won't even know that its CPU is actually a *v*CPU. It won't even know that there is another force outside of itself "stealing" CPU cycles from it, therefore that metric will not exist on the guest VM.

That's why I don't even like the phrase "CPU Steal Time." The word steal just puts everybody in the wrong frame of mind from the get-go.

A hypervisor such as Hyper-V does not give guests direct access to physical resources such as physical processors or processor cores. Instead the hypervisor gives them vDevs - virtual devices - such as vCPUs.

A prime example of why: Say a virtual machine guest OS makes the call to flush the TLB (translation look-aside buffer), which is a physical component of a physical CPU. If the guest OS were allowed to clear the entire TLB on a physical processor, that would have negative performance effects for all the other VMs that were also sharing that same physical TLB. In the case of Windows, that call in the guest OS is translated into a "hypercall" or "enlightened" call which is interpreted by the hypervisor so that only the section of the TLB that is relevant to that virtual machine is flushed.


(Interestingly, that hints to me that guest VMs that do not have the proper integration tools and/or services could have the ability to impact the performance of all the other VMs on the same host, but that is completely outside the scope of this topic.)


All that to say that you can still detect on a Hyper-V host the time that a virtual processor spent waiting for a real processor to become available so that it could be scheduled to run. But you can only see that data on a Windows Hyper-V hypervisor. If it is possible to see this in other hypervisors, I urge others to tell us how to see it in that hypervisor, and also whether that data is exposed to the guest OS.

My test machine was Hyper-V Server 2012, which is the free edition of Server 2012 that only runs Core and the Hyper-V role. It's effectively the same as any Windows Server 2012 running Hyper-V.

Fire up Perfmon on your parent partition, aka physical host. Load this counter:

Hyper-V Hypervisor Virtual Processor\CPU Wait Time Per Dispatch\* 

You will notice that there will be an instance of that counter for each virtual processor of each virtual machine on that hypervisor, as well as _Total. The Microsoft definition of that Perfmon counter is:

The average time (in nanoseconds) spent waiting for a virtual processor to be dispatched onto a logical processor.

Obviously, you want that number to be as low as possible. For computers, waiting is almost never a good thing.

Other performance counters on the hypervisor that you will want to investigate are Hyper-V Hypervisor Root Virtual Processor\% Guest Run Time, % Hypervisor Run Time, and % Total Run Time. These counters provide you with the percentages that could be used to determine facts such as how much time the "real" processors spend doing things other than servicing a VM or all VMs.
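If you would rather collect these numbers than stare at Perfmon, here is a rough sketch of averaging the wait-time counter per instance from CSV text. It assumes the input looks like the PDH-CSV output that the typeperf command writes (a header row of full counter paths, then one row per sample); the function name and the instance names in the usage note are hypothetical, so treat it as an illustration rather than a finished collector.

```python
import csv
import io
import re
from statistics import mean

# Pulls the instance name out of a counter path shaped like
# \\HOST\Object(instance)\Counter
_INSTANCE = re.compile(r"\((?P<instance>.*)\)\\[^\\]+$")


def average_per_instance(csv_text):
    """Average every counter column in PDH-CSV style text, keyed by instance.

    The first column is assumed to be the timestamp. Rows that are too short
    (for example trailing status messages) and blank cells are skipped.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return {}
    header, samples = rows[0], rows[1:]
    averages = {}
    for col in range(1, len(header)):
        match = _INSTANCE.search(header[col])
        name = match.group("instance") if match else header[col]
        values = [
            float(row[col])
            for row in samples
            if len(row) > col and row[col].strip()
        ]
        if values:
            averages[name] = mean(values)
    return averages
```

The input could come from something like `typeperf "\Hyper-V Hypervisor Virtual Processor(*)\CPU Wait Time Per Dispatch" -sc 10` run on the parent partition and redirected to a file. For that counter the resulting averages are in nanoseconds, so lower is better.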

So in conclusion, the metric that you are looking for in a guest virtual machine depends on the hypervisor that it is running on, whether that hypervisor chooses to provide the data about how it spends its time other than servicing that VM, and whether the guest OS has the right virtualization integration tools/services/drivers to be aware enough to realize that the hypervisor is making that data available.

I know of no way on a Windows guest, integration tools installed or not, to see how much time, in terms of seconds or percentage, that VM's host has spent servicing it or not servicing it relative to the total physical processor time.

The Logitech G9x Mouse

g9x

I need to write a new post - it's been too long!

So I got a new mouse a few weeks ago. My old trusty Microsoft Basic Optical Mouse was still chugging along just fine after a couple years of abuse, but the buttons were getting a little loose. So I decided it was time for an upgrade. After a little research, I settled on the Logitech G9x Laser Gaming mouse. Now, along with my Das Keyboard, I have an embarrassing amount of money invested into my input devices. I'll try to hit on most of the pros and cons.

I have to say, I'm pretty impressed with it. It has two interchangeable shells to better fit your hand. I've been using the fatter of the two. I feel like I have too much of a "claw" grip on the mouse when I use the smaller shell, which ends up tiring my hand, but on the other hand (no pun intended), the left outer edge of the bigger shell sort of scrapes on my mousepad as I pick up the mouse at an angle to drag it back over to the left. (You know, when you've veered too far to the right on your mousepad and you need to pick the mouse up and bring it back to center.) The effect is not terrible, but it's something that my old mouse didn't do. I just need to train myself to pick my mouse up at a flatter angle when I need to move it around the mousepad.

The two buttons on the side are something else I was not accustomed to, but I've already grown fond of them. I always used to steer clear of any mouse that had extra buttons. I used to just want a left button, a right button, and a mousewheel and that's it. I just knew that I would always be accidentally hitting any extra buttons. Well, it just takes a little getting used to and you learn to like them. I very rarely hit them by accident. They are amazingly handy for web browsing, as I use that back button a lot. I already can't believe that I used to drag my pointer up to the browser's back button every time I wanted to move back a page. That said, going back in my web browser is pretty much all I've used the side buttons for so far.

The scroll wheel is probably my least favorite thing about this mouse. It does have a hardware toggle button on the bottom of the mouse that lets you choose whether the wheel scrolls smoothly or gives you that "bump bump bump" feeling as you scroll it. Well, I'm one of those people that definitely needs the bumps/tactile feedback, so you click this little "microgear" button on the bottom of the mouse and there it is. However, the middle mouse button (pushing down on the mouse wheel) is extremely stiff. In fact it takes so much force to depress the middle mouse button that you can pretty much forget about precisely pointing at something small on the screen and middle-clicking it without the wheel or the entire mouse moving, or both. I don't know if disengaging the microgear would make that any better. It doesn't matter, because I can't have my scroll wheel being all loosey goosey anyway. Maybe it'll loosen up over time, as my keyboard has. Also, the wheel tilts left and right too... see those little arrows on the sides of the mousewheel? Yeah, that sucks. Ironically it's not hard at all to accidentally actuate the left or right action. I have no use for that. I do accidentally hit those all the time, which interrupts the middle-click scroll if you're one of those people that like to middle-click on a page and then pull the pointer down to scroll down the page. Luckily, with the configuration software I was able to map those left and right actions to do nothing... which leads me to:

The software. I'm a minimalist, at least when it comes to my computer. I'm one of those people who almost never has icons on their desktop. That means I do not want to install more software on my computer and have another useless system tray icon sitting down there just to be able to configure my bloody mouse. However, there are some things that you can only do with that proprietary software, such as changing the LED colors (I changed mine from red to blue to match my keyboard and monitor) and re-mapping all those non-standard buttons. But luckily, all your configuration changes are saved inside the mouse, so once you've got it set up how you like it, you can uninstall the software for good. Even unplugging the mouse and using it on a different computer doesn't reset the custom settings. Furthermore, if you have internet access, the mouse's basic functionality is plug and play as Windows can automatically download a G9x driver from Windows Update.

The mouse also comes with this tin full of 4-gram and 7-gram weights. There is a slide-out tray inside the mouse that you can actually fit these weights into in various configurations to precisely give your mouse the weight that you desire. Now maybe I'm just not pro enough to really realize the benefit of this, but it just doesn't really make much difference to me. My hand seems to be able to adjust just fine to whatever weight the mouse is. However, I could see how weights could mitigate that overcompensation you get in games when you try to react quickly.  The jury is still out on this feature.

Now, I want to talk about the thing that really makes the mouse awesome. It's the DPI switch just under the left mouse button. Having the ability to increase and decrease the sensitivity on the fly has basically added a whole new dimension for me, particularly in certain games.  Imagine a game of Battlefield 3 where you can slow your mouse down to the precision of a surgeon's scalpel when scoped in, and then bring it back up to speed when you zoom out and go back to running around, all without ever taking your eyes off the game. It's pretty amazing. You may not all agree with all the design decisions employed by this mouse, but all mouse manufacturers should take note of this great feature.

Cisco UCS

Let's talk about Cisco UCS - Unified Computing System.

I help stand up new IT infrastructure all over the world, and I have been seeing a lot more of these lately. It's a pretty impressive system. In most small to mid-size shops you tend to see an onsite server closet or maybe a small cage in a datacenter full of HP ProLiants and Dell PowerEdges that are 2 or 3 generations old. But for the largest scale enterprise operations, nothing beats the density and manageability of blades. (And their ability to lock you in with a particular vendor. ;)) Blade systems essentially do for hardware what hypervisors did for operating systems. Not only are you packing more into less and increasing your compute density, you're centralizing the management of your entire datacenter and simplifying the deployment process by orders of magnitude. What do I mean by that? Well, have some pictures worth a thousand words:

(Try right-clicking the images and opening in a new tab and you might get a better view.)

Turn this...

... into this.

(The above image is courtesy of dalgeek - knightfoo.wordpress.com.)

Turn this...

... into this.

(I took that picture on the left myself a couple years ago, from a place I used to work.)

Now, when we talk about Cisco UCS, we're actually talking about a few discrete components that come together to form the UCS. First, we have the fabric interconnect. We'll use the Cisco UCS 6120XP 20-Port Fabric Interconnect as an example.

6120xp

It's a specialized 1U 10Gb (ten gigabit) switch that supports up to 160 servers or 20 chassis as a single, seamless system. (And remember each "server" can have dozens of VMs on it.) This particular switch is capable of 520Gbps of throughput. (I keep feeling like I'm making typos when I type numbers that large.)

The next piece is the blade chassis itself. Take the Cisco UCS 5108 Blade Server Chassis for example. This thing is 6 rack units, making the entire solution so far 7U for what could potentially house hundreds of VMs. Those smaller ports on the bottom of the chassis are for power supplies. Note that you can cram either half-width blades or full-width blades into this chassis. A full-width blade would look a little more like the traditional pizza box that we're used to, and has room for more stuff in it obviously, but I think the extra agility offered by half-width blades is probably the reason that they're the only ones I really see out in the wild.

Cisco UCS 5100 series

And lastly we have the blades themselves. Take the Cisco UCS B200 M2, a half-width blade, for example:

UCS Guts *Why yes, that is 192GB of DDR3 RAM, thanks for noticing*

And here's a little artist's depiction of what an entirely fleshed out "Unified Computing System" would look like. Note that you'd probably want some SAN storage somewhere for this to be considered a complete solution, beyond just the couple of disks that you can stick into each blade. I wonder how much storage you could get up there in the top 4 to 6 U of each cabinet...

racks of ucses

Hardware characteristics such as MAC addresses are configured at the chassis slot level, so if a blade fails you can swap in a new blade and not have to reconfigure anything. You can also do things like automatically reboot a host onto another blade if one fails, etc.

And lastly - it is all managed from a single web interface. (I hope you like Java.)

So that all looks pretty amazing, right? There may be a couple of cons to going with Cisco UCS however, and there are alternative blade systems to consider as well. You just have to weigh these pros and cons for yourself and your enterprise's situation. One of these possible cons is cost. The old adage goes that nobody in IT ever got in trouble for buying Cisco. They do make great stuff, but they also make practically the most expensive equipment in existence. Exact pricing is complicated and of course depends on exactly how you configure your equipment, but list price is somewhere in the ballpark of $20,000 per blade. Don't worry though, no one pays list price. Especially if you were to make a huge order like this, Cisco would be expected to have their discount pen at the ready. $10,000 - $12,000 per blade might be a more realistic figure. I count 288 blades in the picture above, putting your budgetary needs at somewhere around $2.9 - $3.46 million USD. (And we still don't have storage or networking yet... but you are well on your way to having one of the densest datacenters in the world.)
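For anyone who wants to check my napkin math, here is a small sketch of the arithmetic. The per-blade prices are just the rough discounted figures from above, and the chassis math assumes every chassis is filled with half-width blades, using the 160 servers / 20 chassis limit quoted earlier for the fabric interconnect. The function names are made up for this example.

```python
import math


def blade_budget(blades, low_per_blade, high_per_blade):
    """Return the (low, high) total cost in dollars for a number of blades."""
    return blades * low_per_blade, blades * high_per_blade


def chassis_and_domains(blades, blades_per_chassis, chassis_per_domain):
    """Return how many chassis and how many management domains are needed."""
    chassis = math.ceil(blades / blades_per_chassis)
    domains = math.ceil(chassis / chassis_per_domain)
    return chassis, domains


if __name__ == "__main__":
    low, high = blade_budget(288, 10_000, 12_000)
    print(f"${low:,} - ${high:,}")  # prints "$2,880,000 - $3,456,000"

    # 160 servers across 20 chassis works out to 8 half-width blades each.
    chassis, domains = chassis_and_domains(288, 160 // 20, 20)
    print(chassis, "chassis,", domains, "management domains")
    # prints "36 chassis, 2 management domains"
```

So under those assumptions the 288 blades in the picture need 36 chassis, which is more than one pair of interconnects can manage as a single system.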

In the types of environments that I'm most used to, I see one to two UCS chassis per datacenter, each with two clustered interconnects for redundancy. In contrast, you might decide to go with HP c7000 blade enclosures filled with BL465c's. I see some of these as well, especially as things like cloud technologies cause aggressive IT expansion and thus the need to do more with your budget. You would almost certainly save a substantial amount of cash if you did go with HP or Dell; however, I think Cisco still has a very compelling price-per-blade as the solution scales out extremely well, and you only pay for the management stack once. (Or twice if you're really reaching for the stars like we did above.)

So in conclusion, I'll just leave you with a couple last things. Here is Cisco's UCS In A Nutshell documentation if I've whetted your appetite and you want more information. And here is a Cisco UCS emulator, if you'd like to play around with what it feels like to administer one of these things. And lastly, here are some tutorials to go along with that emulator software.

'Til next time!

Das Keyboard

I've been using computers for a long time, but I never really put any thought into the keyboard I used. I had never used a keyboard that was worth more than about $10, because what's the point, right?

Well I was hanging out in the ##/r/sysadmin IRC chatroom when one of the keys on my latest $10 keyboard started sticking, and it occurred to me that instead of just turning the keyboard upside down and shaking it, hoping it was just a stray Dorito flake stuck in there again, it might be time to finally get a serious keyboard. Luckily there was no shortage of opinions on "good" keyboards there in the chatroom, and one in particular really caught my eye: the Das Keyboard.

It's a mechanical keyboard, and I had heard the buzz about mechanical keyboards - that the tactile feel, durability and key actuation were unparalleled, but on the other hand they were loud. (*CLACKCLACKCLACK*) So with that in mind, I ended up buying the Das Keyboard Model S Ultimate Silent, understanding that it was slightly silent-er than the regular version, but was still pretty loud.

UPS dropped it off a couple days later. I've been using it for about a week now. So without any further ado, let me just get down to exactly what I really love and don't love about this keyboard:

Pros:

  • Very heavy. This keyboard will stay where you put it. Feels solid and high quality. The keys don't rattle around in their sockets.
  • The blue LEDs for caps lock, num lock, and scroll lock are pleasantly dim and not distractingly bright. Really bright LEDs annoy me.
  • No key inscriptions. You might think that it is too hipster and/or nerdy to have a blank keyboard. But I'm truly urging you to try it. Even if you've been using PCs for 30 years, you don't realize how often you unknowingly glance at the keyboard as a subconscious crutch, and having a blank keyboard will increase your touch-typing skill and typing speed. (Can you reliably hit the % character or { character just by touch? Are you confident enough to hit the & character without looking and losing your stride mid-sentence? You will be after about a week where looking doesn't help.) Plus it'll intimidate people to stay off your computer if they don't know what they're doing.
  • Built-in USB ports so you can plug another USB keyboard into your keyboard once you realize that you can't type without letters on the keys.
  • Long, thick USB cord, with additional plug for a USB hub.
  • Feels great to type on. The keys are lightly textured and they spring back extremely quickly after you press them - which is what makes extremely fast typing possible on this keyboard. There's no mushiness or uncertainty in your key presses. There is nothing like a mechanical keyboard for both fast typing and video games. Those Korean kids that play Starcraft tournaments and register 42 billion keystrokes per minute? Mechanical keyboards are how they do that.
  • The company is based in my home state of Texas!
  • N-key rollover (push as many keys at once as you want, the computer will register them all. You rarely find this outside of mechanical keyboards.)
  • Tactile feedback and not having to completely depress each key to get it to register. This means you can move on to the next key faster.

Cons:

  • It's loud. Even though the "silent" version is slightly less so, it's still the loudest keyboard I've ever owned. That said, you may grow fond of the satisfying sound of typing. If I were a film director, I would use this keyboard as a prop for when I want the audience to know that someone is typing.
  • It's expensive. However, take into account that this extremely heavy-duty keyboard will last for years, and is arguably the best keyboard that money can buy.
  • I am experiencing a little bit of stickiness on my left shift key when I strike it near its right edge with my left pinky finger. If I strike it more towards the center of the key it works perfectly.  I've only used the keyboard for about a week though, so I'm hoping that the left shift key either smooths out a little over time, or I subconsciously train my pinky finger to stretch out a little further during typing to strike the shift key more towards its center.  The backspace key is also a bit squeaky.  The rest of the keys are perfect.

So all in all, I am quite pleased with my purchase.  This is by far the best keyboard I've ever laid hands on. The transition to no letters on the keys was much smoother than I thought and I never once found myself missing them. Furthermore, my typing speed truly has increased already.
 
One last interesting thing is that when using this keyboard, it tricks my mind via optical illusion, as sometimes I could swear I can see faint letter inscriptions on the keys because my mind is expecting to see letters on the keys. Spooky!
 
edit: I am happy to report that the left shift key did in fact smooth out over time with a couple weeks of use.  In fact, all of the keys are getting a little smoother and becoming more perfectly tuned to my fingers. My love of this keyboard is still growing.