New-DepthGaugeFile.ps1: The Powershell Pipeline Is Neat, But It's Also Slow

If you know me or this blog at all, you know that first and foremost, I think Powershell is awesome.  It is essential to any Windows system administrator or I.T. pro's success.  The good news is that there are a dozen ways to accomplish any given task in Powershell.  The bad news is that eleven of those twelve techniques are typically as slow as a three-legged tortoise swimming through a vat of cold Aunt Jemima.

This is not the first, or the second, blog post I've made about Powershell performance pitfalls.  One of the fundamental problems with super-high-level languages such as Powershell, C#, Java, etc., is that they take raw performance away from you and give you abstractions in return, and you then have to fight the language in order to get your performance back.

Today I ran across another such example.

I'm creating a program that reads from files.  I need to generate a file that has "markers" in it that tell me where I am within that file at a glance.  I'll call this a "depth gauge."  I figured Powershell would be a simple way to create such a file.  Here is an example of what I'm talking about:

Depth Gauge or Yardstick File

The idea being that I'd be able to tell my program "show me what's at byte 0xFFFF of file.txt," and I'd be able to easily visually verify the answer because of the byte markers in the text file.  The random characters after the byte markers are just gibberish to take up space.  In the above example, each line takes up exactly 64 bytes - 62 printable characters plus \r\n.  (In ASCII.)

I reach for Powershell whenever I want to whip up something in 5 minutes that accomplishes a simple task.  And voila:

Function New-DepthGaugeFile
          [ValidateScript({Test-Path $_ -IsValid})]
          [ValidateRange(20, [Int32]::MaxValue)]
    Set-StrictMode -Version Latest
    [Int64]$CurrentPosition = 0
    $ValidChars = @('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
        New-Item -Path $FilePath -ItemType File -Force -ErrorAction Stop | Out-Null
        Write-Error $_
    [Int32]$BufferMaxLines = 64
    [Int32]$BufferMaxBytes = $BufferMaxLines * $BytesPerLine

    If ($DesiredSize % $BytesPerLine -NE 0)
        Write-Warning 'BytesPerLine is not a multiple of DesiredSize.'

    $LineBuffer = New-Object 'System.Collections.Generic.List[System.String]'
    While ($DesiredSize -GT $CurrentPosition)
        [System.Text.StringBuilder]$Line = New-Object System.Text.StringBuilder
        # 17 bytes
        [Void]$Line.Append($("{0:X16} " -F $CurrentPosition))
        # X bytes
        1..$($BytesPerLine - 19) | ForEach-Object { [Void]$Line.Append($(Get-Random -InputObject $ValidChars)) }        
        # +2 bytes (`r`n)        

        $CurrentPosition += $BytesPerLine

        # If we're getting close to the end of the file, we'll go line by line.
        If ($CurrentPosition -GE ($DesiredSize - $BufferMaxBytes))
            Add-Content $FilePath -Value $LineBuffer -Encoding Ascii

        # If the buffer's full, and we still have more than a full buffer's worth left to write, then dump the
        # full buffer into the file now.
        If (($LineBuffer.Count -GE $BufferMaxLines) -And ($CurrentPosition -LT ($DesiredSize - $BufferMaxBytes)))
            Add-Content $FilePath -Value $LineBuffer -Encoding Ascii

Now I can create a dummy file of any size and dimension with a command like this:

New-DepthGaugeFile -FilePath 'C:\testfolder1\largefile2.log' `
                   -DesiredSize 128KB -BytesPerLine 64

I thought I was being really clever by creating an internal "buffering" system, since I instinctively knew that performing a file write operation (Add-Content) on each and every loop iteration would slow me down.  I also knew from past experience that overuse of "array arithmetic" like $Array += $Element would slow me down because of the constant cloning and resizing of the array.  I also remembered that in .NET, strongly-typed lists are faster than ArrayLists because we want to avoid boxing and unboxing.

Despite all these little optimizations, here is the performance while writing a 1MB file:

Measure-Command { New-DepthGaugeFile -FilePath C:\testfolder1\largefile.log `
                                     -DesiredSize 1MB -BytesPerLine 128 }

TotalSeconds    : 103.8428624

Over 100 seconds to generate 1 megabyte of data.  I'm running on an SSD that is capable of copying hundreds of megabytes per second of data, so the storage is not the bottleneck.

To try to speed things up, I decided to focus on the line that appears to be doing the most amount of work:

1..$($BytesPerLine - 19) | ForEach-Object { 
             [Void]$Line.Append($(Get-Random -InputObject $ValidChars)) }

The idea behind this line of code is that we add some random characters to each line.  If we want each line to take up 64 characters, then we would add (64 - 19) characters to the line, because the byte marker at the beginning of the line, plus a space character, takes up 17 bytes.  Then then the newline and carriage return takes up 2 bytes.

My first instinct was that the Get-Random action was taking all the CPU cycles.  So I replaced it with static characters... and it made virtually no difference.  Maybe it's the pipe and Foreach-Object?  Let's change it to this:

For ($X = 0; $X -LT ($BytesPerLine - 19); $X++)
    [Void]$Line.Append($(Get-Random -InputObject $ValidChars))

And now the results:

Measure-Command { New-DepthGaugeFile -FilePath C:\testfolder1\largefile.log `
                                     -DesiredSize 1MB -BytesPerLine 128 }

TotalSeconds    : 61.0464638

Exact same output, almost twice as fast.

Lack of IT Content Volume I

Hey guys. No interesting and deeply technical documentation today. I've been pretty busy with work, and also going over Powershell material in preparation of teaching a Powershell Boot Camp at work. In the mean time though, here's something interesting I read today. It's a quote from Nikola Tesla written in the New York Times on the subject of Thomas Edison, the day after Edison died:

"He had no hobby, cared for no sort of amusement of any kind and lived in utter disregard of the most elementary rules of hygiene  ... His method was inefficient in the extreme, for an immense ground had to be covered to get anything at all unless blind chance intervened and, at first, I was almost a sorry witness of his doings, knowing that just a little theory and calculation would have saved him 90 percent of the labor. But he had a veritable contempt for book learning and mathematical knowledge, trusting himself entirely to his inventor's instinct and practical American sense."

Reading that instantly made me think of the more popular, and now more amusing quotes from Edison himself:

"Genius is one per cent inspiration and ninety-nine per cent perspiration."
"I have not failed. I've just found 10,000 ways that won't work."

An interesting perspective on the two diametric inventors, is it not?

Work smarter, not harder, folks.