Using PowerShell to output characters (not lines) after a match in a large file

I use PowerShell to parse huge files and quickly look at the small part of a file where a certain string occurs, like this:

Select-String P120300420059211107104259.txt -Pattern "<ID>9671510841" -Context 0,300

This gives me the 300 lines of the file that follow the occurrence of that ID number.

But I've come across a file that has no carriage returns, so it is effectively one enormous line. I would like to do the same thing, but since there are no lines to return, I guess I need characters instead. How would I do this? I have never written scripts in PowerShell; I have only run simple commands like the one above.

I would like to see maybe 1000 characters after the matched string, within a huge file. Thanks!


Solution 1:

The problem with using Select-String or [Regex]::Matches() (or -match) to test for the presence of a substring in a single-line file is that you first need to read the whole file into memory at once.
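For comparison, this is roughly what the whole-file approach looks like when the file does fit comfortably in memory. It is only a sketch: `Get-TextAfterMatch` is a hypothetical helper name (not a built-in cmdlet), and it returns the match plus the text that follows it, up to `$Count` characters in total.

```powershell
function Get-TextAfterMatch {
  param(
    [Parameter(Mandatory = $true)][string]$Path,
    [Parameter(Mandatory = $true)][string]$Substring,
    [int]$Count = 1000
  )

  # reads the ENTIRE file into memory: fine for small files, costly for huge ones
  $text = [System.IO.File]::ReadAllText((Convert-Path -LiteralPath $Path))

  # ordinal (case-sensitive, culture-independent) search for the first occurrence
  $index = $text.IndexOf($Substring, [StringComparison]::Ordinal)
  if ($index -lt 0) { return }

  # return the match plus whatever follows it, up to $Count characters in total
  $text.Substring($index, [Math]::Min($Count, $text.Length - $index))
}
```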

The good news is that you don't need regular expressions to find a substring in a huge single-line text file. Instead, you can read the file in smaller chunks and search each chunk in turn, so the entire file never has to be held in memory at once.

Reading buffered text from a file is fairly straightforward:

  • Open a readable file stream
  • Create a StreamReader to read from the file stream
  • Start reading!
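
The three steps above can be sketched as follows. This is a minimal illustration rather than the full function: `Read-FileChunk` is a hypothetical helper name, and the 2000-character default chunk size is an arbitrary choice.

```powershell
function Read-FileChunk {
  param(
    [Parameter(Mandatory = $true)][string]$Path,
    [int]$ChunkSize = 2000
  )

  $reader = $null
  # 1. open a readable file stream
  $fileStream = [System.IO.File]::OpenRead((Convert-Path -LiteralPath $Path))
  try {
    # 2. create a StreamReader on top of it ($true = detect encoding from a byte order mark)
    $reader = [System.IO.StreamReader]::new($fileStream, $true)

    # 3. start reading: ReadBlock fills the buffer unless the end of the file comes first
    $buffer = [char[]]::new($ChunkSize)
    do {
      $readCount = $reader.ReadBlock($buffer, 0, $ChunkSize)
      if ($readCount -gt 0) {
        # emit the characters we just read as a string
        [string]::new($buffer, 0, $readCount)
      }
    } while ($readCount -eq $ChunkSize)
  }
  finally {
    # disposing the reader also closes the underlying stream
    if ($reader) { $reader.Dispose() } else { $fileStream.Dispose() }
  }
}
```

Only one chunk-sized buffer is allocated, no matter how large the file is.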

Then you just need to check whether:

  • The target substring is found in each chunk, or
  • The start of the target substring is partially found at the tail end of the current chunk

Repeat this until you find the substring, at which point you read ahead as far as needed to return the match together with the text that follows it (1000 characters in total by default).
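
The second check is the subtle one. A rough sketch of it, using a hypothetical helper `Get-PartialMatchIndex` that returns the index at which a prefix of the target begins at the very end of a chunk, or -1 if there is none:

```powershell
function Get-PartialMatchIndex {
  param(
    [Parameter(Mandatory = $true)][AllowEmptyString()][string]$Chunk,
    [Parameter(Mandatory = $true)][string]$Target
  )

  # a clipped match can be at most ($Target.Length - 1) characters long,
  # so only the last few characters of the chunk need to be examined
  $start = [Math]::Max(0, $Chunk.Length - $Target.Length + 1)
  for ($i = $start; $i -lt $Chunk.Length; $i++) {
    # does the rest of the chunk, from $i onwards, look like the start of the target?
    if ($Target.StartsWith($Chunk.Substring($i), [StringComparison]::Ordinal)) {
      return $i
    }
  }
  return -1
}

Get-PartialMatchIndex -Chunk 'abc<I' -Target '<ID>'   # 3: '<I' may be the start of a match
Get-PartialMatchIndex -Chunk 'abcdef' -Target '<ID>'  # -1: nothing needs to be re-tested
```

Whatever starts at the returned index has to be tested again together with the next chunk, since the rest of the target may only show up there.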

Here's an example of how you could implement it as a script function (I've tried to explain the code in more detail in inline comments):

function Find-SubstringWithPostContext {
  [CmdletBinding(DefaultParameterSetName = 'wp')]
  param(
    [Alias('PSPath')]
    [Parameter(Mandatory = $true, ParameterSetName = 'lp', ValueFromPipelineByPropertyName = $true, ValueFromPipeline = $true)]
    [string[]]$LiteralPath,
  
    [Parameter(Mandatory = $true, ParameterSetName = 'wp', Position = 0)]
    [string[]]$Path,
  
    [Parameter(Mandatory = $true)]
    [ValidateLength(1, 5000)]
    [string]$Substring,

    [ValidateRange(2, 25000)]
    [int]$PostContext = 1000,

    [switch]$All,

    [System.Text.Encoding]
    $Encoding
  )

  begin {
    # start by ensuring we'll be using a buffer that's at least 4 times larger than
    # the target substring to avoid too many tail searches
    $bufferSize = 2000
    while ($Substring.Length -gt $bufferSize / 4) {
      $bufferSize *= 2
    }
    $buffer = [char[]]::new($bufferSize)
  }

  process {
    if ($PSCmdlet.ParameterSetName -eq 'wp') {
      # resolve input paths if necessary
      $LiteralPath = $Path | Convert-Path
    }
    
    :fileLoop foreach ($lp in $LiteralPath) {
      $file = Get-Item -LiteralPath $lp

      # skip directories
      if ($file -isnot [System.IO.FileInfo]) { continue }
        
      try {
        $fileStream = $file.OpenRead()
        # honor -Encoding if it was given, otherwise let the reader detect
        # the encoding from a byte order mark (falling back to UTF-8)
        $scanner = if ($Encoding) {
          [System.IO.StreamReader]::new($fileStream, $Encoding, $true)
        }
        else {
          [System.IO.StreamReader]::new($fileStream, $true)
        }

        # StreamReader reads ahead of what it hands out, so the position of the
        # underlying stream can't be used for offsets or for rewinding. Instead we
        # track the character offset ourselves and carry unprocessed text in $carry
        $carry = ''
        $baseOffset = 0
        $eof = $false

        do {
          # read a chunk from the file (unless plenty of text is still pending),
          # convert to string and prepend whatever was carried over
          $readCount = 0
          if (-not $eof -and $carry.Length -lt $bufferSize) {
            $readCount = $scanner.ReadBlock($buffer, 0, $bufferSize)
            $eof = $readCount -lt $bufferSize
          }
          $string = $carry + [string]::new($buffer, 0, $readCount)

          # test if target substring is found in the current chunk
          # (ordinal comparison: case-sensitive and culture-independent)
          $indexOfTarget = $string.IndexOf($Substring, [StringComparison]::Ordinal)
          if ($indexOfTarget -ge 0) {
            Write-Verbose "Substring found in chunk at local index ${indexOfTarget}"
            # we found a match, ensure we've read enough post-context ahead of the given index
            $missing = $PostContext - ($string.Length - $indexOfTarget)
            if ($missing -gt 0 -and -not $eof) {
              # just like above, we read more from the file and append it to the chunk
              $tailBuffer = [char[]]::new($missing)
              $tailCount = $scanner.ReadBlock($tailBuffer, 0, $missing)
              $eof = $tailCount -lt $missing
              $string += [string]::new($tailBuffer, 0, $tailCount)
            }

            # construct and output the match plus post-context, at most $PostContext characters
            $length = [Math]::Min($PostContext, $string.Length - $indexOfTarget)
            $substringWithPostContext = $string.Substring($indexOfTarget, $length)

            Write-Verbose "Writing output object ..."
            Write-Output $([PSCustomObject]@{
              FilePath = $file.FullName
              # offset in characters (equals the byte offset only for single-byte encodings without a BOM)
              Offset = $baseOffset + $indexOfTarget
              Value = $substringWithPostContext
            })

            if (-not $All) {
              # no need to search this file any further unless `-All` was specified
              continue fileLoop
            }

            # resume the search right after this match on the next iteration
            $consumed = $indexOfTarget + $Substring.Length
          }
          else {
            # target was not found, but we may have "clipped" it in half,
            # so figure out if target string could start at the end of current string chunk
            $consumed = $string.Length
            for ($i = [Math]::Max(0, $string.Length - $Substring.Length + 1); $i -lt $string.Length; $i++) {
              # if the first character of the target substring isn't found then
              # we might as well skip it immediately
              if ($string[$i] -cne $Substring[0]) { continue }

              if ($Substring.StartsWith($string.Substring($i), [StringComparison]::Ordinal)) {
                # keep this tail so it'll get re-tested together with the
                # next chunk, then break out of tail search
                $consumed = $i
                break
              }
            }
          }

          # carry over everything we haven't finished with and advance the offset
          $carry = $string.Substring($consumed)
          $baseOffset += $consumed
        } until ($eof -and $carry.Length -lt $Substring.Length)
      }
      finally {
        # remember to clean up after searching each file
        $scanner, $fileStream |Where-Object { $_ -is [System.IDisposable] } |ForEach-Object Dispose
      }
    }
  }
}

Now you can extract the 1000 characters that start at the matched substring (the match itself plus the text that follows it), with minimal memory allocation:

Get-ChildItem P*.txt |Find-SubstringWithPostContext -Substring '<ID>9671510841'