Can someone optimize my .NET regex for PowerShell - parsing a table with errors

Solution 1:

This may not be what you are looking for, but you could do something like the following to output an array of custom objects:

$output = switch -regex ($requestdata.content -split '\r?\n') {
    '^##\s' {
        # tracking empty lines since there is one under the service title
        # start new hash table when a new service is found
        # remove ## from service title names
        $emptyLineCount = 0
        $hash = [ordered]@{}
        $hash.ServiceTitle = $_ -replace '^##\s'
    }
    '\| \*\*' {
        # split on | and surrounding spaces
        # replace ** so name is cleaner
        if ($hash.ServiceTitle) {
            $key,$value = ($_ -split '\s*\|\s*' -replace '\*\*')[1,2]
            $hash[$key] = $value
        }
    }
    '^$' {
        # when second empty line is reached in a service block, output object
        if ($hash.ServiceTitle -and ++$emptyLineCount -eq 2) {
            [pscustomobject]$hash
        }
    }
}

# Finding a service by title
$output | Where ServiceTitle -eq 'CNG Key Isolation'

Splitting the content produces an array of lines, which I find easier to process with a switch statement.
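
To show why working line by line is convenient, here is a minimal sketch run against made-up sample text (the service names and values are hypothetical, and the second block deliberately omits the trailing |). It is a variation on the code above rather than the same logic: instead of counting empty lines, it emits an object when the next heading appears and flushes the last one after the loop, on the assumption that every block starts with a ## heading.

```powershell
# Hypothetical sample text that mimics the layout of the service tables
$sample = @(
    '## Example Service A'
    ''
    '| Name | Description |'
    '|--|--|'
    '| **Service name** | SvcA |'
    '| **Startup type** | Manual |'
    ''
    '## Example Service B'
    ''
    '| Name | Description |'
    '|--|--|'
    '| **Service name** | SvcB'
    '| **Startup type** | Automatic'
) -join "`n"

$hash = $null
$services = @(
    switch -regex ($sample -split '\r?\n') {
        '^##\s' {
            # a new heading closes the previous block, if there is one
            if ($null -ne $hash) { [pscustomobject]$hash }
            $hash = [ordered]@{ ServiceTitle = $_ -replace '^##\s' }
        }
        '^\|\s*\*\*' {
            # only rows whose first cell is bold carry data
            if ($null -ne $hash) {
                $key, $value = ($_ -split '\s*\|\s*' -replace '\*\*')[1, 2]
                $hash[$key] = $value
            }
        }
    }
    # flush the last block, which has no heading after it
    if ($null -ne $hash) { [pscustomobject]$hash }
)

$services | Format-Table
```

Because each row is split on | with optional surrounding whitespace, the rows without a trailing | produce the same key and value as the well-formed ones.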


A purer regex solution is more brittle when the data is inconsistent. The data block for CNG Key Isolation is missing the | at the end of each line, and it is the only block like that, so you now have to either match that special case or fix the data.

$fields = "ServiceTitle","ServiceName","Description","Installation","StartupType","Recommendation","Comments"
$RequestData = Invoke-WebRequest -UseBasicParsing -Uri https://raw.githubusercontent.com/MicrosoftDocs/windowsserverdocs/main/WindowsServerDocs/security/windows-services/security-guidelines-for-disabling-system-services-in-windows-server.md
$regexString = '(?m)^##\s(?<ServiceTitle>.*)$(?s).*?\*\*Service name\*\* \| (?<ServiceName>.*?(?=\s+\|)).*?\*\*Description\*\* \| (?<Description>.*?(?=\s+\|)).*?\*\*Installation\*\* \| (?<Installation>.*?(?=\s+\|)).*?\*\*Startup type\*\* \| (?<StartupType>.*?(?=\s+\|)).*?\*\*Recommendation\*\* \| (?<Recommendation>.*?(?=\s+\|)).*?\*\*Comments\*\* \| (?<Comments>.*?(?=\s+\|))'
$out = $RequestData.Content |
    Select-String -Pattern $regexString -AllMatches |
    ForEach-Object {
        foreach ($match in $_.Matches) {
            # build one ordered object per match from the named capture groups
            $hash = [ordered]@{}
            foreach ($field in $fields) {
                $hash.$field = $match.Groups[$field].Value
            }
            [pscustomobject]$hash
        }
    }
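
The brittleness is easier to see on a single row. The sketch below tests one field of the pattern against a row with and without the trailing |. The `$tolerant` pattern is a hypothetical alternative I am adding for illustration, not part of the long regex above. Note that in the full pattern `(?s)` lets `.` cross line breaks, so the lookahead can find a | on a following line and the failure may show up as a wrong capture rather than a missing match.

```powershell
# Two hypothetical rows: one well formed, one missing the trailing pipe
$withPipe    = '| **Service name** | KeyIso |'
$withoutPipe = '| **Service name** | KeyIso'

# Same shape as in the long pattern: the value must be followed by whitespace and a pipe
$strict = '\*\*Service name\*\* \| (?<ServiceName>.*?(?=\s+\|))'

# Assumed alternative: the trailing pipe is optional and the value ends at the end of the line
$tolerant = '\*\*Service name\*\* \| (?<ServiceName>.*?)\s*\|?\s*$'

$rows = foreach ($line in $withPipe, $withoutPipe) {
    $strictOk = $line -match $strict
    $tolerantValue = if ($line -match $tolerant) { $Matches['ServiceName'] } else { $null }
    [pscustomobject]@{
        Line          = $line
        StrictMatch   = $strictOk
        TolerantValue = $tolerantValue
    }
}

$rows | Format-Table -AutoSize
```

On these two rows the strict pattern matches only the first, while the tolerant one captures KeyIso from both. To use the tolerant form against the whole document rather than one line at a time, it would also need `(?m)` so that `$` matches at each line end.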

Solution 2:

Assuming all of that text is in $RequestData.content, I wouldn't try to build one large regex to parse it into usable objects. Instead, I would do this:

# first split the tables from the rest of the text and work on the table lines only
$result = ($RequestData.content -split '(?m)^The following tables.*:')[-1].Trim() -split '(?m)^## ' | 
    Where-Object { $_ -match '\S' } |
    ForEach-Object {
        # split each block to parse out the title and the table data
        # (non-capturing group, so -split does not also return the matched line break)
        $title, $table = ($_.Trim() -split '(?:\r?\n){2}', 2).Trim()
        # now remove the markdown stuff from the data and convert it using ConvertFrom-Csv
        $data = (($table -replace '(?m)^\|--\|--\||[*]{2}|^\||\|$' -replace '\s\|\s', '|') -split '\r?\n' -ne '').Trim()  | ConvertFrom-Csv -Delimiter '|'
        # set up an ordered Hashtable to store the data
        $hash = [ordered]@{ServiceTitle = $title}
        foreach ($item in $data) {
            $hash[$item.Name] = $item.Description
        }
        # output real objects
        [PsCustomObject]$hash
    }

$result
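
To make the cleanup step less opaque, here is a small worked example of what the two -replace calls and ConvertFrom-Csv do to a single table. The table content is made up, and it is joined with "`n" on the assumption that the real file uses LF line endings, as the `\|$` part of the pattern will not strip a | that is followed by a carriage return.

```powershell
# Hypothetical table in the same markdown layout as the service tables
$table = @(
    '| Name | Description |'
    '|--|--|'
    '| **Service name** | SvcA |'
    '| **Startup type** | Manual |'
) -join "`n"

# 1. drop the |--|--| separator row, the ** markers, and the leading/trailing pipes
# 2. collapse ' | ' between cells to a bare '|'
# 3. split into lines, discard the empty line left by the separator row, trim each line
$lines = (($table -replace '(?m)^\|--\|--\||[*]{2}|^\||\|$' -replace '\s\|\s', '|') -split '\r?\n' -ne '').Trim()

# $lines is now pipe-delimited text whose first line is the header: Name|Description
$data = $lines | ConvertFrom-Csv -Delimiter '|'

# turn the Name/Description rows into properties of one object
$hash = [ordered]@{ ServiceTitle = 'Example Service A' }
foreach ($item in $data) {
    $hash[$item.Name] = $item.Description
}
[pscustomobject]$hash
```

The first table row supplies the column names, which is why the loop can read $item.Name and $item.Description.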