Can someone optimize my .NET regex for PowerShell? Parsing a table with errors
Solution 1:
This may not be what you are looking for, but you could do something like the following to output an array of custom objects:
$output = switch -regex ($requestdata.content -split '\r?\n') {
    '^##\s' {
        # tracking empty lines since there is one under the service title
        # start new hash table when a new service is found
        # remove ## from service title names
        $emptyLineCount = 0
        $hash = [ordered]@{}
        $hash.ServiceTitle = $_ -replace '^##\s'
    }
    '\| \*\*' {
        # split on | and surrounding spaces
        # replace ** so name is cleaner
        if ($hash.ServiceTitle) {
            $key,$value = ($_ -split '\s*\|\s*' -replace '\*\*')[1,2]
            $hash[$key] = $value
        }
    }
    '^$' {
        # when second empty line is reached in a service block, output object
        if ($hash.ServiceTitle -and ++$emptyLineCount -eq 2) {
            [pscustomobject]$hash
        }
    }
}
# Finding a service by title
$output | Where ServiceTitle -eq 'CNG Key Isolation'
Splitting the content produces an array of lines, which I find easier to process with a switch statement.
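To make the split-and-index step in the switch block easier to follow, here is a minimal sketch on two sample rows. The rows are made up for illustration (they are only shaped like the table rows in the source document), and the second one deliberately lacks its trailing |.

```powershell
# Sample rows written for illustration; the second one has no trailing pipe
$complete   = '| **Service name** | AxInstSV |'
$incomplete = '| **Service name** | KeyIso'

# -split on the pipes (and surrounding whitespace) gives '', '**Service name**', 'AxInstSV', ''
# -replace then strips the ** markers from every element; [1,2] picks name and value
$key1, $value1 = ($complete -split '\s*\|\s*' -replace '\*\*')[1,2]

# Without the trailing pipe the array is just one element shorter,
# so indexes 1 and 2 still hold the name and the value
$key2, $value2 = ($incomplete -split '\s*\|\s*' -replace '\*\*')[1,2]

"$key1 = $value1"   # Service name = AxInstSV
"$key2 = $value2"   # Service name = KeyIso
```

Because only elements 1 and 2 are used, a missing closing pipe does not change the result for a row like this.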
Using a purer regex solution makes things more brittle if there are data inconsistencies. The data block for CNG Key Isolation is missing the | at the end of each line, and it is the only block like that. So now you have to either match that special case or fix the data.
$fields = "ServiceTitle","ServiceName","Description","Installation","StartupType","Recommendation","Comments"
$RequestData = Invoke-WebRequest -UseBasicParsing -Uri https://raw.githubusercontent.com/MicrosoftDocs/windowsserverdocs/main/WindowsServerDocs/security/windows-services/security-guidelines-for-disabling-system-services-in-windows-server.md
$regexString = '(?m)^##\s(?<ServiceTitle>.*)$(?s).*?\*\*Service name\*\* \| (?<ServiceName>.*?(?=\s+\|)).*?\*\*Description\*\* \| (?<Description>.*?(?=\s+\|)).*?\*\*Installation\*\* \| (?<Installation>.*?(?=\s+\|)).*?\*\*Startup type\*\* \| (?<StartupType>.*?(?=\s+\|)).*?\*\*Recommendation\*\* \| (?<Recommendation>.*?(?=\s+\|)).*?\*\*Comments\*\* \| (?<Comments>.*?(?=\s+\|))'
$out = $RequestData.Content |
    Select-String -Pattern $regexString -AllMatches |
    ForEach-Object {
        $_.Matches | ForEach-Object {
            $hash = [ordered]@{}
            foreach ($field in $fields) {
                $hash.$field = $_.Groups.where{$_.Name -eq $field}.Value
            }
            [pscustomobject]$hash
        }
    }
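If you go the "fix the data" route, one possible sketch is to append the missing closing pipe before running the regex. Add-TrailingPipe is a hypothetical helper name, and the sample rows are made up for illustration. I have not checked this against every block in the real document, so treat it as a starting point.

```powershell
# Hypothetical helper: appends ' |' to table rows that start with | but do not end with one
function Add-TrailingPipe {
    param([string]$Text)

    $lines = $Text -split '\r?\n' | ForEach-Object {
        if ($_ -match '^\|' -and $_ -notmatch '\|\s*$') {
            # table row without a closing pipe: add one
            $_.TrimEnd() + ' |'
        }
        else {
            # everything else (headings, prose, complete rows) is left alone
            $_
        }
    }
    # rejoin with LF so the result is a single string again
    $lines -join "`n"
}

# Sample input written for illustration; the first row lacks its trailing pipe
$sample = "| **Service name** | KeyIso`n| **Description** | Sample text |"
$normalized = Add-TrailingPipe -Text $sample
$normalized
```

You could then feed the normalized string to Select-String in place of $RequestData.Content, so that every row has a closing pipe on its own line for the lookaheads to find.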
Solution 2:
Assuming you have all that text in $RequestData.content, I wouldn't try to create one large regex to parse it all into usable objects. Instead, I would do:
# first split the tables from the rest of the text and work on the table lines only
$result = ($RequestData.content -split '(?m)^The following tables.*:')[-1].Trim() -split '(?m)^## ' |
    Where-Object { $_ -match '\S' } |
    ForEach-Object {
        # split each block to parse out the title and the table data
        $title, $table = ($_.Trim() -split '(\r?\n){2}', 2).Trim()
        # now remove the markdown stuff from the data and convert it using ConvertFrom-Csv
        $data = (($table -replace '(?m)^\|--\|--\||[*]{2}|^\||\|$' -replace '\s\|\s', '|') -split '\r?\n' -ne '').Trim() | ConvertFrom-Csv -Delimiter '|'
        # set up an ordered Hashtable to store the data
        $hash = [ordered]@{ServiceTitle = $title}
        foreach ($item in $data) {
            $hash[$item.Name] = $item.Description
        }
        # output real objects
        [PsCustomObject]$hash
    }
$result
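To show what the cleanup and ConvertFrom-Csv steps do, here is a small sketch on one table block. The table text and the title are sample values written for illustration, assuming each table starts with a "| Name | Description |" header row followed by a |--|--| separator.

```powershell
# Sample table block written for illustration (LF line endings)
$table = @(
    '| Name | Description |'
    '|--|--|'
    '| **Service name** | AxInstSV |'
    '| **Startup type** | Manual |'
) -join "`n"

# Same cleanup as above: drop the separator row, the ** markers and the outer pipes,
# tighten ' | ' to '|', discard empty lines, then parse the rest as pipe-delimited CSV
$rows = (($table -replace '(?m)^\|--\|--\||[*]{2}|^\||\|$' -replace '\s\|\s', '|') -split '\r?\n' -ne '').Trim() |
    ConvertFrom-Csv -Delimiter '|'

# Each row now has a Name and a Description property taken from the header line
$hash = [ordered]@{ ServiceTitle = 'Sample service title' }
foreach ($row in $rows) {
    $hash[$row.Name] = $row.Description
}
$object = [pscustomobject]$hash

$object.'Service name'   # AxInstSV
```

Note that the property names come straight from the first table column, so they contain spaces and need quoting when you access them, as in $object.'Startup type'.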