Unable to completely parse XML in PowerShell

I have an XML file that I would like to parse in order to retrieve specific information.

To make it easy to understand, here is a screenshot of what the XML file looks like:

[Screenshot: the XML file, with the fields of interest marked on an Item node]

I would like to parse the XML and, for each Item node, retrieve the fields indicated in the screenshot. The retrieved values need to be formatted per Item node.

Finally, I would love to be able to specify a criterion to look for, and retrieve only the Item nodes that match it.

I have been trying, without luck. Here is what I have been able to come up with:

[xml]$MyXMLFile = gc 'X:\folder\my.xml'
$XMLItem = $MyXMLFile.PatchScan.Machine.Product.Item
$Patch = $XMLItem | Where-Object {$_.Class -eq 'Patch'}
$Patch.BulletinID
$Patch.PatchName
$Patch.Status

When I run the above code, it returns no results. However, if I leave out the Item portion (for testing purposes only), I can get it working.

I load the XML into an XML object. When I traverse it down to Product, it works perfectly:

PS> $xmlobj.PatchScan.Machine.Product | Select-Object -Property Name, SP

Name                                                SP
----                                                --
Windows 10 Pro (x64)                                1607
Internet Explorer 11 (x64)                          Gold
Windows Media Player 12.0                           Gold
MDAC 6.3 (x64)                                      Gold
.NET Framework 4.7 (x64)                            Gold
MSXML 3.0                                           SP11
MSXML 6.0 (x64)                                     SP3
DirectX 9.0c                                        Gold
Adobe Flash 23                                      Gold
VMware Tools x64                                    Gold
Microsoft Visual C++ 2008 SP1 Redistributable       Gold
Microsoft Visual C++ 2008 SP1 Redistributable (x64) Gold

Now add Item in and Intellisense puts up a bracket as if Item was a method $xmlobj.PatchScan.Machine.Product.Item( ← See that? So that is why I think for some reason the Item node is doing something strange and that is my roadblock.

This screenshot shows more clearly how the file starts with many Product folders, each of which contains many Item folders.

[Screenshot: multiple Product nodes, each containing multiple Item nodes]

The XML in the product folder I don't care about. I need the individual information in each item folder.


Solution 1:

XML is a structured text format. It knows nothing about "folders". What you see in your screenshots is just how the data is rendered by the program you use for displaying it.

Anyway, the best approach to get what you want is using SelectNodes() with an XPath expression. As usual.

[xml]$xml = Get-Content 'X:\folder\my.xml'
$xml.SelectNodes('//Product/Item[@Class="Patch"]') |
    Select-Object BulletinID, PatchName, Status
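The question also asks for a way to filter on a criterion. One way to sketch that is to add a second XPath predicate. The sample data and the `$status` variable below are made up for illustration; the element names are the ones used in the question. Keep in mind that XPath matching is case-sensitive, so element and attribute names must be spelled exactly as they appear in the file.

```powershell
# Hypothetical sample data; only the element names are taken from the question.
[xml]$xml = @'
<PatchScan>
  <Machine>
    <Product>
      <Name>Windows 10 Pro (x64)</Name>
      <Item Class="Patch">
        <BulletinID>MS16-001</BulletinID>
        <PatchName>a.msu</PatchName>
        <Status>Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinID>MS16-002</BulletinID>
        <PatchName>b.msu</PatchName>
        <Status>Missing</Status>
      </Item>
    </Product>
  </Machine>
</PatchScan>
'@

# Illustrative criterion: only Item elements whose Status child has this value.
$status = 'Missing'

# The second predicate compares the text of the Status child element.
$missing = $xml.SelectNodes("//Product/Item[@Class='Patch'][Status='$status']") |
    Select-Object BulletinID, PatchName, Status

$missing
```

With this sample data, only the second Item (MS16-002) is selected.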

Solution 2:

tl;dr

As you suspected, a name collision prevented access to the .Item property on the XML elements of interest; fix the problem with explicit enumeration of the parent elements:

$xml.PatchScan.Machine.Product | % { $_.Item | select BulletinId, PatchName, Status }

% is a built-in alias for the ForEach-Object cmdlet; see bottom section for an explanation.


As an alternative, Ansgar Wiechers' helpful answer offers a concise XPath-based solution, which is both efficient and allows sophisticated queries.

As an aside: PowerShell v3+ comes with the Select-Xml cmdlet, which takes a file path as an argument, allowing for a single-pipeline solution:

(Select-Xml -LiteralPath X:\folder\my.xml '//Product/Item[@Class="Patch"]').Node |
  Select-Object BulletinId, PatchName, Status

Select-Xml wraps the matching XML nodes in an outer object, hence the need to access the .Node property.
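To make the wrapping visible, here is a small sketch that runs Select-Xml against an in-memory document through its -Xml parameter instead of a file. The document content is made up for illustration.

```powershell
# Hypothetical in-memory document, so that no file is needed.
[xml]$doc = '<Product><Item Class="Patch"><Status>Installed</Status></Item><Item Class="Tool"><Status>n/a</Status></Item></Product>'

# @(...) guarantees an array, even if only one node matches.
$hits = @(Select-Xml -Xml $doc -XPath '//Item[@Class="Patch"]')

# Each result is a wrapper object, not the XML node itself.
$wrapperType = $hits[0].GetType().FullName   # Microsoft.PowerShell.Commands.SelectXmlInfo

# Unwrap via the .Node property; this form works mid-pipeline as well.
$rows = $hits | Select-Object -ExpandProperty Node | Select-Object Class, Status

$wrapperType
$rows
```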


PowerShell's adaptation of the XML DOM (dot notation):

PowerShell decorates the object hierarchy contained in [System.Xml.XmlDocument] instances (created with cast [xml], for instance):

  • with properties named for the input document's specific elements and attributes[1] at every level; e.g.:

     ([xml] '<foo><bar>baz</bar></foo>').foo.bar # -> 'baz'
     ([xml] '<foo><bar id="1" /></foo>').foo.bar.id # -> '1'
    
  • turning multiple elements of the same name at a given hierarchy level implicitly into arrays (specifically, of type [object[]]); e.g.:

     ([xml] '<foo><C>one</C><C>two</C></foo>').foo.C[1] # -> 'two'
    

As the examples (and your own code in the question) show, this allows for access via convenient dot notation.

Note: If you use dot notation to target an element that has at least one attribute and/or child elements, the element itself is returned (an XmlElement instance); otherwise, it is the element's text content; for information about updating XML documents via dot notation, see this answer.
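A small sketch of the distinction described in the note above, using made-up element names (leaf, branch):

```powershell
# <leaf> has neither attributes nor child elements; <branch> has an attribute.
[xml]$doc = '<foo><leaf>text</leaf><branch id="1">text</branch></foo>'

$leaf   = $doc.foo.leaf     # the element's text content, as a string
$branch = $doc.foo.branch   # the element itself

$leaf.GetType().Name        # String
$branch.GetType().Name      # XmlElement

# For an element, the text is still reachable via the intrinsic InnerText property.
$branch.InnerText           # text
```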

The downside of dot notation is that there can be name collisions, if an incidental input-XML element name happens to be the same as either an intrinsic [System.Xml.XmlElement] property name (for single-element properties), or an intrinsic [Array] property name (for array-valued properties; [System.Object[]] derives from [Array]).

In the event of a name collision: If the property being accessed contains:

  • a single child element ([System.Xml.XmlElement]), the incidental properties win.

    • This too can be problematic, because it makes accessing intrinsic type properties unpredictable - see bottom section.
  • an array of child elements, the [Array] type's properties win.

    • Therefore, the following element names break dot notation with array-valued properties (obtained with reflection command
      Get-Member -InputObject 1, 2 -Type Properties, ParameterizedProperty):

          Item Count IsFixedSize IsReadOnly IsSynchronized Length LongLength Rank SyncRoot
      

See the last section for a discussion of this difference and for how to gain access to the intrinsic [System.Xml.XmlElement] properties in the event of a collision.
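The following sketch reproduces the collision from the question in miniature. The document is made up for illustration: it has two Product elements, so .Product is array-valued.

```powershell
[xml]$doc = @'
<Root>
  <Product><Item>a</Item><Item>b</Item></Product>
  <Product><Item>c</Item></Product>
</Root>
'@

# On a single Product element, .Item yields the child elements' values.
$single = $doc.Root.Product[0].Item      # a, b

# On the array of Product elements, .Item resolves to the array type's own
# Item member, so the child elements are NOT returned.
$viaArray = $doc.Root.Product.Item

# Explicit enumeration accesses .Item on each Product element in turn.
$all = $doc.Root.Product | ForEach-Object { $_.Item }   # a, b, c

$all
```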

The workaround is to use explicit enumeration of array-valued properties, using the ForEach-Object cmdlet, as demonstrated at the top.
Here is a complete example:

[xml] $xml = @'
<PatchScan>
  <Machine>
    <Product>
      <Name>Windows 10 Pro (x64)</Name>
      <Item Class="Patch">
        <BulletinId>MSAF-054</BulletinId>
        <PatchName>windows10.0-kb3189031-x64.msu</PatchName>
        <Status>Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinId>MSAF-055</BulletinId>
        <PatchName>windows10.0-kb3189032-x64.msu</PatchName>
        <Status>Not Installed</Status>
      </Item>
    </Product>
    <Product>
      <Name>Windows 7 Pro (x86)</Name>
      <Item Class="Patch">
        <BulletinId>MSAF-154</BulletinId>
        <PatchName>windows7-kb3189031-x86.msu</PatchName>
        <Status>Partly Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinId>MSAF-155</BulletinId>
        <PatchName>windows7-kb3189032-x86.msu</PatchName>
        <Status>Uninstalled</Status>
      </Item>
    </Product>
  </Machine>
</PatchScan>
'@

# Enumerate the array-valued .Product property explicitly, so that
# the .Item property can successfully be accessed on each XmlElement instance.
$xml.PatchScan.Machine.Product | 
  ForEach-Object { $_.Item | Select-Object BulletinId, PatchName, Status }

The above yields:

BulletinId PatchName                     Status
---------- ---------                     ------
MSAF-054   windows10.0-kb3189031-x64.msu Installed
MSAF-055   windows10.0-kb3189032-x64.msu Not Installed
MSAF-154   windows7-kb3189031-x86.msu    Partly Installed
MSAF-155   windows7-kb3189032-x86.msu    Uninstalled

Further down the rabbit hole: What properties are shadowed when:

Note: By shadowing I mean that in the case of a name collision, the "winning" property - the one whose value is reported - effectively hides the other one, thereby "putting it in the shadow".


In the case of using dot notation with arrays, a feature called member enumeration comes into play, which applies to any collection in PowerShell v3+; in other words: the behavior is not specific to the [xml] type.

In short: accessing a property on a collection implicitly accesses the property on each member of the collection (item in the collection) and returns the resulting values as an array ([System.Object[]]); e.g.:

# Using member enumeration, collect the value of the .prop property from
# the array's individual *members*.
> ([pscustomobject] @{ prop = 10 }, [pscustomobject] @{ prop = 20 }).prop
10
20

However, if the collection type itself has a property by that name, the collection's own property takes precedence; e.g.:

# !! Since arrays themselves have a property named .Count,
# !! member enumeration does NOT occur here.
> ([pscustomobject] @{ count = 10 }, [pscustomobject] @{ count = 20 }).Count
2  # !! The *array's* count property was accessed, returning the count of elements

In the case of using dot notation with [xml] (PowerShell-decorated System.Xml.XmlDocument and System.Xml.XmlElement instances), the PowerShell-added, incidental properties shadow the type-intrinsic ones.[2]

While this behavior is easy to grasp, the fact that the outcome depends on the specific input can also be treacherous:

For instance, in the following example the incidental name child element shadows the intrinsic property of the same name on the element itself:

> ([xml] '<xml><child>foo</child></xml>').xml.Name
xml  # OK: The element's *own* name

> ([xml] '<xml><name>foo</name></xml>').xml.Name
foo  # !! .name was interpreted as the incidental *child* element

If you do need to gain access to the intrinsic type's properties, use .get_<property-name>():

> ([xml] '<xml><name>foo</name></xml>').xml.get_Name()
xml  # OK - intrinsic property value, thanks to use of .get_*()

[1] If a given element has both an attribute and an element by the same name, PowerShell reports both, as the elements of an array ([object[]]).

[2] Seemingly, when PowerShell adapts the underlying System.Xml.XmlElement type behind the scenes, it doesn't expose its properties as such, but via get_* accessor methods, which still allows access as if they were properties, but with the PowerShell-added incidental-but-bona-fide properties taking precedence. Do let us know if you know more about this.