How do I stop getting an object from two similarly named XML nodes when creating a custom object?
I am trying to parse several RSS news feeds, which I will later filter based on what I am looking for. Each feed has a slightly different XML schema but in general has a title, description, link and pubDate. Some use a CDATA section and some don't, so I incorporated an if statement for those that use it. I am trying to write one routine that handles them all. Here is a sample of the XML giving me the headache:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title><![CDATA[ABC7 RSS Feed]]></title>
<link><![CDATA[https://abc7news.com/feed]]></link>
<lastBuildDate><![CDATA[Thu, 13 Jan 2022 15:49:04 +0000]]></lastBuildDate>
<pubDate><![CDATA[Thu, 13 Jan 2022 15:49:04 +0000]]></pubDate>
<description>Keep up with news from your local ABC station.</description>
<copyright>Copyright 2022 ABC Inc., KGO-TV San Francisco</copyright>
<managingEditor>[email protected](KGO-TV San Francisco)</managingEditor>
<webMaster>[email protected](KGO-TV San Francisco)</webMaster>
<language><![CDATA[en]]></language>
<item>
<title><![CDATA[Biden gives COVID response update; administration to deploy military teams to hospitals | LIVE]]></title>
<description><![CDATA[Starting next week, 1,000 military medical personnel will begin arriving to help mitigate staffing crunches at hospitals across the country. ]]></description>
<pubDate><![CDATA[Thu, 13 Jan 2022 15:38:02 +0000]]></pubDate>
<link><![CDATA[https://abc7news.com/us-covid-biden-speech-today-hospitalizations/11462828/]]></link>
<type><![CDATA[post]]></type>
<guid><![CDATA[https://abc7news.com/us-covid-biden-speech-today-hospitalizations/11462828/]]></guid>
<dc:creator><![CDATA[AP]]></dc:creator>
<media:keywords><![CDATA[us covid, biden covid, biden speech today, covid hospitalizations, omicron variant, us hospitals, covid cases, covid omicron, biden military medical teams]]></media:keywords>
<category><![CDATA[Health & Fitness,omicron variant,Coronavirus,military,joe biden,hospitals,u.s. & world]]></category>
<guid isPermaLink="false">health/live-biden-highlighting-federal-surge-to-help-weather-omicron/11462828/</guid>
</item>
<item>
<title><![CDATA[Massive backup on Bay Bridge after early morning crash]]></title>
<description><![CDATA[A massive backup continues on the Bay Bridge after an earlier multi-vehicle crash past Treasure Island.]]></description>
<pubDate><![CDATA[Thu, 13 Jan 2022 15:30:15 +0000]]></pubDate>
<link><![CDATA[https://abc7news.com/bay-bridge-crash-traffic-accident-sf-commute/11463119/]]></link>
<type><![CDATA[post]]></type>
<guid><![CDATA[https://abc7news.com/bay-bridge-crash-traffic-accident-sf-commute/11463119/]]></guid>
<dc:creator><![CDATA[KGO]]></dc:creator>
<media:title><![CDATA[Crash triggers massive backup on Bay Bridge]]></media:title>
<media:description><![CDATA[A crash on the Bay Bridge triggered massive gridlock for the Thursday morning commute.]]></media:description>
<media:videoId>11463404</media:videoId>
<media:thumbnail url="https://cdn.abcotvs.com/dip/images/11463261_011322-kgo-sky7-bay-bridge-traffic-img.jpg" width="1280" height="720" />
<enclosure url="https://vcl.abcotv.net/video/kgo/011322-kgo-6am-bay-bridge-crash-vid.mp4" length="79" type="video/mp4" />
<media:keywords><![CDATA[Bay Bridge crash, traffic, accident, SF commute, Oakland drive times, bay bridge toll plaza backup, Bay Area, treasure island,]]></media:keywords>
<category><![CDATA[Traffic,Treasure Island,Oakland,San Francisco,CHP,bay bridge,crash]]></category>
<guid isPermaLink="false">traffic/massive-backup-on-bay-bridge-after-early-morning-crash/11463119/</guid>
</item>
</channel>
</rss>
And here is the parsing code, which puts each item into an object ($posts):
$rss = [xml] (Get-Content 'I:\RSS_Project\Feeds\feed-3.xml')
# start with an empty array so that += appends one object per item
$posts = @()
$rss.SelectNodes('//item') | ForEach-Object {
    $posts += New-Object psobject -Property @{
        # prefer the CDATA content when present, otherwise fall back to the plain value
        Title   = if ($_.Title."#cdata-section") { $_.Title."#cdata-section" } else { $_.Title }
        Desc    = if ($_.description."#cdata-section") { $_.description."#cdata-section" } else { $_.description }
        link    = if ($_.link."#cdata-section") { $_.link."#cdata-section" } else { $_.link }
        pubDate = if ($_.pubDate."#cdata-section") { $_.pubDate."#cdata-section" } else { $_.pubDate }
    }
}
I get the right link and pubDate with this feed, but because some items also contain a media:title and media:description (yes, it is not consistent even within the same feed), I get {title,media:title} in the title property of the custom objects in $posts.
With this data it would be {Massive backup on Bay Bridge after early morning crash,Crash triggers massive backup on Bay Bridge}. I can't figure out how to avoid capturing the media:title data. My other XML feeds don't have media:title.
Can I do a pre-emptive strike and remove this ahead of time if it exists in any feed? I tried using $_.Title[0], which worked on this feed, but since the other feeds don't return an array, it did not work on those. I have the same issue where media:description exists in the item. I output the data into an HTML table, which only lists "System.Object" when the title or description is an array. Any help keeping media:title out of my object would be greatly appreciated.
Solution 1:
PowerShell's XML type adapter can be a bit "wonky" (for lack of a better technical term), because it attempts to simplify something complex - and as a result, it simply ignores namespace prefixes and resolves nodes by their local name instead, leading to $_.title resolving both the <title> and <media:title> elements.
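The difference can be seen on a small sample. This is a minimal sketch, assuming a hypothetical one-item feed trimmed down from the one in the question; the variable names ($doc, $item, $viaAdapter, $viaXPath) are illustrative only:

```powershell
# Hypothetical sample: one item with both a plain and a media-prefixed title
[xml]$doc = @'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title><![CDATA[Plain title]]></title>
      <media:title><![CDATA[Media title]]></media:title>
    </item>
  </channel>
</rss>
'@

$item = $doc.SelectSingleNode('//item')

# Property access goes through the adapter and matches on local name,
# so both elements come back (2 with this sample)
$viaAdapter = @($item.title)
$viaAdapter.Count

# An unprefixed XPath name only matches elements in no namespace,
# so <media:title> is skipped
$viaXPath = $item.SelectSingleNode('./title')
$viaXPath.InnerText   # Plain title
```

The @( ) wrapper is only there so that .Count behaves the same whether the adapter returns one node or several.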
Instead, use XPath to resolve the values as well:
$fields = 'title','description','pubDate','link'
$posts = foreach($item in $rss.SelectNodes('//item')) {
    # create dictionary to hold properties of the object we want to construct
    $properties = [ordered]@{}
    # now let's try to resolve them all
    foreach($fieldName in $fields) {
        # use a relative XPath expression to extract relevant child node from current item
        $value = $item.SelectSingleNode("./${fieldName}")
        # handle content wrapped in CData
        if($value.HasChildNodes -and $value.ChildNodes[0] -is [System.Xml.XmlCDataSection]){
            $value = $value.ChildNodes[0]
        }
        # add node value to dictionary
        $properties[$fieldName] = $value.InnerText
    }
    # output resulting object
    [pscustomobject]$properties
}
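The question also asks whether the media: elements could be removed up front. That is a different approach from the one above. A minimal sketch of it follows, assuming the media prefix is bound to the namespace URI declared in the sample feed. The trimmed-down document and the names $doc, $ns and $stale are hypothetical:

```powershell
# Hypothetical sample: one item with a plain and a media-prefixed title
[xml]$doc = @'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title><![CDATA[Plain title]]></title>
      <media:title><![CDATA[Media title]]></media:title>
    </item>
  </channel>
</rss>
'@

# Prefixed names in XPath need a namespace manager that maps the prefix to its URI
$ns = New-Object System.Xml.XmlNamespaceManager($doc.NameTable)
$ns.AddNamespace('media', 'http://search.yahoo.com/mrss/')

# Snapshot the matches with @( ) before removing, so the list is not modified while enumerating
$stale = @($doc.SelectNodes('//item/media:title | //item/media:description', $ns))
foreach ($node in $stale) {
    [void]$node.ParentNode.RemoveChild($node)
}

# Only the unprefixed <title> is left for the adapter to find
@($doc.SelectSingleNode('//item').title).Count
```

For feeds that contain no media: elements the SelectNodes call simply returns nothing, so the loop is a no-op and the same routine can run against every feed.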