The best way to shred XML data into SQL Server database columns

What is the best way to shred XML data into various database columns? So far I have mainly been using the nodes and value functions like so:

INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)

However, I find that this gets very slow for even moderately sized XML data.


Solution 1:

I stumbled across this question whilst having a very similar problem. I'd been running a query processing a 7.5 MB XML file (approx. 10,000 nodes) for around 3.5 to 4 hours before finally giving up.

However, after a little more research I found that after typing the XML using a schema and creating an XML index (I'd bulk inserted the file into a table), the same query completed in ~0.04 ms.

How's that for a performance improvement!

Code to create a schema:

IF EXISTS ( SELECT * FROM sys.xml_schema_collections where [name] = 'MyXmlSchema')
DROP XML SCHEMA COLLECTION [MyXmlSchema]
GO

DECLARE @MySchema XML
SET @MySchema = 
(
    SELECT * FROM OPENROWSET
    (
        BULK 'C:\Path\To\Schema\MySchema.xsd', SINGLE_CLOB 
    ) AS xmlData
)

CREATE XML SCHEMA COLLECTION [MyXmlSchema] AS @MySchema 
GO
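For reference, if you don't want to read the XSD from disk with OPENROWSET, the collection can also be created from an inline schema literal. The element names below (CVSet, Level, CVElement, SequenceNumber) follow the paths used later in this answer, but the schema itself is a hypothetical sketch, not the actual MySchema.xsd:

```sql
-- Hypothetical inline schema; element names match the paths queried later
CREATE XML SCHEMA COLLECTION [MyInlineSchema] AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.mynamespace.fake/schemas"
           xmlns="http://www.mynamespace.fake/schemas"
           elementFormDefault="qualified">
  <xs:element name="CVSet">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Level" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="SequenceNumber" type="xs:int"/>
              <xs:element name="CVElement" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="CVElementId" type="xs:int"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';
```

Typing SequenceNumber as xs:int (rather than leaving it as xs:anyType) also sidesteps the atomization error discussed below.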

Code to create the table with a typed XML column:

CREATE TABLE [dbo].[XmlFiles] (
    [Id] [uniqueidentifier] NOT NULL,

    -- Data from CV element 
    [Data] xml(CONTENT dbo.[MyXmlSchema]) NOT NULL,

CONSTRAINT [PK_XmlFiles] PRIMARY KEY NONCLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
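The bulk insert mentioned above can be done with OPENROWSET as well; a minimal sketch of that load, assuming a hypothetical file path, would be:

```sql
-- Hypothetical path; SINGLE_BLOB avoids codepage conversion issues when casting to xml
INSERT INTO [dbo].[XmlFiles] ([Id], [Data])
SELECT NEWID(), CAST(x.[BulkColumn] AS xml)
FROM OPENROWSET(BULK 'C:\Path\To\Data\MyData.xml', SINGLE_BLOB) AS x;
```

Because the [Data] column is typed, the insert will fail if the document doesn't validate against MyXmlSchema, which is exactly the point of typing the column.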

Code to create the index:

CREATE PRIMARY XML INDEX PXML_Data
ON [dbo].[XmlFiles] (Data)

There are a few things to bear in mind, though. SQL Server's XML schema collections don't support xsd:include. This means that if you have a schema which references other schemas, you'll have to merge them all into a single schema and add that.

Also, I would get an error:

XQuery [dbo.XmlFiles.Data.value()]: Cannot implicitly atomize or apply 'fn:data()' to complex content elements, found type 'xs:anyType' within inferred type 'element({http://www.mynamespace.fake/schemas}:SequenceNumber,xs:anyType) ?'.

if I tried to navigate above the node I had selected with the nodes function. E.g.

SELECT
     C.value('CVElementId[1]', 'INT') AS [CVElementId]
    ,C.value('../SequenceNumber[1]', 'INT') AS [Level]
FROM 
    [dbo].[XmlFiles]
CROSS APPLY
    [Data].nodes('/CVSet/Level/CVElement') AS T(C)

I found that the best way to handle this was to use OUTER APPLY to, in effect, perform an "outer join" on the XML.

SELECT
     C.value('CVElementId[1]', 'INT') AS [CVElementId]
    ,B.value('SequenceNumber[1]', 'INT') AS [Level]
FROM 
    [dbo].[XmlFiles]
CROSS APPLY
    [Data].nodes('/CVSet/Level') AS T(B)
OUTER APPLY
    B.nodes ('CVElement') AS S(C)

Hope that helps someone, as that's pretty much been my day.

Solution 2:

In my case I'm running SQL Server 2005 SP2 (9.0).

The only thing that helped was adding OPTION ( OPTIMIZE FOR ( @your_xml_var = NULL ) ). The explanation is at the link below.

Example:

INSERT INTO @tbl (Tbl_ID, Name, Value, ParamData)
SELECT     1,
    tbl.cols.value('name[1]', 'nvarchar(255)'),
    tbl.cols.value('value[1]', 'nvarchar(255)'),
    tbl.cols.query('./paramdata[1]')
FROM @xml.nodes('//root') as tbl(cols) OPTION ( OPTIMIZE FOR ( @xml = NULL ) )

https://connect.microsoft.com/SQLServer/feedback/details/562092/an-insert-statement-using-xml-nodes-is-very-very-very-slow-in-sql2008-sp1

Solution 3:

I'm not sure what the best method is. I used the OPENXML construction:

DECLARE @XmlDocument int
EXEC sp_xml_preparedocument @XmlDocument OUTPUT, @xml -- get a document handle

INSERT INTO Test
SELECT Id, Data
FROM OPENXML (@XmlDocument, '/Root/blah', 2)
WITH (Id   int         '@ID',
      Data varchar(10) '@DATA')

EXEC sp_xml_removedocument @XmlDocument -- free the parsed document

To speed it up, you can create XML indexes. In particular, a secondary XML index FOR VALUE optimizes value() lookups. You can also use typed XML columns, which perform better.
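A sketch of those secondary indexes, assuming the [dbo].[XmlFiles] table and primary XML index PXML_Data from Solution 1 (index names here are hypothetical; a primary XML index must exist before any secondary one can be created):

```sql
-- Secondary XML index tuned for value() predicates
CREATE XML INDEX IXML_Data_Value
ON [dbo].[XmlFiles] (Data)
USING XML INDEX PXML_Data FOR VALUE;

-- Secondary XML index tuned for path navigation / exist()
CREATE XML INDEX IXML_Data_Path
ON [dbo].[XmlFiles] (Data)
USING XML INDEX PXML_Data FOR PATH;
```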

Solution 4:

We had a similar issue here. Our DBA (SP, you the man) took a look at my code, made a little tweak to the syntax, and we got the speed we had been expecting. It was unusual because my select from XML was plenty fast, but the insert was way slow. Note that text() only works on element content (attributes have no text nodes), so with element-centric XML try this syntax instead:

INSERT INTO some_table (column1, column2, column3)
    SELECT 
        Rows.n.value(N'(column1/text())[1]', 'varchar(20)'), 
        Rows.n.value(N'(column2/text())[1]', 'nvarchar(100)'), 
        Rows.n.value(N'(column3/text())[1]', 'int')
    FROM @xml.nodes('//Rows') Rows(n) 

So specifying the text() step really seems to make a difference in performance. It took our insert of 2K rows from 'I must have written that wrong, let me stop it' to about 3 seconds, which was 2x faster than the raw insert statements we had been running through the connection.
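For completeness, the text() pattern assumes element-centric XML. A minimal, hypothetical round-trip showing the shape of input it shreds (variable name and values are made up):

```sql
-- Hypothetical element-centric input for the text() pattern
DECLARE @xml xml = N'
<Rows>
  <column1>abc</column1>
  <column2>some description</column2>
  <column3>42</column3>
</Rows>';

SELECT
    Rows.n.value(N'(column1/text())[1]', 'varchar(20)'),
    Rows.n.value(N'(column2/text())[1]', 'nvarchar(100)'),
    Rows.n.value(N'(column3/text())[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n);
```

The text() step helps because it lets the engine skip the general-purpose atomization logic and read the text node directly.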