Pandas read_xml and SEPA (CAMT 053) XML

Recently I wanted to try the newly implemented xml_read function within pandas. I thought about testing the feature with SEPA camt-format xml. I'm stuck with the functions parameters, as I'm unfamiliar with the lxml logic. I tried pointing to the transactions values as rows ("Ntry" tag), as I thought this will then loop through those rows and creates the dataframe. Setting xpath to default returns an empty dataframe with the columns "GrpHdr" and "Rpt", but the relevant data is one level below "Rpt". Setting xpath='//*' creates a huge dataframe with every tag as column and values randomly sorted. If anyone is familiar with using the pandas xml_read and nested xmls, I'd appreciate any hints. The xml file looks like this (fake values):

<Document>
<BkToCstmrAcctRpt>
<GrpHdr>
<MsgId>Account</MsgId>
<CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
<MsgRcpt>
<Nm> Name</Nm>
</MsgRcpt>
</GrpHdr>
<Rpt>
<Id>Account ID</Id>
<CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
<Acct>
<Id>
<IBAN>DEXXXXX</IBAN>
</Id>
</Acct>
<Bal>
<Tp>
<CdOrPrtry>
</CdOrPrtry>
</Tp>
<Amt Ccy="EUR">161651651651</Amt>
<CdtDbtInd>CRDT</CdtDbtInd>
<Dt>
<DtTm>2021-08-05T14:20:23.077+02:00</DtTm>
</Dt>
</Bal>
<Ntry>
<Amt Ccy="EUR">11465165</Amt>
<CdtDbtInd>CRDT</CdtDbtInd>
<Sts>BOOK</Sts>
<BookgDt>
<Dt>2021-08-02</Dt>
</BookgDt>
<ValDt>
<Dt>2021-08-02</Dt>
</ValDt>
<BkTxCd>
<Domn>
<Cd>PMNT</Cd>
<Fmly>
<Cd>RCDT</Cd>
<SubFmlyCd>ESCT</SubFmlyCd>
</Fmly>
</Domn>
<Prtry>
<Cd>NTRF+65454</Cd>
<Issr>DFE</Issr>
</Prtry>
</BkTxCd>
<NtryDtls>
<TxDtls>
<Amt Ccy="EUR">4945141.0</Amt>
<CdtDbtInd>CRDT</CdtDbtInd>
<BkTxCd>
<Domn>
<Cd>PMNT</Cd>
<Fmly>
<Cd>RCDT</Cd>
<SubFmlyCd>ESCT</SubFmlyCd>
</Fmly>
</Domn>
<Prtry>
<Cd>NTRF+55155</Cd>
<Issr>DFEsds</Issr>
</Prtry>
</BkTxCd>
<RltdPties>
<Dbtr>
<Nm>Name</Nm>
</Dbtr>
<Cdtr>
<Nm>Name</Nm>
</Cdtr>
</RltdPties>
<RmtInf>
<Ustrd>Referenz NOTPROVIDED</Ustrd>
<Ustrd> Buchug</Ustrd>
</RmtInf>
</TxDtls>
</NtryDtls>
</Ntry>
</Rpt>
</BkToCstmrAcctRpt>
</Document>

Solution 1:

The bank statement is not a shallow xml, thus not very suitable for pandas.read_xml (as indicated in the documentation).

Instead I suggest to use sepa library.

from sepa import parser
import re
import pandas as pd

# Utility function to remove additional namespaces from the XML
def strip_namespace(xml):
    return re.sub(' xmlns="[^"]+"', '', xml, count=1)

# Read file
with open('example.xml', 'r') as f:
    input_data = f.read()

# Parse the bank statement XML to dictionary
camt_dict = parser.parse_string(parser.bank_to_customer_statement, bytes(strip_namespace(input_data), 'utf8'))

statements = pd.DataFrame.from_dict(camt_dict['statements'])
for i,_ in d.iterrows():
    if 'entries' in camt_dict['statements'][i]:
        dd = pd.DataFrame.from_records(camt_dict['statements'][i]['entries'] )