Concatenating scraped data
I'm scraping data from a website and I managed to get the following kind of output:
['Datas',
'1999',
'2000',
'2001',
'Receita',
'líquida',
'29592',
'49782',
'57511',
'Custos',
'-18937',
'-28938',
'-34855',
'IR',
'e',
'CSSL',
'-486',
'-4361',
'-3875',
[...]]
I need to concatenate the text items so I can use them as column titles later in Pandas, so I wrote the following loop with nested if statements as a test:
array = ['FIRST',1,2,3,'PALAVRA',8,3,"FRASE","SEGUIDA",3,"CUSTO","DE","OPERAÇÃO",5]
arrayb = str(array).replace(",","\n").replace(" ","").replace("'","").replace("[","").replace("]","").replace("'","")
arrayc = arrayb.splitlines()
last_key = None
fszsr = {}
for i in arrayc:
    if i.isalpha():
        idx=arrayc.index(i)
        idxp1=idx+1
        idxp1b=arrayc[idxp1]
        if idxp1b.isalpha():
            idxp2=idxp1+1
            idxp2b=arrayc[idxp2]
            arrayc.remove(idxp1b)
            if idxp2b.isalpha():
                last_key=i+idxp1b+idxp2b
                arrayc.remove(idxp2b)
                fszsr[last_key] = [i+idxp1b+idxp2b]
            else:
                last_key=i+idxp1b
                fszsr[last_key] = [i+idxp1b]
        else:
            last_key=i
            fszsr[last_key] = [i]
    else:
        last_key=i
        fszsr[last_key].append(i)
fszsr
But the output just shows this:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19444/1702247249.py in <module>
26 else:
27 last_key=i
---> 28 fszsr[last_key].append(i)
29 fszsr
KeyError: '1'
I can't figure out what I'm doing wrong. I tried converting the list to str, but it still doesn't work. If I replace the append with an assignment to fszsr[last_key], the output looks like this:
{'FIRST': ['FIRST'],
'1': ['1'],
'2': ['2'],
'3': ['3'],
'PALAVRA': ['PALAVRA'],
'8': ['8'],
'FRASESEGUIDA': ['FRASESEGUIDA'],
'CUSTODEOPERAÇÃO': ['CUSTODEOPERAÇÃO'],
'5': ['5']}
Also, this was my way of handling the scraped data; maybe there's a better way to work with the raw data? This is the first output I get from scraping the website:
'1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021\nReceita líquida\n29.592 49.782 57.511 69.176 95.742 108.201 136.605 158.238 170.577 215.118 182.710 213.273 244.176 281.379 304.889 337.259 321.638 282.589 283.695 349.836 302.245 272.069 393.450\nCustos\n-18.937 -28.938 -34.855 -44.205 -52.893 -63.100 -77.107 -94.665 -104.398 -141.623 -109.037 -136.051 -166.939 -210.472 -233.725 -256.335 -223.062 -192.611 -192.100 -225.293 -180.140 -148.107 -192.500\nResultado bruto\n10.655 20.844 22.656 24.971 42.849 45.101 59.498 63.573 66.179 73.495 73.673 77.222 77.237 70.907 71.164 80.924 98.576 89.978 91.595 124.543 122.105 123.962 200.950\nMargem bruta\n36% 42% 39% 36% 45% 42% 44% 40% 39% 34% 40% 36% 32% 25% 23% 24% 31% 32% 32% 36% 40% 46% 51%\nDespesas oper.\n-7.975 -6.815 -9.777 -14.236 -14.082 -15.209 -19.728 -21.625 -29.854 -24.591 -28.116 -31.647 -33.008 -39.431 -36.807 -102.841 -111.764 -73.496 -53.822 -59.667 -40.404 -74.341 19.601\nRes. operacional\n2.680 14.029 12.879 10.735 28.767 29.892 39.770 41.948 36.325 48.904 45.557 45.575 44.229 31.476 34.357 -21.917 -13.188 16.482 37.773 64.876 81.701 49.621 220.551\nMargem Oper.\n9% 28% 22% 16% 30% 28% 29% 27% 21% 23% 25% 21% 18% 11% 11% -6% -4% 6% 13% 19% 27% 18% 56%\nRes. Financeiro\n-399 309 1.032 1.166 -1.377 -3.171 -3.213 -1.341 -785 -698 -2.349 2.562 122 -3.722 -6.202 -3.899 -28.041 -27.185 -31.599 -21.100 -34.459 -49.584 -38.640\nIR e CSSL\n-486 -4.361 -3.875 -4.008 -7.815 -7.249 -10.802 -11.896 -11.272 -15.961 -9.977 -12.235 -11.241 -6.794 -5.147 3.892 6.058 -2.342 -5.797 -17.078 -16.400 6.209 -45.918\nOp. descontin.\n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10.128 0 0\nLucro líquido\n1.795 9.977 10.036 7.893 19.575 19.472 25.755 28.711 24.268 32.245 33.231 35.902 33.110 20.960 23.008 -21.924 -35.171 -13.045 377 26.698 40.970 6.246 135.993\nMargem líquida\n6% 20% 17% 11% 20% 18% 19% 18% 14% 15% 18% 17% 14% 7% 8% -7% -11% -5% 0% 8% 14% 2% 35%'
Thanks in advance!
Solution 1:
Iterate over the elements and check whether each one is a digit string. If it is not, use it to build a key; if it is, append it to the list under the most recent key. collections.defaultdict simplifies the code, though it is not strictly necessary.
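from collections import defaultdict

# `array` is the list shown under "Input" below
out = defaultdict(list)
key = None
reset_key = True  # True when the next text item starts a new key
for item in array:
    if str(item).isdigit():
        # number: store it under the current key
        out[key].append(item)
        reset_key = True
    elif reset_key:
        # first word after a run of numbers: start a new key
        key = item
        reset_key = False
    else:
        # further word: extend the current key
        key += f' {item}'
dict(out)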
Output:
{'FIRST': [1, 2, 3],
'PALAVRA': [8, 3],
'FRASE SEGUIDA': [3],
'CUSTO DE OPERAÇÃO': [5]}
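One caveat when moving from the test array to the scraped list in the question: its values are strings, and some are negative ('-18937'). str.isdigit() returns False for those, so they would be appended to the key instead of the values. Below is a minimal sketch of the same idea that accepts negatives and uses a plain dict instead of defaultdict. The names is_number and group_by_label are hypothetical helpers introduced here for illustration, and the sample list is the visible part of the question's output.

```python
def is_number(token):
    """Return True if token parses as an int, including negatives like '-486'."""
    try:
        int(token)
        return True
    except (TypeError, ValueError):
        return False


def group_by_label(tokens):
    """Join consecutive text tokens into one key; collect the numbers that follow."""
    out = {}
    label_parts = []  # words of the label currently being built
    key = None
    for token in tokens:
        if is_number(token):
            if label_parts:
                # the first number after a run of words closes the label
                key = ' '.join(label_parts)
                label_parts = []
            out.setdefault(key, []).append(int(token))
        else:
            label_parts.append(str(token))
    return out


# Visible part of the scraped list from the question
scraped = ['Datas', '1999', '2000', '2001',
           'Receita', 'líquida', '29592', '49782', '57511',
           'Custos', '-18937', '-28938', '-34855',
           'IR', 'e', 'CSSL', '-486', '-4361', '-3875']
result = group_by_label(scraped)
```

Because every label here ends up with the same number of values, result should be usable directly as the data argument of pandas.DataFrame, giving one column per label.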
Input:
array = ['FIRST',1,2,3,'PALAVRA',8,3,"FRASE","SEGUIDA",3,"CUSTO","DE","OPERAÇÃO",5]
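As for the raw string, it may be easier to parse it directly, since it already keeps each label on its own line. The sketch below assumes the layout seen in the question: the first line holds the years, and after that the lines alternate between a label and its values. It also assumes that '.' is a thousands separator (the question's list shows '29592' for '29.592') and keeps percentages as whole percent points. parse_raw and to_number are hypothetical names, and the sample is a shortened copy of the question's data.

```python
def to_number(text):
    """Convert '29.592' -> 29592, '-486' -> -486, '36%' -> 36 (percent points)."""
    if text.endswith('%'):
        return int(text[:-1])
    # assumption: '.' is a thousands separator, not a decimal point
    return int(text.replace('.', ''))


def parse_raw(raw):
    """Split the raw text into the list of years and a {label: values} dict."""
    lines = raw.split('\n')
    years = lines[0].split()
    table = {}
    # after the header, lines alternate: label, values, label, values, ...
    for label, values in zip(lines[1::2], lines[2::2]):
        table[label.strip()] = [to_number(v) for v in values.split()]
    return years, table


# Shortened sample: first three years of four rows from the question
raw = ('1999 2000 2001\n'
       'Receita líquida\n29.592 49.782 57.511\n'
       'Custos\n-18.937 -28.938 -34.855\n'
       'Margem bruta\n36% 42% 39%\n'
       'IR e CSSL\n-486 -4.361 -3.875')
years, table = parse_raw(raw)
```

With pandas available, something like pd.DataFrame(table, index=years) should then give one column per label and one row per year, without having to rejoin multi-word labels.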