While loop data not appending to list outside of while loop
I am trying to scrape data, write it to a pandas Series, and then loop over the remaining pages of the website with a while loop, appending each page's results to the original Series (defined outside the while loop) after each iteration. I'm not sure why this isn't working. Here's where I'm stuck:
import requests
import pandas as pd
from bs4 import BeautifulSoup

current_url = 'https://www.yellowpages.com/search?search_terms=hvac&geo_location_terms=97080'

def get_data_run(current_url):
    company_names1 = get_company_name(current_url)
    print(company_names1) #1
    page = 1
    max_page = 3
    company_names1 = paginate(current_url, page, max_page, company_names1)
    print(company_names1) #2

def paginate(current_url, page, max_page, company_names1):
    while (page <= max_page):
        new_url = current_url + f"&page={page}"
        print(new_url)
        company_names = get_company_name(new_url)
        company_names1.append(company_names)
        print(company_names) #3
        print(company_names1) #4
        page +=1
        if page == max_page:
            return company_names1

def get_company_name(url):
    company_names = []
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    box = list(soup.findAll("div", {"class": "result"}))
    for i in range(len(box)):
        try:
            company_names.append(box[i].find("a", {"class": "business-name"}).text.strip())
        except Exception:
            company_names.append("null")
        else:
            continue
    company_names = pd.Series(company_names, dtype='string')
    return company_names

get_data_run(current_url)
I've labeled the different prints. Every print of company_names1 (#1, #2, and #4) shows the same Series of companies, even after appending company_names inside the while loop. What I can't understand is that printing company_names (#3) does show the next page of company names. Why isn't the append taking effect inside the while loop, and why isn't the combined Series returned from the function and printed at #2? Thanks!
UPDATE: Here is some sample output:
When I print #3:
(pyfinance) justinbenfit@MacBook-Pro-3 yellowpages_scrape % /usr/local/anaconda3/envs/pyfinance/bin/python /Users/justinbenfit/Desktop/yellowpages_scrape/test.py
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Mt. Hood Heating Cooling & Refrigeration
22 Chuck's Heating & Cooling
23 Mr. Furnace
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
0 Air-Trix
1 Johnstone Supply
2 Buss Heating & Cooling Inc
3 The Heat Exchange
4 Hoodview Heating & Air Conditioning
5 Loomis Heating Cooling & Refrigeration
6 All About Air Heating & Cooling
7 Hanson Heating
8 Sparks Heating & Cooling
9 Interior Comfort Systems
10 P D X Heating & Cooling
11 Apcom Power Inc
12 Area Heating Inc
13 Four Seasons Heating Air Conditioning & Servic...
14 Perfect Climate Inc
15 Combustion Consultants Inc
16 Classic Heat Source, Inc.
17 Multnomah Heating, Inc
18 Apollo Plumbing, Heating & Air Conditioning - OR
19 Art's Furnace & Air Cond
20 Kurchel Heating
21 P & O Construction Inc
22 Systems Management NW
23 Bridgetown Heating
24 Amana Heating & Air Conditioning Systems
25 QualitySmith
26 Wilbert Jr, Wilson
27 Faith Heating & Air Conditioning Inc
28 Northwest Commercial Heating & Air Conditionin...
29 Heat Master Corp
dtype: string
When I print #1, #2, and #4:
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Chuck's Heating & Cooling
22 Mr. Furnace
23 Mt. Hood Heating Cooling & Refrigeration
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
Solution 1:
The problem is that you're treating a pd.Series like a list. list.append modifies the list in place, whereas Series.append leaves the original Series untouched and returns a new one. Appending to a list works like this:
lst = [1,2,3]
lst.append(4)
print(lst)
# [1, 2, 3, 4]
The object changes without having to explicitly assign it. If you do the same with a Series, this happens:
series = pd.Series([1,2,3])
series.append(pd.Series([4]))
print(series)
The output is:
0 1
1 2
2 3
dtype: int64
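So, to update a Series, you have to assign the result back to the original name or to a new one. Without an assignment, the new Series that append returns is simply discarded. (Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; on current versions, pd.concat does the same job and also returns a new object.)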
series = pd.Series([1,2,3])
series = series.append(pd.Series([4]))
print(series)
Output:
0 1
1 2
2 3
0 4
dtype: int64
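Since Series.append no longer exists in pandas 2.0 and later, here is a minimal sketch of the same idea using pd.concat. The variable names extra, combined, and renumbered are only illustrative. It also shows ignore_index=True, which renumbers the result instead of repeating the index label 0 as in the output above.

```python
import pandas as pd

series = pd.Series([1, 2, 3])
extra = pd.Series([4])

# pd.concat returns a new Series; the inputs are left unchanged.
combined = pd.concat([series, extra])

# ignore_index=True relabels the result 0..n-1 instead of keeping
# each input's own index labels.
renumbered = pd.concat([series, extra], ignore_index=True)

print(len(series))              # 3 (original is untouched)
print(combined.index.tolist())  # [0, 1, 2, 0]
print(renumbered.tolist())      # [1, 2, 3, 4]
```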
In your case, the problem lies in the paginate function. You should change this line:
company_names1.append(company_names)
to:
company_names1 = company_names1.append(company_names)
With that change, the appended pages are kept and the combined Series is returned to get_data_run.
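As an alternative, here is a rough sketch of paginate that collects each page's Series in a plain list and concatenates once at the end, which avoids rebuilding the Series on every iteration and works on pandas versions without Series.append. The fetch parameter and the fake_fetch helper are hypothetical stand-ins for get_company_name so the example runs without a network connection, and the example.com URL is a placeholder. The sketch also loops through max_page inclusive; judging by the sample output, which shows only two #3 prints, the original loop appears to return before the last page is requested.

```python
import pandas as pd


def paginate(current_url, first_page, max_page, fetch):
    """Collect one Series per page, then concatenate once at the end.

    `fetch` is a hypothetical stand-in for get_company_name(url).
    """
    pages = []
    for page in range(first_page, max_page + 1):
        new_url = f"{current_url}&page={page}"
        pages.append(fetch(new_url))  # list.append mutates the list in place
    # pd.concat raises ValueError on an empty list, so this assumes
    # at least one page was requested.
    return pd.concat(pages, ignore_index=True)


def fake_fetch(url):
    # Stand-in for the real scraper so the sketch runs offline.
    page = url.rsplit("=", 1)[-1]
    return pd.Series([f"Company {page}a", f"Company {page}b"], dtype="string")


names = paginate("https://example.com/search?q=hvac", 1, 3, fake_fetch)
print(names.tolist())
```

In the real script you would pass get_company_name as fetch. If get_data_run has already scraped the first page, start from page 2 and include that first Series in the final concatenation.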