Cannot create a crontab job for my Scrapy program
I have written a small Python scraper (using the Scrapy framework). The scraper requires a headless browser; I am using ChromeDriver.
As I am running this code on an Ubuntu server that has no GUI, I had to install Xvfb in order to run ChromeDriver on the server (I followed this guide).
This is my code:
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        # self.driver = webdriver.Chrome(ChromeDriverManager().install())
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome('/usr/bin/chromedriver',
                                       chrome_options=chrome_options)
I can run the above code from the Ubuntu shell and it executes without any errors:
ubuntu@ip-1-2-3-4:~/scrapers/my_scraper$ scrapy crawl my_spider
Now I want to set up a cron job to run the above command every day:
# m h dom mon dow command
PATH=/usr/local/bin:/home/ubuntu/.local/bin/
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1
but the crontab job gives me the following error:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 192, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 196, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 86, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 98, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 19, in from_crawler
spider = cls(*args, **kwargs)
File "/home/ubuntu/scrapers/my_scraper/my_scraper/spiders/spider.py", line 27, in __init__
self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)
Update
This answer helped me solve the issue (but I don't quite understand why).
I ran echo $PATH in my Ubuntu shell and copied the value into the crontab:
PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1
Note: As I have created a bounty for this question, I am happy to award it to any answer which explains why changing the PATH solved the issue.
This is the reason for almost all cases where cron doesn't seem to run.
Cron always runs with a mostly empty environment. HOME, LOGNAME, and SHELL are set, plus a very limited PATH. It is therefore advisable to use complete paths to executables, and to export any variables you need in your script, when using cron.
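Following that advice, one option is a small wrapper script that cron invokes by absolute path. This is only a sketch: the script location /home/ubuntu/run_scraper.sh and the scrapy path /home/ubuntu/.local/bin/scrapy are assumptions for illustration (check yours with which scrapy).

#!/bin/sh
# /home/ubuntu/run_scraper.sh -- hypothetical wrapper invoked by cron.
# Export the PATH the job needs instead of relying on cron's minimal one.
export PATH=/home/ubuntu/.local/bin:/usr/local/bin:/usr/bin:/bin
cd /home/ubuntu/scrapers/my_scraper || exit 1
# Call scrapy by absolute path so the job works even if PATH is still wrong.
exec /home/ubuntu/.local/bin/scrapy crawl my_spider

with a crontab entry such as:

05 12 * * * /bin/sh /home/ubuntu/run_scraper.sh >> /tmp/scraper.log 2>&1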
You can also:
- Use the environment variables you use on your shell.
- Simulate the cron environment: temporarily add this line to your crontab and wait a minute for cron to save its environment to ~/cronenv (you can remove the line afterwards):
  * * * * * env > ~/cronenv
  Then test running a shell (by default, SHELL=/bin/sh) with exactly that environment (see the consolidated sketch after this list):
  env - $(cat ~/cronenv) /bin/sh
- Force the crontab to run.
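Put together, the capture-and-replay loop from the second item looks roughly like this (a sketch; ~/cronenv is just the temporary file name used above):

# 1. Install the temporary capture entry and wait a minute or two:
#      * * * * * env > ~/cronenv
crontab -e
# 2. Once ~/cronenv exists, remove that entry again, then replay the
#    captured environment in a fresh shell:
env - $(cat ~/cronenv) /bin/sh
# 3. Inside that shell, run the failing command; it should now fail
#    exactly the way it fails under cron:
cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider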
Also, you can't use variable substitution as in a shell, so a declaration like PATH=/usr/local/bin:$PATH is interpreted literally.
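For example, this hypothetical crontab fragment shows the difference:

# WRONG: cron does not expand variables, so PATH would literally end
# with the five characters "$PATH"
PATH=/usr/local/bin:$PATH
# RIGHT: spell out the full value instead
PATH=/usr/local/bin:/usr/bin:/bin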
The commands readlink, dirname, and cat could not be located because neither /bin nor /usr/bin is included in the PATH environment variable.
Explanation
unknown error: Chrome failed to start: exited abnormally The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.
Try setting PATH=/usr/local/bin:/home/ubuntu/.local/bin/ and executing
/usr/bin/google-chrome --no-sandbox --headless --disable-dev-shm-usage
and you'll get:
/usr/bin/google-chrome: line 8: readlink: command not found
/usr/bin/google-chrome: line 10: dirname: command not found
/usr/bin/google-chrome: line 45: exec: cat: not found
/usr/bin/google-chrome: line 46: exec: cat: not found
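That happens because /usr/bin/google-chrome is not the browser binary itself but a shell wrapper script that relies on those utilities, as the "line 8"/"line 10" messages above show. You can confirm this yourself (an optional sanity check, not part of the fix):

# Show that google-chrome resolves to a shell script, not a compiled binary
file -L /usr/bin/google-chrome
# Peek at the first lines of the wrapper, where readlink and dirname are called
head -n 15 /usr/bin/google-chrome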
You can also try this one: it makes cron run the command through a login shell for user ubuntu (note that su works without a password only from root's crontab).
05 12 * * * su - ubuntu -c 'cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1'
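If you'd rather keep everything in the ubuntu user's own crontab, an alternative sketch is to request a login shell with bash -l, so the user's profile scripts run first; this assumes your PATH (including ~/.local/bin) is set in ~/.profile:

05 12 * * * bash -lc 'cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider' >> /tmp/scraper.log 2>&1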