What is the difference between web-crawling and web-scraping? [duplicate]
Crawling is essentially what Google, Yahoo, MSN, etc. do: looking for ANY information. Scraping is generally targeted at certain websites, for specific data (e.g. for price comparison), so the two are coded quite differently.
Usually a scraper will be bespoke to the websites it is supposed to be scraping, and will do things a (good) crawler wouldn't do, e.g.:
- Have no regard for robots.txt
- Identify itself as a browser
- Submit forms with data
- Execute JavaScript (if required to act like a user)
Yes, they are different. In practice, you may need to use both.
(I have to jump in because, so far, the other answers don't get to the essence of it. They use examples but don't make the distinctions clear. Granted, they are from 2010!)
Web scraping, to use a minimal definition, is the process of parsing a web document and extracting information from it. You can do web scraping without doing web crawling.
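To make that definition concrete, here is a minimal sketch of scraping with no crawling involved: it parses one document that is already in hand and pulls specific values out of it. The sample HTML, the `price` class name, and the `PriceScraper` / `scrape_prices` names are all made up for illustration; a real page would need its own selectors.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collect the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current price element
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical document; in practice this would be a page you already fetched
SAMPLE_HTML = """
<ul>
  <li>Widget <span class="price">9.99</span></li>
  <li>Gadget <span class="price">24.50</span></li>
</ul>
"""


def scrape_prices(html):
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


print(scrape_prices(SAMPLE_HTML))
```

Note that nothing here follows a link: the input is a single document, which is why this counts as scraping only.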
Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling, you have to do some degree of web scraping (to extract the URLs).
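A rough sketch of that loop, under some assumptions: the `fetch` callable is injected so the example runs without network access, and `FAKE_WEB` is a made-up three-page site standing in for real HTTP requests. The link extraction step is the "some degree of web scraping" mentioned above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """The scraping part of a crawler: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order.

    `fetch` is any callable mapping a URL to its HTML, or None on failure.
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched; skip it
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" used instead of real HTTP requests
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

print(crawl(["http://example.test/"], FAKE_WEB.get))
```

In a real crawler `fetch` would issue an HTTP request (and should respect robots.txt and rate limits); the `seen` set is what keeps the loop from revisiting pages that link back to each other.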
To clear up some concepts mentioned in the other answers:
- robots.txt is intended to apply to any automated process that accesses a web page. So it applies to both crawlers and scrapers.
- 'Proper' crawlers and scrapers, both, should identify themselves accurately.
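As an illustration of honoring robots.txt, the following sketch uses Python's standard `urllib.robotparser` to check a URL before fetching it. The rules in `ROBOTS_TXT`, the `ExampleBot/1.0` user agent, and the `is_allowed` helper are assumptions made up for the example; a real client would download the site's actual robots.txt and pass its own honest user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from the site's /robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The same check applies whether the caller is a crawler or a scraper
print(is_allowed(ROBOTS_TXT, "ExampleBot/1.0", "http://example.test/public/page"))
print(is_allowed(ROBOTS_TXT, "ExampleBot/1.0", "http://example.test/private/page"))
```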
Some references:
- Wikipedia on web scraping
- Wikipedia on web crawlers
- Wikipedia on robots.txt
AFAIK, web crawling is what Google does: it goes around a website following links and building a database of the layout of that site and the sites it links to.
Web scraping would be the programmatic analysis of a web page to load some data off of it, e.g. loading up BBC Weather, ripping (scraping) the weather forecast off of it, and placing it elsewhere or using it in another program.