Search in html source with GOOGLE? [closed]
I've come across the following resources on my travels (some already mentioned above):
HTML Mark-up-focused search engines
- Nerdydata
I'd also like to throw in the following:
Huge, website crawl data archives
- Common Crawl - 'years of free web page data to help change the world' (250TB+)
How can we analyze this crawl data?
For an idea of how to begin analyzing some of this massive data, take a look at Big Data/MapReduce-type framework(s).
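To give a feel for the map/reduce style, here is a minimal single-machine sketch in plain Python (no framework involved). The sample pages and the names `TagCounter`, `map_page` and `reduce_counts` are made up for illustration; a real job would read crawl records instead and run on a cluster.

```python
from collections import Counter
from functools import reduce
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the start tags seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def map_page(html):
    """Map step: one page in, one Counter of tag names out."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def reduce_counts(left, right):
    """Reduce step: merge two partial Counters into one."""
    return left + right


# Hypothetical sample pages standing in for crawl records.
pages = [
    "<html><body><p>one</p><p>two</p></body></html>",
    "<html><body><script src='a.js'></script><p>three</p></body></html>",
]

totals = reduce(reduce_counts, map(map_page, pages), Counter())
print(totals.most_common(3))
```

Because each page is mapped independently and the partial counts merge in any order, the same two functions could in principle be handed to a distributed framework with little change.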
Google lists some ideas on using Apache's Spark project to analyze Common Crawl's dump(s). To understand the file format(s) used by Common Crawl, refer to the following:
- So you’re ready to get started [with Common Crawl]
- Navigating the WARC file format [by Common Crawl]
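As a rough illustration of what a WARC file looks like once you have one, here is a small sketch that walks the records of an uncompressed WARC stream using only the standard library. It assumes well-formed records (a `WARC/` version line, header lines, a blank line, then `Content-Length` bytes of content) and is not a replacement for a proper WARC library. The helper names `iter_warc_records` and `make_record` and the sample data are hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The record content is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(uri, body):
    """Build a tiny, made-up WARC record for demonstration purposes."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = make_record("http://example.com/", b"<html>hello</html>") + make_record(
    "http://example.org/", b"<p>hi</p>"
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

In real crawl files the content of a `response` record is the full HTTP response (status line, HTTP headers, then the HTML), and the files are distributed gzip-compressed, so you would open them with something like `gzip.open(path, "rb")` before passing the stream in.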
The article, Accessing-Common-Crawl-Dataset-on-S3, outlines how to access Common Crawl's 250TB+ dump(s) at low cost, without transferring that data load outside of Amazon's AWS/S3 network. Of course, that assumes you are going to use some combination of AWS/EC2/S3 etc. to analyze the crawl data.
Finally, Patrick Durusau maintains some interesting Common-Crawl-usage-related blog pages.
Personally, I find this subject intriguing. I suggest we get this crawl data while it's HOT! ;-)
You can try PublicWWW to search in source/mark-up. It lets you find any HTML, JavaScript, CSS and plain text in the web page source code of 167+ million websites.
With PublicWWW you can:
- Find related websites through the unique HTML codes they share, i.e. widgets & publisher IDs.
- Identify sites using certain images or badges.
- Find out who else is using your theme.
- Identify sites mentioning you.
- Find your competitor's affiliates.
- Identify sites where your competitors personally collaborate or interact.
- Find references to the use of a library or a platform.
- Find code examples on the net.
- Figure out who is using what JS widgets on their sites.
- ...
Of course, you are not limited to your own websites: you can find any site that uses a given code/mark-up snippet.
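If you already have page sources on hand (for example, extracted from a crawl archive), the same kind of lookup can be approximated locally. The sketch below is not how PublicWWW works internally; it is just a plain substring and regex scan over a dictionary of URL to source text. The sample pages, the `pub-NNNN` ID format and the function names are all made up for illustration.

```python
import re


def find_sites_with_snippet(pages, snippet):
    """Return the URLs whose raw source contains `snippet` (case-insensitive)."""
    needle = snippet.lower()
    return sorted(url for url, source in pages.items() if needle in source.lower())


def group_by_shared_id(pages, pattern):
    """Group URLs by the first ID that `pattern` matches in their source."""
    groups = {}
    for url, source in pages.items():
        match = re.search(pattern, source)
        if match:
            groups.setdefault(match.group(0), []).append(url)
    return groups


# Hypothetical page sources keyed by URL.
pages = {
    "http://a.example/": '<script data-ad-client="pub-1111"></script>',
    "http://b.example/": '<script data-ad-client="pub-1111"></script><img src="badge.png">',
    "http://c.example/": '<script data-ad-client="pub-2222"></script>',
}

badge_sites = find_sites_with_snippet(pages, "badge.png")
related = group_by_shared_id(pages, r"pub-\d+")
print(badge_sites)
print(related)
```

Sites that end up in the same group share an ID, which is the "related websites" idea from the list above. A linear scan like this only makes sense for small collections; at web scale you would want an index.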