Get a list of URLs from a site [closed]
I didn't mean to answer my own question, but I just thought of running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice plain-text output. Perfect for my needs.
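Run wget -r -l0 www.oldsite.com to download the site recursively (-l0 means no depth limit).
Then just run find www.oldsite.com on the directory wget created.
That would reveal all the URLs, I believe.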
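Note that find prints file paths, not URLs. Here is a rough sketch of turning the mirrored tree into a URL list. It assumes wget's default layout (a top-level directory named after the host) and an http scheme. The function name mirror_to_urls is made up for illustration.

```python
from pathlib import Path


def mirror_to_urls(mirror_root, scheme="http"):
    """Return one URL per file found under a `wget -r` mirror directory.

    Assumption: the directory name is the host name, which is how wget
    lays out a recursive download unless told otherwise.
    """
    root = Path(mirror_root)
    urls = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # The path relative to the host directory becomes the URL path.
            rel = path.relative_to(root).as_posix()
            urls.append(f"{scheme}://{root.name}/{rel}")
    return urls


if __name__ == "__main__":
    for url in mirror_to_urls("www.oldsite.com"):
        print(url)
```

Files such as index.html usually correspond to the directory URL itself, so the result may need a little cleanup by hand.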
Alternatively, just serve that custom not-found page on every 404 request! That is, anyone who follows a wrong link gets a page saying the page wasn't found, along with some hints about the site's content.
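One way to produce such hints is to suggest existing pages whose paths look similar to the one requested. This is only a sketch of that idea, not something the answer above prescribes. The function name suggest_pages and the sample paths are hypothetical, and it relies on difflib from the standard library.

```python
from difflib import get_close_matches


def suggest_pages(requested_path, known_paths, limit=3):
    """Return up to `limit` known paths that resemble the requested one."""
    # cutoff=0.6 is difflib's default similarity threshold, stated explicitly.
    return get_close_matches(requested_path, known_paths, n=limit, cutoff=0.6)


if __name__ == "__main__":
    # Hypothetical list of pages that exist on the new site.
    pages = ["/about", "/contact", "/products"]
    print(suggest_pages("/abuot", pages))
```

The not-found page could list these suggestions next to a link to the home page.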
Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
Web Sitemap Generators
The following are links to tools that generate or maintain files in the XML Sitemaps format, an open standard defined on sitemaps.org and supported by the search engines such as Ask, Google, Microsoft Live Search and Yahoo!. Sitemap files generally contain a collection of URLs on a website along with some meta-data for these URLs. The following tools generally generate "web-type" XML Sitemap and URL-list files (some may also support other formats).
Please Note: Google has not tested or verified the features or security of the third party software listed on this site. Please direct any questions regarding the software to the software's author. We hope you enjoy these tools!
Server-side Programs
- Enarion phpSitemapsNG (PHP)
- Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
- Outil en PHP (French, PHP)
- Perl Sitemap Generator (Perl)
- Python Sitemap Generator (Python)
- Simple Sitemaps (PHP)
- SiteMap XML Dynamic Sitemap Generator (PHP) $
- Sitemap generator for OS/2 (REXX-script)
- XML Sitemap Generator (PHP) $
CMS and Other Plugins:
- ASP.NET - Sitemaps.Net
- DotClear (Spanish)
- DotClear (2)
- Drupal
- ECommerce Templates (PHP) $
- Ecommerce Templates (PHP or ASP) $
- LifeType
- MediaWiki Sitemap generator
- mnoGoSearch
- OS Commerce
- phpWebSite
- Plone
- RapidWeaver
- Textpattern
- vBulletin
- Wikka Wiki (PHP)
- WordPress
Downloadable Tools
- GSiteCrawler (Windows)
- GWebCrawler & Sitemap Creator (Windows)
- G-Mapper (Windows)
- Inspyder Sitemap Creator (Windows) $
- IntelliMapper (Windows) $
- Microsys A1 Sitemap Generator (Windows) $
- Rage Google Sitemap Automator $ (OS-X)
- Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
- Site Map Pro (Windows) $
- Sitemap Writer (Windows) $
- Sitemap Generator by DevIntelligence (Windows)
- Sorrowmans Sitemap Tools (Windows)
- TheSiteMapper (Windows) $
- Vigos Gsitemap (Windows)
- Visual SEO Studio (Windows)
- WebDesignPros Sitemap Generator (Java Webstart Application)
- Weblight (Windows/Mac) $
- WonderWebWare Sitemap Generator (Windows)
Online Generators/Services
- AuditMyPc.com Sitemap Generator
- AutoMapIt
- Autositemap $
- Enarion phpSitemapsNG
- Free Sitemap Generator
- Neuroticweb.com Sitemap Generator
- ROR Sitemap Generator
- ScriptSocket Sitemap Generator
- SeoUtility Sitemap Generator (Italian)
- SitemapDoc
- Sitemapspal
- SitemapSubmit
- Smart-IT-Consulting Google Sitemaps XML Validator
- XML Sitemap Generator
- XML-Sitemaps Generator
CMS with integrated Sitemap generators
- Concrete5
Google News Sitemap Generators
The following plugins allow publishers to update Google News Sitemap files, a variant of the sitemaps.org protocol that we describe in our Help Center. In addition to the normal properties of Sitemap files, Google News Sitemaps allow publishers to describe the types of content they publish, along with specifying levels of access for individual articles. More information about Google News can be found in our Help Center and Help Forums.
- WordPress Google News plugin
Code Snippets / Libraries
- ASP script
- Emacs Lisp script
- Java library
- Perl script
- PHP class
- PHP generator script
If you believe that a tool should be added or removed for a legitimate reason, please leave a comment in the Webmaster Help Forum.
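Whichever generator you pick, its XML Sitemap output can be reduced to a plain URL list with a few lines of code. A minimal sketch, assuming the file follows the sitemaps.org 0.9 schema with url and loc elements. The function name urls_from_sitemap and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> inside a <url> entry."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.oldsite.com/</loc></url>"
        "<url><loc>http://www.oldsite.com/about.html</loc></url>"
        "</urlset>"
    )
    for url in urls_from_sitemap(sample):
        print(url)
```

A sitemap index file (one that points to other sitemaps) is not handled here and would need an extra pass.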
The best one I have found is http://www.auditmypc.com/xml-sitemap.asp, which uses Java, has no limit on pages, and even lets you export the results as a raw URL list.
It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.
So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.
You're presumably not in an ideal world. Why not do this...?
Create a mapping between the well-known old URLs and the new ones. Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved, its new URL is XXX, you'll be redirected shortly" message.
If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message and redirect them if you like.
Log all redirects - especially the ones with no mapping. Over time, add mappings for pages that are important.
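The three steps above could be sketched roughly like this. Everything here is an assumption made for illustration: the REDIRECTS table, the resolve function, the sample paths, and the choice of 301 for mapped pages and 302 for the home page fallback. A real site would do this in its web server or framework.

```python
import logging
from collections import Counter

logger = logging.getLogger("redirects")

# Hypothetical mapping from well-known old paths to their new locations.
REDIRECTS = {
    "/old/about.html": "/about",
    "/old/products.php": "/products",
}

# Counts requests for old paths that have no mapping yet.
unmapped_hits = Counter()


def resolve(old_path, mapping=REDIRECTS, fallback="/"):
    """Return (status_code, target) for a request to an old URL."""
    target = mapping.get(old_path)
    if target is not None:
        logger.info("redirect %s -> %s", old_path, target)
        return 301, target
    # No mapping: log it loudly so it can be reviewed and added later.
    unmapped_hits[old_path] += 1
    logger.warning("no mapping for %s, sending to %s", old_path, fallback)
    return 302, fallback


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(resolve("/old/about.html"))
    print(resolve("/old/forgotten.html"))
    print(unmapped_hits.most_common(5))
```

Reviewing unmapped_hits.most_common() from time to time shows which missing pages matter enough to deserve a mapping.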