How do you spell check a website?

I know that spellcheckers are not perfect, but they become more useful as the amount of text grows. How can I spell check a site which has thousands of pages?

Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also it cannot be outsourced to a third party.

Edit: I have a list of all of the URLs on the site that I need to check.

Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded JavaScript and CSS).

lynx -dump

It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:

lynx -dump | grep -v "http"

The URLs could also be local (file://) if I have used wget to mirror the site.

I will write a script that will process a set of URLs using this method, and output each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
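A minimal sketch of what such a script might look like, written in Python rather than shell. It assumes lynx is installed and on the PATH; the function names (url_to_filename, strip_url_lines, dump_pages) and the "pages" output directory are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def url_to_filename(url):
    """Turn a URL into a flat, filesystem-safe name ending in .txt."""
    name = re.sub(r"^[a-z]+://", "", url)  # drop the scheme (http://, file://)
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return (name or "index") + ".txt"


def strip_url_lines(dump):
    """Drop lines that contain URLs, like `grep -v "http"` (plus file://)."""
    return "\n".join(
        line for line in dump.splitlines()
        if "http" not in line and "file://" not in line
    )


def dump_pages(urls, out_dir="pages"):
    """Run `lynx -dump` on each URL and save the filtered text to its own file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        result = subprocess.run(
            ["lynx", "-dump", url],
            capture_output=True, text=True, check=True,
        )
        (out / url_to_filename(url)).write_text(strip_url_lines(result.stdout))
```

With the URL list kept one per line in a file (say a hypothetical urls.txt), calling dump_pages(open("urls.txt").read().split()) would fill the output directory. Two different URLs can map to the same file name after sanitising, so a real script would probably want to detect collisions.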

This will ignore text in title and meta elements. These can be spellchecked separately.
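One possible way to pull out the title and meta text for that separate pass, sketched with Python's standard html.parser module. The class and function names are hypothetical, and limiting the meta elements to "description" and "keywords" is my assumption about which ones contain prose worth checking.

```python
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):
    """Collect the <title> text and the content of selected <meta> elements."""

    META_NAMES = ("description", "keywords")  # assumed to be the ones of interest

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in self.META_NAMES:
            if attrs.get("content"):
                self.texts.append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.texts.append(data.strip())


def head_text(html):
    """Return the title and meta strings found in an HTML document."""
    parser = HeadTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts
```

The returned strings can be appended to the same text files that hold the page bodies, so one spellchecker run covers both.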

Just a few days ago I discovered the Spello web site spell checker. It uses my NHunspell library (the OpenOffice spell checker for .NET). You can give it a try.

If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script will print the name of a file, line number, and misspelled words. The output's quality depends on that of your system's dictionary.


#!/bin/sh

# Find HTML files
find "$1" -name \*.html -type f |
while read -r f
do
        # Split file into words
        sed '
# Remove CSS
/<style/,/<\/style/d
# Remove Javascript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/  */\
/g ' "$f" |
        # Remove blank lines
        sed '/^$/d' |
        # Sort the words
        sort -u |
        # Print words not in the dictionary
        # (comm needs both inputs sorted in the same order)
        comm -23 - /usr/share/dict/words >/tmp/spell.$$.out
        # See if errors were found
        if [ -s /tmp/spell.$$.out ]
        then
                # Print file, number, and matching words
                grep -F -Hno -f /tmp/spell.$$.out "$f"
        fi
done
# Remove temporary file
rm -f /tmp/spell.$$.out
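For comparison, here is a rough Python sketch of the same dictionary lookup. It is not a translation of the script above: it works on text that has already been stripped of markup, matches whole words rather than substrings, and lowercases each word before the lookup. The function names are hypothetical, and the default word list path is the one the shell script uses.

```python
import re


def load_dictionary(path="/usr/share/dict/words"):
    """Read a one-word-per-line word list into a lowercase set."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def find_misspellings(text, dictionary):
    """Yield (line_number, word) for every word not in the dictionary."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() not in dictionary:
                yield lineno, word
```

As with the shell version, the quality of the report depends on the word list: inflected forms and proper nouns missing from it will be flagged.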

I highly recommend Inspyder InSite. It is commercial software, but a trial is available, and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling and can integrate with CMS custom word lists. It is also a good way to link-check and can generate reports.

You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?

I'd personally use Python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
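A rough sketch of that idea, with one substitution: it uses the standard library's html.parser instead of Beautiful Soup, so it runs without extra installs. It assumes aspell is on the PATH with an English dictionary available; VisibleTextParser, visible_text, and misspelled are names invented for this example.

```python
import subprocess
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text content and img alt text, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page text as one space-separated string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def misspelled(text, lang="en"):
    """Pipe text through `aspell list` and return the unique rejected words."""
    result = subprocess.run(
        ["aspell", "--lang=" + lang, "list"],
        input=text, capture_output=True, text=True, check=True,
    )
    return sorted(set(result.stdout.split()))
```

Calling misspelled(visible_text(page_source)) on each fetched page gives a per-page list of suspect words. Text in the title element is included here because only script and style are skipped.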