How do you spell check a website?
I know that spellcheckers are not perfect, but they become more useful as the amount of text increases. How can I spell check a site that has thousands of pages?
Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also it cannot be outsourced to a third party.
Edit: I have a list of all of the URLs on the site that I need to check.
Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded JavaScript and CSS).
lynx -dump http://www.example.com
It also lists all URLs in the page (converted to their absolute form), which can be filtered out using grep:
lynx -dump http://www.example.com | grep -v "http"
The URLs could also be local (file://) if I have used wget to mirror the site.
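For example, a local mirror can be created with something like the following (the exact flags depend on the site):
wget --mirror --convert-links http://www.example.com/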
I will write a script that processes a set of URLs using this method and writes each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
This will ignore text in title and meta elements; those can be spellchecked separately.
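A minimal sketch of such a script, assuming the URLs are listed one per line in a file called urls.txt (the file name, the output directory, and the choice of aspell as the checker are placeholders of mine):
#!/bin/sh
# Dump each page as plain text into its own file
mkdir -p dumps
n=0
while read -r url
do
n=$((n+1))
# lynx renders body content and alt text; grep drops the URL list
lynx -dump "$url" | grep -v "http" > "dumps/page$n.txt"
done < urls.txt
# Check all pages at once; aspell list prints the misspelled words
cat dumps/*.txt | aspell list | sort -u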
Just a few days ago I discovered Spello, a website spell checker. It uses my NHunspell library (the OpenOffice spell checker for .NET). You can give it a try.
If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script prints the file name, line number, and misspelled words. The quality of the output depends on that of your system's dictionary.
#!/bin/sh
# Find HTML files
find "$1" -name \*.html -type f |
while read f
do
# Split file into words
sed '
# Remove CSS
/<style/,/<\/style/d
# Remove Javascript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/[ ][ ]*/\
/g ' "$f" |
# Remove blank lines
sed '/^$/d' |
# Sort the words
sort -u |
# Print words not in the dictionary
comm -23 - /usr/share/dict/words >/tmp/spell.$$.out
# See if errors were found
if [ -s /tmp/spell.$$.out ]
then
# Print file name, line number, and matching words
fgrep -Hno -f /tmp/spell.$$.out "$f"
fi
done
# Remove temporary file
rm -f /tmp/spell.$$.out
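Assuming the script is saved as spellcheck.sh and the site has been mirrored into ./site (both names are placeholders), it can be run like this:
sh spellcheck.sh ./site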
I highly recommend Inspyder InSite. It is commercial software, but a trial is available, and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation and scheduling, can integrate with CMS custom word lists, and is also a good way to check links and generate reports.
You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?
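A rough pipeline along those lines, with the URL as a placeholder (the sed expressions crudely drop CSS, scripts, and tags, and aspell list prints the words it does not recognize):
wget -q -O - http://www.example.com/ | sed -e '/<style/,/<\/style/d' -e '/<script/,/<\/script/d' -e 's/<[^>]*>//g' | aspell list | sort -u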
I'd personally use Python with Beautiful Soup to extract the text from the tags and pipe the text through aspell.