
Re: off topic - diff for html docs?



On Thu, 1 Oct 1998, guy keren wrote:

> diff is not useful for seeing _content_ changes of html files. i don't
> care if a tag was changed, or if one or more empty lines were added, or if
> a new editor inserted its comment at the start of the page. 

imho this is a Perl/awk problem. If you have large amounts of data, use
flex + C. Just write a proggie that drops anything between '<' and '>', and
add a tail stage that compresses any run of whitespace into one newline.
This should give a list of plain words, one per line. You can compare two
such lists with diff or even wc, or you can md5 the result and store it for
reference. You can also do other literate programming statistical analysis
on the output.
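
A rough sketch of such a proggie, in Python rather than Perl/awk purely
for illustration. The names html_words and fingerprint and the two sample
pages are made up for this example. It assumes no '>' appears inside a
comment or an attribute value, since the tag pattern stops at the first
'>' it finds.

```python
import hashlib
import re

# Anything from '<' up to the next '>' counts as a tag (or comment).
TAG_RE = re.compile(r"<[^>]*>")


def html_words(html):
    """Drop everything between '<' and '>', then emit one word per line."""
    text = TAG_RE.sub(" ", html)
    # str.split() with no argument collapses any run of whitespace.
    return "\n".join(text.split())


def fingerprint(html):
    """MD5 of the word list, suitable for storing as a reference."""
    return hashlib.md5(html_words(html).encode("utf-8")).hexdigest()


# Hypothetical sample pages: same words, different tags, blank lines,
# and an editor comment at the top.
old = "<html><body><p>Hello   world</p></body></html>"
new = "<HTML>\n<!-- edited -->\n<BODY>\n\n<P>Hello world</P>\n</BODY></HTML>"

if __name__ == "__main__":
    print(html_words(old))
    print(fingerprint(old) == fingerprint(new))
```

The two word lists can also be written to files and fed to plain diff,
which then reports only changes in the words themselves.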

If you want to get rid of the header, implement a gate that is turned on
by <BODY>, and only start producing output once the gate is open.
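
The gate could look something like this, again as a Python sketch. The
name body_words is made up, and falling back to the whole input when no
<BODY> tag is found is my own assumption, not something the gate idea
requires.

```python
import re

# Opening <BODY> tag, in any case, with optional attributes.
BODY_RE = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
# Anything from '<' up to the next '>' counts as a tag (or comment).
TAG_RE = re.compile(r"<[^>]*>")


def body_words(html):
    """Ignore everything up to <BODY>, then list the words one per line."""
    match = BODY_RE.search(html)
    if match:
        # The gate opens here: keep only what follows the <BODY> tag.
        html = html[match.end():]
    # Assumption: with no <BODY> tag at all, process the whole input.
    return "\n".join(TAG_RE.sub(" ", html).split())


if __name__ == "__main__":
    page = ("<html><head><title>My page</title></head>"
            "<BODY bgcolor=white><p>Hello world</p></BODY></html>")
    print(body_words(page))
```

The title text in the header never reaches the output, so a change there
does not show up in the diff or the checksum.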

imho you SHOULD care whether an editor has inserted its comment at the
top.  Don't ask why. 

imho it takes at most 2 hours to debug such a script.

Peter