The original lex version of dw (07-May-1991) used the pattern
[A-Za-z_][A-Za-z_0-9]* to identify a word, and the pattern [^ \t\n]+ to
identify a non-word, non-whitespace token.

This fails to recognize the input "foo foo!bar" as a doubled-word
instance: lex takes the longest match, and "foo!bar" matches the
second pattern with a longer match than "foo" gets from the first.

The change is therefore to set the second pattern to [^A-Za-z_ \t\n]+.
The above example then is tokenized as "foo" "!" "bar", and the
doubled word is recognized.
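
The effect of the pattern change can be reproduced outside lex.  The
following Python sketch is not the dw source; it only emulates lex's
longest-match rule (ties go to the first-listed rule) for the two
patterns quoted above.  The names tokenize and doubled_words are
hypothetical, and the doubled-word test used here (two adjacent word
tokens that are equal) is an assumption about how dw compares tokens.

```python
import re

WORD = r"[A-Za-z_][A-Za-z_0-9]*"
OLD_NONWORD = r"[^ \t\n]+"            # second pattern in the original dw.l
NEW_NONWORD = r"[^A-Za-z_ \t\n]+"     # second pattern after the change


def tokenize(text, nonword):
    """Emulate lex: at each position take the longest match; on a tie,
    the first-listed rule (the word rule) wins."""
    rules = [("word", re.compile(WORD)), ("other", re.compile(nonword))]
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        if best_kind is None:
            pos += 1                  # whitespace: no rule matches, skip it
            continue
        tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens


def doubled_words(tokens):
    """Assumed test: two adjacent word tokens with the same text."""
    return [b[1] for a, b in zip(tokens, tokens[1:])
            if a[0] == b[0] == "word" and a[1] == b[1]]


if __name__ == "__main__":
    for label, nonword in (("old", OLD_NONWORD), ("new", NEW_NONWORD)):
        toks = tokenize("foo foo!bar", nonword)
        print(label, [t[1] for t in toks], doubled_words(toks))
```

With the old pattern the second token is "foo!bar", so no doubled word
is found; with the new pattern the tokens are "foo" "foo" "!" "bar".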

Word sequences like this arise in TeX documents with MakeIndex index
entries: "gnus\index{gnus!sub-Saharan}" input to detex produces
"gnus gnus!sub-Saharan".

To provide dw support on machines that lack lex or flex, a standalone
version in C has been prepared in dw2.c.  It is functionally identical
to the lex-based dw.l, and its executable can be renamed dw.  There is
no noticeable performance difference between the two versions.

An awk version has also been prepared that can be used with gawk or
nawk; the original awk doesn't have enough capabilities to do the job.
It is about 10 times slower than the C version, and would be even
slower if I added code to make its output match that of dw.  "tr A-Z
a-z | [gn]awk -f dw4.awk" splits the input lines into lower-case words
separated by characters that match "[^A-Za-z0-9_]".  It therefore
erroneously reports that "J. J. Smith" has a doubled word "j"; dw
would not make this error.
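
The "J. J. Smith" false positive can be illustrated with a short
Python sketch.  It is not a translation of dw4.awk; awk_style_doubled
only mimics the lower-case-and-split behaviour described above, and
token_style_doubled is a hypothetical contrast that keeps punctuation
as separate tokens.  That dw avoids the error because the "." token
sits between the two words is an assumption, not something taken from
the dw source.

```python
import re


def awk_style_doubled(line):
    """Lower-case the line, split on runs of [^A-Za-z0-9_], and report
    adjacent equal words.  Punctuation is discarded by the split."""
    words = [w for w in re.split(r"[^A-Za-z0-9_]+", line.lower()) if w]
    return [b for a, b in zip(words, words[1:]) if a == b]


# Words and runs of non-word, non-whitespace characters as separate tokens.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|[^A-Za-z_ \t\n]+")


def token_style_doubled(line):
    """Keep punctuation tokens, so "j" "." "j" are not adjacent words."""
    tokens = TOKEN.findall(line.lower())
    return [b for a, b in zip(tokens, tokens[1:])
            if a == b and re.match(r"[a-z_]", b)]


if __name__ == "__main__":
    print(awk_style_doubled("J. J. Smith"))    # false positive: "j"
    print(token_style_doubled("J. J. Smith"))  # nothing reported
    print(token_style_doubled("the the cat"))  # genuine doubled word
```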