
Re: [Q]: Intranet tools



WAIS knows about exclusion lists, and a simple script can make it index
files of the same kind in batches and then merge the indexes. It is very
flexible. Compressed files must be uncompressed before they are indexed,
but they need not stay that way: once the contents are in the index, the
file can be compressed again. Trickery with soft links lets the index
point to the document correctly even though its name changes while it is
uncompressed: a soft link bearing the name of the compressed version is
made to the uncompressed version, and the link is what gets indexed. Then,
when the document is re-compressed and stored, the index contains the
correct name.
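
A minimal sketch of the soft-link trick described above, written in Python
for illustration only. The function name index_compressed and the
index_file callback are assumptions: the callback stands in for whatever
command actually runs the indexer on one file, and no real WAIS options are
shown. It also assumes gzip compression and a filesystem with soft links.

```python
import gzip
import shutil
from pathlib import Path
from typing import Callable


def index_compressed(gz_path: Path, index_file: Callable[[Path], None]) -> None:
    """Index the contents of a .gz file under its compressed name.

    index_file is a hypothetical callback that runs the indexer on one path.
    """
    if gz_path.suffix != ".gz":
        raise ValueError(f"not a .gz file: {gz_path}")
    plain = gz_path.with_suffix("")  # doc.txt.gz -> doc.txt
    if plain.exists():
        raise FileExistsError(f"would overwrite {plain}")

    # 1. Uncompress; the compressed file goes away, as with gunzip.
    with gzip.open(gz_path, "rb") as src, open(plain, "wb") as dst:
        shutil.copyfileobj(src, dst)
    gz_path.unlink()

    try:
        # 2. Soft link bearing the compressed name, pointing at the plain file.
        gz_path.symlink_to(plain.name)
        # 3. Index through the link, so the index records the compressed name.
        index_file(gz_path)
    finally:
        # 4. Drop the link and re-compress, even if indexing failed.
        if gz_path.is_symlink():
            gz_path.unlink()
        with open(plain, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        plain.unlink()
```

The indexer only ever sees the compressed name, so nothing in the index has
to be patched after the file is compressed again.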

It takes a lot of scripting to make it work.
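
As one example of that scripting, here is a rough sketch of the batching
step: walk a tree, skip anything on the exclusion list, and group what is
left by file kind so that each group can be handed to the indexer in one
run. The function name batch_by_kind, the use of the file extension as the
"kind", and the shell-style exclusion patterns are all assumptions made for
illustration; running the indexer on each batch and merging the indexes are
left out.

```python
from collections import defaultdict
from fnmatch import fnmatch
from pathlib import Path


def batch_by_kind(root, exclude=()):
    """Group files under root by extension, skipping excluded patterns.

    Returns a dict mapping extension (or "(none)") to a list of paths
    relative to root. Patterns are shell-style, e.g. "*.o".
    """
    batches = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Exclusion list: drop anything matching one of the patterns.
        if any(fnmatch(rel, pattern) for pattern in exclude):
            continue
        batches[path.suffix.lower() or "(none)"].append(rel)
    return dict(batches)
```

Each value in the returned dict is one batch of same-kind files, ready to be
passed to a single indexer run.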

Serving documents converted into HTML on the fly is something even the big
search engines don't do. Yahoo has a lot of pointers to pages that are not
HTML at all (plain text, PDF and whatnot), and there is no online
conversion utility.

Most indexers handle a limited set of document types. Handling PostScript,
HTML, SGML and plain text is no problem. I think that the people who made
the Linux source browsable used something like that to index C functions
and other stuff. I used an awk script to do the same (index a C source
tree).
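
To give an idea of what indexing C functions can look like, here is a rough
sketch in Python. It is not the awk script mentioned above and not what the
Linux source browser does; it is a line-based heuristic that assumes
definitions start in column 0 and that prototypes end with a semicolon on
the same line. The names FUNC_DEF, KEYWORDS and index_c_functions are made
up for this example, and macro invocations in column 0 would be picked up
by mistake.

```python
import re

# Heuristic: optional return type, then "name(" starting in column 0,
# with no ';' anywhere on the line (which would indicate a prototype).
FUNC_DEF = re.compile(
    r"^(?:[A-Za-z_][\w \t*]*?[ \t*])?([A-Za-z_]\w*)[ \t]*\([^;]*$"
)
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def index_c_functions(sources):
    """Map function name -> list of (filename, line number).

    sources maps a filename to the text of that C file.
    """
    index = {}
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = FUNC_DEF.match(line)
            if match and match.group(1) not in KEYWORDS:
                index.setdefault(match.group(1), []).append((filename, lineno))
    return index
```

A real C parser or a tool built for the job would be more reliable; the
point is only that a small script gets surprisingly far.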

If you want to beat the likes of AltaVista and Yahoo on a small machine
(even a dedicated one), you are not being very realistic.

The version of Linux I was talking about is from Lasermoon, UK, and it is
*OLD* (1.2.13 - 1996). I got a demo on a CD. It used to work on a 'live'
fs at reasonable speed on a 486DX with a 4x CD drive. Full-text index of
all of the man pages and all the HOWTOs and docs, if I remember well. All
in HTML, using the NCSA server and Chimera as browser, all of this on the
same machine. Real cool. There was no way to add docs on the fly.

For a place that says a couple of things about how to do it if you have a
University budget and more behind you, take a look at this:

http://photo.net

And follow the links. (HINT: Turn off 'auto load images' before entering
if surfing by modem. The 'photo' is for 'gigabytes of graphics in-line'.)

P.