Changes in current version

- Added support for setting transfer timeout times through environment variables and config file settings (Peter Scott)
- Fixed problem with broker registry not correctly handling deletion of duplicate objects (Hrvoje Stipetic)
- Added support for relative URLs in client pull information, and for the HTML refresh directive. (Hrvoje Stipetic)
- Changed the content of the "description" attribute to be the first 255 characters of the object's "body" or "partial-text" attribute (Peter Valkenburg)
- SGML summariser fixes -
    * to allow mapping of Dublin Core directives to attributes
    * to automatically prefix meta generated elements with a custom prefix
    * to 'fix' meta generated attributes to only contain legal SOIF characters
  (Hrvoje Stipetic)
- Added support for printing total number of items found when in perpage mode to nph-search (Hermann Straus)
- Altered broker behaviour so that objects with no TTL value are never expired (Simon Wilkinson)
- Fixed bug in gather which would result in incorrect HELLO messages being sent to servers (Dave Beckett)
- Fixed broker bug where duplicate copies of a URL could remain in the broker in certain circumstances (Hrvoje Stipetic)
- Added reporting of number of objects added to broker output (Hrvoje Stipetic)
- Added broker version information to broker greeting message (Hrvoje Stipetic)
- Fixed bug where admin password could be written to broker log (Hrvoje Stipetic)
- Removed unnecessary prints in the nph-search script
- Fixed parsing of broker configuration files so that whitespace is correctly handled in arguments (Peter J. Scott & Hrvoje Stipetic)
- Added -V option to the broker to return the version number (Hrvoje Stipetic)
- Fixed definition of random() in stor_man.c for better portability (Otis Gospodnetic)
- Improved result ranking in nph-search for occurrence of search string in URL and titles (Bruce R.
  Lewis)
- Fix to LocalMapping code to remove problem where FD 0 (stdin) could be closed by mistake (Simon Wilkinson)
- Fixes to SGML.sum to handle problems with HTTP-EQUIV and spaces in META NAME attributes (Hrvoje Stipetic)
- Fix to broker where compression would unlink open files, and eventually the broker would run out of disk space (Hrvoje Stipetic)
- Fixes to sys_errlist definitions for Linux compatibility (Tim Riker)
- Fixed local file handling so that it doesn't hang if it tries to access a "special" file (Simon Wilkinson)
- Fixes for glimpse internationalization - ISO_CHAR_SET has to be not only defined, but be non-zero (Marjan Erzen)
- Added status code logging to liburl, so failed requests are logged with the reason for failure (Marjan Erzen)
- Altered robots.txt code, so that when an empty User-Agent exists, everyone may collect data (Marjan Erzen)
- Fixed gather.c so that it will compile with Sun's CC, and so it doesn't redefine existing code (Marjan Erzen)
- Added a new admin command to the Broker - 'parse-template' which will add a SOIF stream to the broker (Marjan Erzen)
- Added a command line md5 tool (Marjan Erzen)
- Altered Makefiles so that objects can be built outside of the current source directory when using a make that supports VPATH - such as gnumake. (Simon Wilkinson)
- Added support for sending Accept: headers to http.c in liburl.
  Currently sends Accept: */* (Simon Wilkinson)
- Fixed filter.c so both host and portnumber are made available for regular expressions (Wesley Alan Wright)
- Added a SWISH based indexing engine, which is considerably more efficient than Glimpse, if slightly less fully featured (Simon Wilkinson)
- Changes to nph-search.in so that the truncation warnings aren't erroneously displayed when the user hasn't supplied a value for maxobjflag (Allyn Fratkin)
- Added support for META client pull redirections to HTMLurls (Hrvoje Stipetic)
- Increased limits in HTML.decl (Tim Riker)
- Fix for gather command's uncompressed transfer routine so that it now transfers blocks correctly in most circumstances (Marjan Erzen) [This patch went in in pl10 - but the ChangeLog entry was omitted]
- Added N-at-a-time output to the nph-search CGI, so results can be paged. (David Hoekman)
- Added noregex option to nph-search and to the broker Glimpse interface to allow the disabling of regular expressions in search patterns (Simon Wilkinson)
- All Harvest binaries should now honour setting of prefix as being the install location, *when run normally*. Many still rely upon HARVEST_HOME to point them to the correct location, but this is provided by the support scripts which run them (Simon Wilkinson)
- Added support for setting defaults in the query configuration file, so default values can be given for all parameters that are usually passed into the CGI from the Web search form (Simon Wilkinson)
- Added support for sorting query results by rank to nph-search (Wesley Alan Wright)
- Rewrote most of the Makefiles and altered the configure scripts so that the build process is more rational, and so that the targets are now "standardised".
  In addition, a number of problems with the makefiles have been fixed, and all of the configure routines use one central cache file - speeding up the build process (Simon Wilkinson)
- Corrected order of adding files to PATH so that Harvest binaries are now always used in preference to system ones (Vincent Winczewski)
- Replaced BrokerAdmin.cgi with a perl script which accepts POST requests. This fixes the potential problem of passwords being sent in GET requests which were visible in server log files (Simon Wilkinson)
- Fixed bug in robots.txt code which caused segmentation faults when reading incorrect files with empty User-Agent lines (Simon Wilkinson)
- Updated nph-search so that it only sends nph- headers if invoked as an nph script, and so that it will automatically decode HTML entities. Removed BrokerQuery.pl (now symlink to nph-search) (Craig Counterman)
- Added a new perl HTML summariser (Andy Powell)
- Replaced all of the HSR tools with new, fixed versions supplied by Mic Bowman
- Added changes to SGML.sum to make it compatible with nsgmls, part of SP.
  Error reports now on by default (Craig Counterman)
- Altered broker/Glimpse/index.c so that case and word matching are disabled when an errorflag is provided (Craig Counterman)
- Fixed broker/Glimpse/index.c so it compiles on an SGI (Simon Woods)
- Altered http code so it now sends a user controllable User-Agent and Maintainer address which can be set in the config file (Hrvoje Stipetic)
- Altered httpenum-breadth so that the count of objects retrieved is accurate (Hrvoje Stipetic)
- Altered summarisers Makefile so that Pdf.sum is now installed by default (Simon Wilkinson)
- Added -DAGREP_POINTER=1 to glimpse Makefiles - fixes segmentation faults with picky mallocs (Dan Riley)
- Replaced HTMLparse with a newer version from Mosaic 2.7b5 which handles comments correctly (Simon Wilkinson)
- Changed Postscript.sum to a sh script, so stderr doesn't corrupt the generated SOIF (Simon Wilkinson)
- Added some #includes for AIX compatibility (Simon Wilkinson)
- Default url-filter now blocks VRML files (Craig Counterman)
- Perl scripts which forced /usr/local/harvest changed (Craig Counterman)
- Altered rfc1738 unescaping so it can cope with badly formed URLs (Simon Wilkinson)
- Assorted changes for compilation under BSDI (Simon Wilkinson)
- Fixed configure files so libraries are included in correct order for Solaris (Simon Wilkinson)
- Fixed SGML.sum so that all debug output is sent to STDERR so it doesn't corrupt generated data (Simon Woods)
- Fixed undef of SOIF associative array for compatibility with Perl 5.004 (Bill Corley)
- Fixed bug where a url wasn't closed in httpenum-breadth (Wolfgang Klimt)
- Made broker more resilient to glimpse bugs (Simon Wilkinson)
- Fixed commenting mistake in glimpse Makefile (Jeffrey Goldberg)
- Changed broker config so numbers are now indexed by glimpse
- Added missing ftp files ftp.pl and chat2.pl (Simon Wilkinson)
- Fix for Year 2000 logging and file naming problem (logs should now be produced using 4 digits for the year) (Simon
  Wilkinson, suggested by Craig Counterman)
- Fixed gatherers so they don't revisit pages that they've previously found to be broken (Simon Wilkinson, suggested by David A. Nowitz)
- Fixed breadth first httpenum so it correctly handles META robots control tags (Simon Wilkinson)
- Rewrote Troff.sum in perl (David A. Nowitz)
- Added DOCTYPES and tidied up HTML generally (Craig Counterman)
- Fixed &times bug in &timestamp call (Craig Counterman)
- Fixed assorted bugs in nph-search (Craig Counterman)

Changes to v1.5

- Altered some lseek() calls for BSD compatibility (Martin Hamilton)
- Removed optimisation from glimpse configuration (Simon Wilkinson)
- Added -nocol option to broker (Dave Beckett)
- Added virtual host support using the "Host:" header in HTTP/1.1 (Juha Laiho)
- Added the HTML 3.2 ISOlat definition (Simon Wilkinson)
- Added support for META Robots tags (Simon Wilkinson)
- Altered broker / glimpse interface so that an attempt to use regular expressions on word boundaries results in word matching being turned off, rather than an error message (Simon Wilkinson)
- Fixed core dumps in httpenum when accessing pages requiring authentication (Simon Wilkinson)
- Changed url_retreive behaviour so original caller gets notified of Redirect. This fixes a bug which made it possible to index a site and completely ignore its robots.txt file (Simon Wilkinson)
- Added wildcards to Local-Mapping (Bruce R. Lewis)
- Added support to the broker for terminating connections if the client goes away (Bruce R. Lewis)
- Added depth first gatherer for both HTTP and Gopher enumeration.
  (Peter Scott)
- Added hooks to enum, prepurls and Gatherer script to support user selection of search technique via the Search variable (Simon Wilkinson)
- Fixed bug in depth first enumerator where a URL first encountered at a depth deeper than max depth isn't indexed when it's seen at a lower depth (Peter Scott)
- Depth counts added to the depth first enumerator (Peter Scott)
- Robots.txt checks added to the depth first HTTP and Gopher enumerators (Simon Wilkinson)
- Added code to grab URLs from HTML containing frames and client side image maps (Dean Marino)
- Altered breadth first HTTP and Gopher enumerators so that the server containing the root page is marked as visited. (Julian Field)
- Altered enumerators so URL count is increased for visited pages only (Ed Knowles)
- Altered HTMLurls so HREFS can now contain new lines (Peter Scott)
- Altered depth first enumerators to reduce number of temporary files generated (Simon Wilkinson)
- Fixed enumerators to avoid buffer overflows with long URLs (Simon Wilkinson)
- Added MAX_FILTERS env var to URL filter code (Simon Wilkinson)
- Fixed url code to avoid unnecessary generation of symlinks (Bruce R. Lewis)
- Added support for HTTP/1.1 servers to url code (Simon Wilkinson)
- Fixed end of header detection bug in url code (Paul Johnson)
- Fixed temporary file closing bug in url code (Bruce R. Lewis)
- Altered Makefiles so realclean removes config.cache (Simon Wilkinson)
- Altered robots.txt handling so it can cope with errors in server robots files. Also some performance improvements. (Ed Knowles)
- Added support for user specified User-Agent strings to the robots.txt code (Simon Wilkinson)
- Changed method of comparing User-Agent string against robots.txt file - now correct behaviour as per the robots specification (Simon Wilkinson)
- Fixed bug where robots.txt code left file pointer unclosed (Ed Knowles)
- Fixed bug where paths were compared case-insensitively in the robots.txt code.
  (Simon Wilkinson)
- Altered Gatherer code so it converts ~ and %7E to %7e in Local-Mapping strings (Simon Wilkinson)
- Added Locale option to Gatherer (Simon Wilkinson)
- Added html32.dtd - the official Wilbur DTD
- Fixed usage of fork in ps2txt wrapper (Peter Scott)
- Fixed Harvest script so the RunGatherd scripts it creates include correct PATH (Simon Wilkinson)
- Fixed Rainbow summariser so it rm -r's instead of unlinking
- Altered files so they will compile under NetBSD (Martin Hamilton)
- Altered configure scripts for NetBSD support (Dave Beckett)
- Upgraded included version of glimpse to 4.0
- Fixed broker so the port picked for glimpseindex is less than 30000 (Peter Scott)
- Added support for NOT operator to broker glimpse interface code
- Fixed signal handling bug in glimpseserver (Simon Wilkinson)
- Added warning for object truncated result sets to BrokerQuery.pl.cgi
- Added nph-search CGI for displaying results "as they come" (Bruce R. Lewis)
- Altered query cgis to escape quotes in $html_query (Simon Wilkinson)
- Added explanation of the Log-Key directive to broker.conf (Dave Beckett)
- Fixed bug in the broker lex file so searches containing an apostrophe will work

Changes to v1.4.pl2:

- Added "robots.txt" support to the gatherer enumeration.
- Added "prefix =" to components/*/Makefile.
- Changed Gopher timeout to 120 seconds.
- Changed HTTP-Query byurl pattern to be any URL with a question mark.
- Added HARVEST_NOT_VISITED_LOG env var to httpenum.
- Added support for Glimpse-based broker to limit the number of matched lines per object.
- Protect single quotes in Gatherer-Name.
- Updated html-mcom.dtd
- Changed BrokerQuery.pl to not use a tmpfile, and to sort the results by number of matched lines.
- Fixed HTTP authentication to work with Netscape server, and support encoding spaces as RFC1738 escapes.
- Fixed gatherd timeout bug (caused by eliminating the DNS mismatch warning).
- Fixed Carriage-Return substitution in RTF.sum.
- Changed SGML.sum to not do "word wrap" on very large strings.
- Fixed HTML-lax.sum to turn carriage returns into spaces.
- Added --body-text option to HTML-lax.sum and in the comments of HTML.sum.
- Fixed SGML.sum to NOT rewrite correct DOCTYPE declarations.
- Changed our log() function to be called Log().
- Changed handling of depth between enum programs.
- Changed broker connect timeout to happen only after some data has been read.
- Added 'Access-Delay' to gatherer.cf. Now adds delay for LeafNode URLs.
- Fixed 'gather' bug when reading binary data in non-compressed mode.
- Configure: upgraded to v2.7
- BrokerQuery.pl.cgi: protect special characters in a Broker name.
- Essence: Print [L] for URLs where local mapping succeeds.
- Updated the User's Manual.

Changes between release v1.4.pl1 (November 17, 1995) and v1.4

- Changes to the Gatherer
    - Fixed NULL BASE URL coredump bug in HTMLurls
    - Fixed Gatherer to make Top-Directory set Lib-Directory value also (like the manual says it does in section 4.6.1).
    - Fixed essence and SGML.sum to look in multiple lib dirs. Look first in Lib-Directory if set, otherwise in $HARVEST_HOME/lib/gatherer.
- Changes to the Broker
    - Added <sys/select.h> for compiling on AIX.
    - Changed BrokerQuery.pl.cgi to send the query to the broker before opening the tmpfile. If there is a delay in opening the tmpfile the broker query could time out.
    - Fixed potential coredump in Log_rotate() due to large local array.

##############################################################################

Changes between release v1.4 (November 10, 1995) and v1.3

- Changes to the Gatherer:
    - Added symbolic link loop detection to httpenum.
    - Added a GIF image summarizer (GIFImage.sum), requires netpbm. The GIFImage type is still in the Essence stoplist by default.
    - Added 'C' version of ftpget.
    - Added ability to rewrite the SOIF template URL with Essence post-processing. Could be used to gather file:// URLs and have them exported as http:// URLs.
    - Added the ability to specify a program to generate root/leaf URLs.
    - Fixed select() timeouts to POSIX semantics.
    - Fixed SGML summarizer to give error if input is empty.
    - Fixed a Makefile to actually build and install HTML-lax.sum.
    - Fixed liburl problem with AFS. Must *copy* files into the cache-liburl directory.
    - Fixed News gathering: If 'newsget.pl' exits non-zero, close the NNTP server socket.
    - Fixed newsget.pl with a major rewrite.
    - Fixed 'fileenum' to use URLs and not always return file://hostname/.
    - Fixed gatherd bug where child process would remove parent's gatherd.pid file.
    - Changed NewsArticle.sum TTL to 7 days by default.
    - Changed Essence unnesting to occur in individual directories.
    - Removed confusing gatherd DNS mismatch warning message.
- Changes to the Broker:
    - Added #Restart-Index-Server command to broker admin command set.
    - Added error logging and debugging in Glimpse inline query code.
    - Fixed select() timeouts to POSIX semantics.
    - Fixed Glimpse minor malloc problems.
    - Fixed the broker on Linux; needs unbuffered input from gather process.
    - Fixed broker query language bug for high-bit (international) characters.
    - Changed Broker to allow specifically setting GlimpseServer_Port again; if not set, port is chosen randomly.
    - Changed BrokerAdmin.cgi to use unbuffered output.
    - Changed Glimpse macros CLEANUP and RETURN to be functions.
    - Changed broker admin/LOG to log FQDN instead of IP address.
    - Removed glimpse version ambiguities in Glimpse/index.c.
    - Removed getpeername() call in the broker; get address from accept().
- Changes to the Cache:
    - The cache has been moved to a separate distribution.
- Miscellaneous Changes
    - Don't link with -lmalloc on Solaris.
    - Fixed User Manual and FAQ inconsistencies.
##############################################################################

Changes between release v1.3 (September 7, 1995) and v1.3.beta:

- Changes to the Broker:
    - Added support for auto-validation from the HSR which includes a description.html file, RunUpdate program for each new Broker.
- Changes to the Cache:
    - Added support to dynamically toggle debug level via USR1 and USR2.
    - Fixed dnsserver parsing numeric addresses.
    - Added patches for FreeBSD.
    - Changed source_ping to off by default.
    - Added optional code for 'local_ip' line in cached.conf. Addresses given as 'local_ip' will be retrieved directly, without sending any probe packets.
    - Added 'TIMEOUT_DIRECT' as a new kind of entry in cache_hierarchy.log.
- Changes to the Gatherer:
    - Added LMT.gdbm to liburl to keep last-modified-timestamps.
    - Added support for using BASE element in HTML enumeration.
    - Added support for HTML-3.0 DTD.
    - Added support for Netscape DTD.
    - Added support for HotJava DTD.
    - Added old HTML.sum as HTML-lax.sum.
    - Added MacBinHex as a supported nested type in essence.
    - Changed gatherd to die when its data directory gets removed.
    - Fixed bug: repeated HTTP redirected URLs (with help from glenn@rockie.nsc.com)
- Miscellaneous Changes
    - Incorporated fixes for FreeBSD port from ted@oz.plymouth.edu.
    - Incorporated fixes for Ultrix port from dsr@lns598.lns.cornell.edu.

##############################################################################

Changes between release v1.3.beta (August 7, 1995) and v1.2:

- Changes to the Broker:
    - Upgraded to Glimpse 3.0.
    - Improved and updated WAIS, Inc. support to use version 2.1.1.
    - Added support for Verity VDK as backend indexer/searcher.
    - Added support for GRASS GIS as spatial database.
    - Added support for PLS, Inc. PLWeb as backend indexer/searcher.
    - Added IP numbers for incoming requests to log information.
    - Added support for displaying individual SOIF attributes via WWW.
    - Added 'Uniqify' command to Broker; keeps most current object of duplicate URLs.
    - Added security and name lookup to BrokerQuery.pl.
    - Added support for Glimpse inline queries.
    - Added error message to report incorrect WWW installation.
    - Added some support for Internationalization in the Broker
    - Added support for automatic validation by HSR.
    - Removed need for 'gzip' in the Broker.
    - Changed BrokerQuery.pl to try multiple entries from Brokers.cf
    - Changed broker to read queries with a timeout. Very long queries can get segmented by TCP.
    - Fixed bug with matching Description attributes.
    - Fixed bug with Glimpse regular expression detection.
    - Fixed bug in CreateBroker -- wrong default Gatherer port number.
- Changes to the Cache:
    - Added persistent disk storage across cached reboots.
    - Added IP-based access control.
    - Added setting of the TTL based on URL regular expressions.
    - Added more sophisticated setting of the TTL based on HTTP headers.
    - Added more statistics information.
    - Added support for logging using the common httpd logfile format.
    - Added support for HEAD HTTP request method.
    - Added support for user-configurable periodic garbage collection.
    - Added support for user-configurable stoplist.
    - Added support for WAIS proxy'ing (from Edward Moy, Xerox PARC).
    - Added support for quick aborting when client drops connection, cached stops immediately. Useful for slow network links.
    - Added high/low water marks for disk storage.
    - Added 'source_ping' to cached.conf.
    - Added 'dns_children' to cached.conf.
    - Added -z to force a cached to discard (zap) its disk storage.
    - Added logging of ftpget.pl failures (exit codes and signals).
    - Added Expires timestamp to cache log
    - Improved error messages for DNS name lookup failures.
    - Improved performance of LRU replacement policy.
    - Improved performance for generating statistics.
    - Increased listen(2) socket queue size to 50 or max of OS.
    - Removed all Tcl code.
    - Cleaned memory allocation and management.
    - Cleaned up and updated cached.conf.
    - Cleaned up debugging output.
    - Changed default low watermark to 60%.
    - Changed trace mail into cached.conf option.
    - Changed algorithm for time estimations using echo ports.
    - Changed dnsserver to try gethostbyname(3) again sometimes
    - Fixed bugs with URL interpretation.
    - Fixed bugs with internal IPcache memory management.
    - Fixed bug with DNS lookups on IP numbers.
    - Fixed bug with not finding 'dnsserver'.
    - Fixed bug with hard timeouts in select loop.
    - Fixed bug with some platforms needing strdup().
    - Fixed bug with ftpget.pl not including MIME content-type for unknown filename extensions.
    - Fixed bug with ftpget.pl not parsing ls output correctly (wasn't matching dashes in user/group names).
    - Fixed copyright messages in source code.
    - Fixed realloc() bug for concurrent object access.
    - Fixed bug when neighbors specified and dns_servers != 3.
    - Fixed bug with new hash tables when deleting from table as it is being traversed.
    - Fixed various minor bugs.
- Changes to the Gatherer:
    - Added ability to pass enumerated URLs through an external filter program. Allows very specific selection of URLs to further enumerate.
    - Added -background flag to the Gatherer; does export work in bg.
    - Added IP-based filtering (regular expressions) in host-filter
    - Added Post-processing of summaries to Essence
    - Added 'gather' check for 'gzip' before setting compression option.
    - Added username/password support for HTTP retrievals
    - Changed gatherer to remove cache-liburl directory after a successful gather session.
    - Fixed bug: Infinite loops in 'enum' on Invalid URLs
    - Fixed bug: HTTP headers not parsed from slow servers
    - Improved URL parsing; support for username/password in FTP urls.
- Miscellaneous Changes
    - Upgraded autoconf 'configure' scripts to v2.4.
    - liburl: better handling of relative URLs.
    - liburl retrieval programs abort very large transfers (at 10 Mbytes)
    - Fixed bug with subscribing to harvest-users mailing list.
##############################################################################

Changes between release v1.2 (April 3, 1995) and v1.1:

- Changes to the Broker:
    - Major performance improvements to the collector interface.
    - Added fast, efficient internal Gatherer ID management.
    - Added support for clients requesting attributes with #attribute.
    - Added support for log file rotation, and terse logging.
    - Added support for #operation in query manager interface.
    - Cleaned up the log file format.
    - Cleaned up the administrative interface.
    - Cleaned up the UNIX file system-based storage manager.
    - Fixed major bug with WAIS support.
    - Fixed file descriptor leaks in glimpseserver when the index contained files that had since been deleted.
    - Fixed bug with overflowing lines from glimpse.
    - Fixed bug with hostname initialization.
    - Fixed memory leak with the Description-Tag attribute matching.
    - Fixed various minor bugs.
- Changes to the Cache:
    - Added httpd accelerator support.
    - Added IP number logging.
    - Added setuid() to a user when cached is run as root.
    - Added support for HTTP servers that die abruptly.
    - Added client_timeout which places a hard limit on the life of incoming connections on the ascii port, or on outgoing HTTP or Gopher clients.
    - Cleaner implementation for retrieving FTP URLs via ftpget.pl.
    - Tries to write cached.pid file in same directory as cached.conf.
    - Changed FTP support to sacrifice correct HTTP headers for dramatically decreased latency for large FTP objects.
    - Fixed ftpget.pl -htmlify to determine directory vs. file correctly and send HTTP header as soon as possible.
    - Fixed rare core dump during HTTP xfers.
    - Fixed how the error messages are printed.
    - Better support for larger file descriptor tables.
    - Debug level 0 and 1 now has timestamp logged.
    - Cleaned and updated defaults for cached.conf.
    - When run as root and switching user with setuid, cached changes its current directory to its swap directory. The swap directory is almost certainly writable by cached.
      This way, if cached crashes, it can write a core file.
    - Minor modification of store error message.
    - Remote client connection resets are handled as soft error.
    - Strip an extra \r\n from MIME.
    - Hierarchy log (yet another log, but it's optional).
    - Periodically hunts for zombie processes.
    - Added more information to the stat interface.
    - Cleaned up info data for improved parsability/readability.
- Changes to the Gatherer:
    - Added support to follow HTTP redirection pointers.
    - Added support for $http_proxy environment variable in liburl.
    - Added support for summarizing SGML data.
    - Added better support for summarizing TeX data.
    - Added support for summarizing RTF and MIF data, using Rainbow software provided by EBT, which we make available in our new components distribution
    - Added support for summarizing WordPerfect 5.1 data.
    - Changed HTML summarizing to use SGML summarizer, providing more easily customizable results
    - Added support for local filesystem gathering for NNTP.
    - Improved incremental gathering support, and integrated the support into the Essence program (removed dbcheck program).
    - Added support for "fake" MD5 generation per SOIF object on external presentation unnesting streams (exploders) -- permits incremental gathering on data generated by an Exploder.
    - Added --memory-efficient to Essence to trade time for memory efficiency; this helps users who have limited memory resources but are dealing with large SOIF objects.
    - Added --confirm-host to Essence for explicit host DNS validation.
    - Added --max-refresh to Essence to limit refreshing activity.
    - RootNode enumerators generate RFC 1738 escaped URLs.
    - Improved performance of SOIF parsing.
    - Fixed bug in locating gzip in gatherd.
    - Fixed bug in the unnesting commands in Essence.
    - Fixed bug with HTTP/1.0 requests, now sends encoded URIs for GETs.
    - Fixed ftp.pl for Solaris. Wasn't setting PF_INET correctly.
- Changes to the Replicator:
    - Updated with USC's version from 3/15/95
- Changes to the User's Manual:
    - Added sections for new plug'n'play components: standard, SGML, HTML, MIF, RTF, WordPerfect 5.1.
    - Updated support policy.
    - Added clarification in Local Gathering section.
    - Added clarification in RootNode enumeration section.
    - Added clarification on Gatherer/Broker information flow.
    - Added clarification for some cached internals.
    - Added section on upgrading from v1.1 to v1.2.
    - Added discussion about httpd_accel for cached.
    - Updated info about software for the replicator section.
    - Updated numerous facts to v1.2.
    - Reorganized essence/content extraction customization section.
    - Added description of SGML summarizing and components distribution (including Rainbow software for MIF and RTF formats)
    - Added more troubleshooting comments to all sections.
    - Added more detail to cache and replication sections, including discussions of httpd-accelerator, CreateReplica, and some of the performance and failure-mode characteristics of the cache.
    - Cleared up inaccuracies and unclear passages in Gatherer RootNode specification section.
    - Added notes about user-contributed software.
    - Updated support policy.
    - Added index entries for all programs in appendices.
    - Other minor changes.
- Miscellaneous changes:
    - Reorganized the source tree to support plug'n'play components.

##############################################################################

Changes between release v1.1 (February 17, 1995) and v1.1.beta.v2:

- Changes to the Broker:
    - Added a leading protocol version header for the result set.
    - Added support for query flags during Broker-to-Broker collections.
    - Added support for limiting the lifetime of glimpse queries.
    - Fixed major bugs in Broker-to-Broker collections.
    - Fixed major bugs with deleting Registry entries during initial build.
    - Fixed memory leaks and file descriptor mgmt bugs in glimpseserver.
    - Fixed bug with -L in glimpseserver.
    - Fixed bug that increased the size of structured glimpse indexes.
    - Fixed bugs in the administrative interface and WAIS support.
    - Fixed core dump when searching the Registry during collections.
    - Fixed display SOIF links flag in BrokerQuery.pl.
    - Fixed .cgi pgms, so that httpd kills them cleanly after user abort.
    - Changed glimpseserver and broker so that they will not block longer than 15 seconds while waiting for an incoming connection. This prevents SunOS from blindly swapping out the process.
    - Optimized so that a full glimpseindex will only happen if more than 10% of the objects have changed.
    - Added some more logging output.
    - Fixed various minor bugs.
- Changes to the Cache:
    - Added Gopher->HTML support. For Mosaic proxy, you'll need to set gopher_proxy http://cache.server:3128/ instead of set gopher_proxy gopher://cache.server:3128/
    - Fixed bug with HTML-ify FTP directories using ftpget.pl.
    - Fixed bug with hierarchical problem for refreshing.
    - Fixed bogus client error message.
    - Improved cached error messages.
- Changes to the Gatherer:
    - Generates the 'Description' attribute whenever possible.
    - Fixed bug in the expiring of objects from the PRODUCTION database.
    - Fixed bug in httpenum that wasn't cleaning up correctly.
    - Fixed newsenum to obey URL-Max limit.
    - Improved the Mail summarizer.
    - Improved the USENET support, added NewsArticle and NewsGroup.
    - Improved gatherd to speed up SEND-UPDATE timestamp computation.
    - Improved preparation for the Gatherer's database to be exported.
    - Purify'd Essence to remove memory leaks.
- Changes to the User's Manual:
    - Updated the section on the Broker's Collection.conf file.
    - Updated many minor points.
    - Improved HTML version of the manual, by upgrading latex2html pgm.
- Miscellaneous changes:
    - Fixed problems with Solaris' socket.ph for Perl programs.
##############################################################################

Changes between release v1.1.beta.v2 (February 3, 1995) and v1.1.beta:

- Changes to the Broker:
    - Major performance improvements while doing collections.
    - Uses the customizable BrokerQuery.pl for the WWW interface.
    - Fixed major bugs in Broker-to-Broker transfers.
    - Fixed minor bug in collections that caused unnecessary indexing.
    - Cleaned and improved the information that is logged to broker.out.
    - Changed broker to run cleanly as a daemon by disconnecting from the controlling terminal.
    - glimpseserver now prints its error messages correctly.
    - Fixed various minor bugs.
- Changes to the Cache:
    - Fixed core dump bug when cached is heavily loaded.
    - Improved error messages.
- Changes to the Gatherer:
    - Site enumeration filter is based on host:port, and better argv processing for 'Gatherer' - fixes by "Albert Dvornik" <bert@MIT.EDU>
    - Major performance improvements while preparing databases.
    - Fixed Gatherer to change to Top-Directory before running.
    - Fixed Gatherer to write dummy index.html files in data/ and tmp/.
    - Fixed bug in HTTP enumeration to only extract links from HTML.
    - Fixed various minor bugs.
- Changes to the User's Manual:
    - Added detailed appendix on Harvest software layout and programs.
    - HTML version of the manual now contains the local copy of the icons.
    - Added section on customizing BrokerQuery.pl.
    - Fixed example for Filters during RootNode enumeration.
    - Added a search interface to the User's Manual using a Broker.
    - Updated index.
- Miscellaneous changes:
    - Improved log output format to be more readable.
    - Added HP-UX port/fixes from Chris Dalton (crd@hplb.hpl.hp.com).

##############################################################################

Changes between release v1.1.beta (January 26, 1995) and v1.0:

- Changes to the Broker:
    - Upgraded to Glimpse 2.1 which includes glimpseserver.
    - Added faster, more memory-efficient internal Registry lookups.
- Added support for switching the indexing subsystem at run-time. - Added a statistics generator for the Broker. - Fixed BrokerQuery.cgi so that the rejection message from the Broker while it's doing indexing works all of the time. - Fixed Broker bug that would cause the Broker to hang sometimes on a pclose() after doing a collection with the gather command. - Immediately denies outside connections during a collection, indexing, or other administrative operations. - Improved the HTML result set generated by BrokerQuery. - Pointers to content summaries in the result set are now an option. - Changed /brokers to /Harvest/brokers, etc. - Limit the time that the Glimpse search engine runs for a query. - Added Query.cgi which can be used to support Broker replicas. - Added support for minimal bookkeeping from Gatherer. - Fixed problems with the Broker's cleaning, added compress Registry. - Fixed problems with the Broker's updating of objects. - Fixed BrokerQuery syntax error message to point to queryhelp.html. - Fixed BrokerRestart for Replicator interface. - Fixed WWW interface to work with any document root. - Fixed various minor bugs. - Changes to the Cache: - Fixed serious hierarchical cache bug. - New error messages. HTTP/1.0 compliant. - Nuke If-Modified-Since to work with Netscape. - Non-blocking DNS lookup using dnsserver program. - New config parameter, cache_dns_program. - Removed Tcl library binaries - have a precompiled version of Harvest. - Fixed stat for outgoing message. - Use multiple directories for on-disk swap storage. - Changes to the Gatherer: - Added flexible support for specifying a Gatherer's workload. - Added support for gathering through the local file system. - Added support for USENET URLs. - Added INFO command to Gatherer for statistics. - Added support for generating minimal bookkeeping attributes. - Improved HTTP/1.0 support for MIME headers and Last-Modified headers. - Fixed bug with 'gather' that caused 'gunzip' decompression to fail.
- Made automatic keyword generation, and local disk cache maximum size a run-time flag. - Added a SOIF parser in Perl. - Changed HTML URL extractor from HTML.sum to separate program. - Fixed Gopher support to have longer read timeout. - Consolidated GDBM utilities into the 'gdbmutil' program. - Fixed bug with gatherd leaving zombie children. - Fixed various minor bugs. - Changes to the Replicator: - Replaced with USC's Replicator distribution. - Changes to the User's Manual: - Added a new subsection on Extended RootNode Specifications - Added discussion about new Local-Mapping support - Fixed various typos and clarified wording in various places - Fixed some URLs, and added others - Fixed the discussion on using Glimpse with the Broker. - Added a new subsection on the Perl SOIF library. - Added more descriptions about various system components (e.g., HSR) - Added more index entries, and clarified some of the existing entries - Added a note about realtime Gatherer updates - Added mention of cache RAM requirements - Added section on Support Policy and Harvest Team Contact Information - Updated copyright/licensing discussion - Added a section about the binary-only distribution - Changed section names and content at beginning to make it more clear and to make more sense with the new installation. - Reorganized manual by subsystem - Added troubleshooting sections to each subsystem, and shifted some stuff into there that had been in other places - Expanded section on supported platforms and software needed for running/building Harvest - Clarified some parts of the ``Querying a Broker'' section - Added appendix on Directory layout of installed Harvest software - Updated to reflect new httpd reorg - Updated default summarizer action list - Noted that glimpseserver is now part of the system - Added more discussion to replicator section, including a figure - Miscellaneous changes: - Reorganized Harvest's installed directory structure.
- Integrated port to AIX 3.2 and AIX cc by greving@dv.go.dlr.de. - Integrated port to HP-UX A.09.03 by steff@csc.liv.ac.uk. - Integrated port to IRIX 5.3 by leclerc@ai.sri.com. - Integrated port to Linux 1.1.59 by hardy@cs.colorado.edu. - Integrated port/fixes to HP-UX 09.03 and HP ANSI C compiler A.09.69 by crd@hplb.hpl.hp.com. - Changed all Perl scripts to work under Perl 4.x or 5.0. - Try to use vfork rather than fork to save memory when possible. - Updated Copyright. ############################################################################## Changes between release v1.0 (November 7, 1994) and v1.0-beta-1.5: - Changes to the Broker: - Upgraded Glimpse from version 1.1 to 2.0. - Added support for Glimpse 2.0 which allows byte-level indexing, limiting result set sizes, arbitrary Boolean queries, and more. - Made case insensitive and word matching the default for Glimpse. - Improved and updated queryhelp.html and adminhelp.html. - Added soifhelp.html to the help suite. - Added a reboot-broker tag to the default broker Makefile. - Fixed various minor bugs. - Changes to the Gatherer: - Better HTTP/1.0 support, sends User-Agent and From fields. - Fixed a problem with cross-site Gopher RootNode enumeration. - Fixed bug in HTTP RootNode enumeration. - Generation of unique, sorted keyword list is optional in config.h. - Changed Gatherer program to work around Solaris 2.3 Perl 4.036 bug. - Fixed various minor bugs. - Changes to the Cache: - Added support for the Netscape browser. - No longer caches /cgi-bin/ URLs. - Updated the Tcl/Tk/dpwish pointers for the Cache manager. - Changes to the User's Manual: - Added an index with over 300 entries. - Added a new section about Querying a Broker. - Added a new section about common SOIF attribute names. - Added a new section on periodic gathering. - Added a new section on tuning Glimpse. - Added a new section on the WWW interface to the Broker.
- Added a new section on integrating new search/indexing subsystems into the Broker, and gave a detailed interface description. - Added more detail to SOIF appendix. - Improved and updated the Administrating a Broker subsection. - Added more explanation about manual annotations. - Folded in content from FAQ. - Noted particular usefulness of the Essence-Options variable, e.g., for setting --full-text. - Added a note to the Customizing the candidate selection step subsection that it's particularly useful to do selection based on file and URL naming heuristics when gathering remote data, because it can avoid retrieving lots of data. - Added a note in the subsection on Running a Gatherer that you can set MAX_ENUM in src/common/include/config.h, and that in a future release of Harvest we will make it possible to set this limit more flexibly. Also noted the robot guidelines. - Added an overview about the lib and bin directories for the Gatherer, including the defaults and descriptions of each file. - Showed RunGatherer and RunGatherd scripts and added discussion of how to use them from cron and /etc/rc.local. - Added pointer to FAQ on setting up HTTPD in the Broker section. - Put the logo on the cover page. - Miscellaneous changes: - Updated the COPYRIGHT and added it to all appropriate source files. - Updated the FAQ, and converted to HTML. - Fixed BSD compatibility bug in src/install.sh. ############################################################################## Changes between release v1.0-beta-1.5 (October 14, 1994) and v1.0-beta-1.4: - Added a user manual that is intended to help both novice and advanced Harvest users better use the system.
It covers the following topics: - Introduction to Harvest (1 page) - Subsystem Overview (2 pages) - Getting and Installing the Harvest software (1 page) - Making Basic Use of Harvest (3 pages) - Advanced Features of Harvest (5 pages) - References (1 page) - Appendix on The Summary Object Interchange Format (SOIF) (3 pages) - Appendix on Essence Summarizer Actions (1 page) - Appendix on Gatherer Examples (6 pages) - Appendix on Broker's Query Manager and Collector Interface (2 pages) - Changes to the Broker: - Improved Broker installation, and added the CreateBroker program that automatically creates and configures a Harvest Broker based on a brief Question & Answer session with the user. - Improved the Mosaic interface to be more user-friendly. - Added support for duplicate removal based on MD5 values. - Made Query Manager and Administrative interface more extensible. - Rewrote the Broker registry to improve performance and readability. - Added the dumpregistry command to view the Broker's registry. - Added the test-broker command for simple testing of a Broker. - Added support for wais-8-b5, freeWAIS, commercial WAIS, and Nebula. - Cleaned up the admin.html and query.html files. - Cleaned up much of the code to make more extensible. - Fixed bug in the registry garbage collection. - Fixed major memory leak bugs. - Fixed various minor bugs. - Changes to the Cache: - Started using icp version_id 2 of the protocol. - Improved support for OSF/1 v2.0 on 64-bit DEC Alphas. - Added password support for administrative interface. - Fixed bug with FTP "Parent Directory", and cleaned up HTML for dirs. - Fixed various major bugs with hierarchical caching. - Fixed various minor bugs. - Changes to the Gatherer: - Added support for generating a sorted, unique keyword attribute, based on the Description, Partial-Text, or Keywords attribute. - Added an "allow only these types" in the Candidate Selection step. - Added stub Exploder type to help users use the unnesting step.
- Gatherer automatically creates a gatherd.cf file if needed. - Fixed major gatherd bug that caused. - Fixed various minor bugs and memory leaks. - Changes to the Replicator: - Working on instrumenting the code to measure performance. - Fixed various bugs. ############################################################################## ChangeLog,v 1.214 1996/02/01 06:35:49 duane Exp