Tags:
create new tag
, view all tags

Kino Search Engine Add-On

KinoSearch, A Perl search library, implementation of Lucene (Java implementation) search engine. This is the base of this indexed search engine for TWiki. With KinoSearch you create an index over all webs including attachments like Word, Excel and PDF. Based on that you get a really fast search over all topics and the attachments. You need this add-on

  • if your TWiki has grown so big, that normal search is too slow or
  • if you want to do search not only on the topics but also the attachments.

Screenshot of a search results list

KinoSearchResult.jpg

Usage

See the KinoSearch topic for a quick Usage and setup.

Searching with kinosearch

The kinosearch script uses a template kinosearch.pattern.tmpl (if you use the pattern skin). There is also a KinoSearch topic with a form ready to use with the kinosearch script.

If you have enabled the SearchEngineKinoSearchPlugin, you can use the rest handler either from a URL (this works only for a smaller TWiki), or the command line. The syntax is identical to the kinosearch script.

The following form submits text to the kinosearch script. The installation instructions are detailed below.

| Help

KINOSEARCH{...} -- search indexed topics and attachments

(experimental) Integrating KinoSearch into TWiki's internal SEARCH

integrated SEARCH results

By setting $TWiki::cfg{RCS}{SearchAlgorithm} = 'TWiki::Store::SearchAlgorithms::Kino'; (a setting in the Store settings section in configure), TWiki will use the KinoSearch index for any inbuilt search (including WebSearch) that it can (for regex searches it will fall back to the Forking search algorithm).

If you want TWiki's WebSearch to also show you attachment results (when you select the 'Both body and title' option), you need to also set {SearchEngineKinoSearchAddOn}{showAttachments}=1, and add kino to the front of your SKIN setting.

The reason this feature is experimental, is that kinosearch does not do partial matching, so searching for TAG will not match text like %TAG{"something"}%, only instances where the word TAG is seperated by whitespace. TWiki's SEARCH expects total partial matching.

Query syntax

  • To search for a word, just put that word into the Search box. (Alternatively, add the prefix text: before the word.)
  • To search for a phrase, put the phrase in "double quotes".
  • Use the + and - operators, just as in Google query syntax, to indicate required and forbidden terms, respectively.
  • To search on metadata, prefix the search term with field: where <field> is the field name in the metadata (for instance, author).

NOTE: KinoSearch tries to split the single words from composed things. Thus it reads "something-combined-together" as three words: "something combined together". The same is true for combinations with underscore. Thus "something_with_underscore" will be treated as "something with underscore". This feature is extremely usefull, as you can search for the single words and need not know the complete word (Note: KinoSearch has no possibility to search with wildcards!). But of course you need to know about it. If you want so search for "something-combined-together", you need to search for "something combined together". If you add also the " to the search string, you are sure, that the three words are in that order one after the other.

Query examples

  • text:kino or just kino
  • text:"search engine" or just "search engine"
  • author:MarkusHesse — note that to search for a TWiki author, use their login name
  • form:WebFormName to get all topics with that form attached.
  • CONTACTINFO:MarkusHesse if you have declared CONTACTINFO as a variable to be indexed
  • type:doc to get all attachments of given type
  • web:Main to get all the topics in a given web
  • topic:WebHome to get all the topics of a given name
  • +web:Sandbox +topic:Test to get all the topics containing "Test" in their titles and belonging to the Sandbox web.

Note: The current version of KinoSearch does not support wildcards.

Indexing

Note: The kinoindex, kinoupdate and kinosearch scripts will be deprecated over time in favour of the restHandlers, both for security reasons, and to make compatibility with TWiki 5.0 easier.

Creating a new Index

Each topic's text body, title, form fields and attached documents are indexed.

You should run this script manually after installation to create the index files used by KinoSearch. You can also schedule a weekly or monthly crontab job to create the index files again, or execute it manually when you take down your server for maintenance tasks. To prevent browser access, it has been placed out of the public bin folder.

  • cd twiki/kinosearch/bin ; ./kinoindex

If you have enabled the SearchEngineKinoSearchPlugin, you can use the rest handler either from a URL (this works only for a smaller TWiki), or the command line

Updating your Index

The kinoupdate script uses the web's .changes files to know about topic modifications. Also, a .kinoupdate file is used on each web directory storing the last timestamp the script was run on it. So when this script is executed, it first checks if there are any topic updates since last execution. The most recent topic updates are removed from the index and then reindexed again.

  • cd twiki/kinosearch/bin ; ./kinoupdate

If you have enabled the SearchEngineKinoSearchPlugin, you can use the rest handler either from a URL (this works only for a smaller TWiki), or the command line

This script should be executed by an hourly crontab. As before, this script has been placed out of the public bin folder.

# m h  dom mon dow   command
35  *  *   *   *     cd /path/to/you/twiki/bin ; ./rest SearchEngineKinoSearchPlugin.update

You can also optionally use SearchEngineKinoSearchPlugin's updateHandlers to automatically update the index whenever a topic is modified (or an attachment uploaded) by setting {SearchEngineKinoSearchPlugin}{EnableOnSaveUpdates} to true in the Extensions section of configure. Warning this can cause topic saves and attachments to become unacceptably slow, as the index update happens before the browser operation has completed.

Attachment file types to be indexed

All the PDF, HTML, DOC, XLS and text attachments are indexed by default. If you want to override this setting you can use a TWiki preference KINOSEARCHINDEXEXTENSIONS. You can copy & paste the next lines in your Main.TWikiPreferences topic

   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .html, .txt, .doc, .xls, .docx, .pptx, .xlsx,
or whatever extensions you want. If you add other file extensions, they are treated as ASCII files. If needed, you can add more specialised stringifiers for further document types ( see Indexing further document types).

Indexing of form fields

All form fields are indexed. For this, the form templates are checked and the included fields are indexed. Additionally the name of the form of a topic is stored in the field form_name. How to search for this is described below.

Note: With kinoupdate only the form fields that existed at the time the initial index was created are indexed. Thus if you add a form or if you add a new field to an existing form, you should create a new index with kinoindex.

Indexing further document types

The indexing of attached documents is realised in two steps:

  1. the content of the document is changed to an ASCII string. This is called stringification.
  2. this ASCII string is indexed with KinoSearch. This is the normal way in all index applications.

To index different types of documents, it is necessary to have specialised stringifiers, i.e. classes to extract the ASCII text out of the document. In this add-on, a plug-in mechanism is implemented, so that additional stringifiers can be added without changing the existing code. All stringifier plugins are stored in the directory lib/TWiki/Contrib/KinoSearch/StringifierPlugins.

You can add new stringifier plugins by just adding new files here. The minimum things to be implemented are:

  • The plugin must inherit from TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase
  • The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);
  • The plugin must implement the method $text = stringForFile ($filename)

Then you should add to the list in KINOSEARCHINDEXEXTENSIONS in TWikiPreferences. Now the defined document type should be indexed and the new stringifier should be used.

NOTE: If you just extend the list without having a special stringifier in place, this document type is treaded like an ASCII file. For binary document types, this may lead to problems (inpropper search results, long indexing times and potential indexing break downs).

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

Backend for indexing Word documents

Install a backend to stringify Word documents if you want to index Word documents. For this either install antiword, abiword or wvWare.

Note: This add-on comes with stringifiers for all three of them. Depending on what is installed, the right stringifiers is used.

Note2: If you install more than one of the three backends, you must select one of them in the configure {SearchEngineKinoSearchAddOn}{WordIndexer} setting.

Note2: If you do not install any of the mentioned backends, you should remove .doc from the KINOSEARCHINDEXEXTENSIONS variable.

To install antiword for Debian you can:

  • apt-get install antiword

To install abiword for Debian you can:

  • apt-get install abiword

To install wvWare for Debian you can:

  • apt-get install wv

Backends for PDF, PPT

Install xpdf and ppthtml, if you want to index attached PDF and PPT files:

  • For Debian you can use apt-get install xpdf-utils ppthtml
  • If you do not install xpdf, you should remove .pdf from the KINOSEARCHINDEXEXTENSIONS variable.
  • If you do not install ppthtml, you should remove .ppt from the KINOSEARCHINDEXEXTENSIONS variable.

Backends for DOCX, PPTX

Install docx2txt.pl (http://docx2txt.sourceforge.net/) and pptx2txt.pl (http://pptx2txt.sourceforge.net) at appropriate paths.  By default, in most of linux/unix system, the tool can go into ==/usr/bin directory.

Installation of additional CPAN modules

You need to install the following modules: KinoSearch, File::MMagic, Module::Pluggable, HTML::TreeBuilder and Spreadsheet::ParseExcel

You can do that by running:

perl -MCPAN -e "install KinoSearch"
perl -MCPAN -e "install File::MMagic"
perl -MCPAN -e "install Module::Pluggable"
perl -MCPAN -e "install HTML::TreeBuilder"
perl -MCPAN -e "install Spreadsheet::ParseExcel"
perl -MCPAN -e "install CharsetDetector"
perl -MCPAN -e "install Encode"
perl -MCPAN -e "Spreadsheet::XLSX"
perl -MCPAN -e "Text::Iconv"

Note for Windows: For Windows, make sure you have a C-compiler in place. This is normally part of Visual Studio etc.

Installation of the add on itself

Like many other TWiki extensions, this module is shipped with a automatic installer script written using the BuildContrib.

  • If you have TWiki 4.1 or later, you can install from the configure interface (Go to Plugins->Find More Extensions)
    • The webserver user has to have permission to write to all areas of your installation for this to work.
  • If you have a permanent connection to the internet, you are recommended to use the automatic installer script
    • Just download the BuildContrib_installer perl script and run it.
  • Notes:
    • The installer script will
      • copy files into the right places in your local install (even if you have renamed data directories),
      • check in new versions of any installed files that have existing RCS histories files in your existing install (such as topics).
      • If the $TWIKI_PACKAGES environment variable is set to point to a directory, the installer will try to get archives from there. Otherwise it will try to download from twiki.org or cpan.org, as appropriate.
      • (Developers only: the script will look for twikiplugins/BuildContrib/BuildContrib.tgz before downloading from TWiki.org)
    • If you don't have a permanent connection, you can still use the automatic installer, by downloading all required TWiki archives to a local directory.
      • Point the environment variable $TWIKI_PACKAGES to this directory, and the installer script will look there first for required TWiki packages. # $TWIKI_PACKAGES is actually a path; you can list several directories separated by :
      • If you are behind a firewall that blocks access to CPAN, you can pre-install the required CPAN libraries, as described at http://twiki.org/cgi-bin/view/TWiki/HowToInstallCpanModules
  • If you don't want to use the installer script, or have problems on your platform (e.g. you don't have Perl 5.8), then you can still install manually:
    1. Download and unpack one of the .zip or .tgz archives to a temporary directory.
    2. Manually copy the contents across to the relevant places in your TWiki installation.
    3. Repeat from step 1 for any missing dependencies.
  • use configure to configure the advanced features
    • enable SearchEngineKinoSearchPlugin
    • {SearchEngineKinoSearchAddOn}{showAttachments}
    • {SearchEngineKinoSearchPlugin}{EnableOnSaveUpdates}
    • {SearchEngineKinoSearchAddOn}{WordIndexer}
    • $TWiki::cfg{RCS}{SearchAlgorithm} = 'TWiki::Store::SearchAlgorithms::Kino';
    • Set SKIN=kino,tagme, topmenu, pattern

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server where TWiki is running.

Like many other TWiki extensions, this module is shipped with a fully automatic installer script written using the BuildContrib.

  • If you have TWiki 4.2 or later, you can install from the configure interface (Go to Plugins->Find More Extensions)
  • If you have any problems, then you can still install manually from the command-line:
    1. Download one of the .zip or .tgz archives
    2. Unpack the archive in the root directory of your TWiki installation.
    3. Run the installer script ( perl <module>_installer )
    4. Run configure and enable the module, if it is a plugin.
    5. Repeat for any missing dependencies.
  • If you are still having problems, then instead of running the installer script:
    1. Make sure that the file permissions allow the webserver user to access all files.
    2. Check in any installed files that have existing ,v files in your existing install (take care not to lock the files when you check in)
    3. Manually edit LocalSite.cfg to set any configuration variables.

Configuration

This add-on uses several preferences which should be set at Main.TWikiPreferences. All these preferences are optional. If you are fine with the default values given below, you need not change anything.

(Note, these are not where the defaults are set)
   *  KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .doc, .xml, .html, .txt, .xls, .ppt, .pptx, .docx, .xlsx
      * Set KINOSEARCHSEARCHATTACHMENTSONLY = 0
      * Set KINOSEARCHSEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set KINOSEARCHINDEXSKIPWEBS = Trash, Sandbox
      * Set KINOSEARCHINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
      * Set KINOSEARCHANALYSERLANGUAGE = en
      * Set KINOSEARCHSUMMARYLENGTH = 300
      * Set KINOSEARCHDEBUG = 0
      * Set KINOSEARCHMAXLIMIT = 2000
      * Set KINOSEARCH_ATTACHMENT_INDEX_SIZELIMIT = 2000

You can also configure (The Extensions:SearchEngineKinoSearchAddOn section) where the index and log files are created. Note: The directories must exist.

$TWiki::cfg{KinoSearchLogDir} = '/home/httpd/twiki/kinosearch/logs';
$TWiki::cfg{KinoSearchIndexDir} = '/home/httpd/twiki/kinosearch/index';

Remember to edit the file kinosearch/bin/LocalLib.cfg and modify twikiLibPath accordingly to your configuration

Skipping of Attachments with problems in stringifying

A few times its impossible to stringify the attachments. In this case best way would be to skip the attachments from indexing. In case our stringy libraries fail to stringy, they add details of the attachment in TWiki's work area under SearchEngineKinoSearchAddOn directory (e.g. /var/www/twiki/working/work_areas/SearchEngineKinoSearchAddOn directory if your TWIKI_ROOT is /var/www/twiki and $TWiki::cfg{WorkingDir} = /var/www/twiki/working )

These attachments are skipped during next time indexing.

Test of the installation

  • Test if the installation was successful:
    • Check that antiword, abiword or wvHtml is in place: Type antiword, abiword or wvHtml on the prompt and check that the command exists.
    • Check that pdftotext is in place: Type pdftotext on the prompt and check that the command exists.
    • Check that ppthtml is in place: Type ppthtml on the prompt and check that the command exists.
    • Change the working directory to the kinosearch/bin twiki installation directory.
    • Run ./kinoindex
    • Once finished, open a browser window and point it to the TWiki/KinoSearch topic.
    • Just type a query and check the results.

Test of stringification with ks_test

Some users report problems with the stringification: The kinoindex scipts fails, takes too long on attachments or kinosearch does not yield correct results. Some times this may result from installation errors esp. of the installation of the backends for the stringification.

ks_test give you the opportunity to test the stringification in advance.

Usage: ks_test stringify file_name

(I plan to extend ks_test, but at the moment the only possible second parameter is stringify).

In the result you see, which stringifier is used and the result of the stringification.

Example:

/home/httpd/twiki/kinosearch/bin$ ./ks_test stringify /home/httpd/twiki_svn/SearchEngineKinoSearchAddOn/test/unit/SearchEngineKinoSearchAddOn/attachement_examples/Simple_example.doc

Used stringifier: TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::DOC_antiword

Stringified text:

  Simple example  Keyword: dummy  Umlaute: Größer, Überschall, Änderung

You see that the stringifier DOC_antiword is used and the resulting text seems to be O.K.

Add-On Info

  • Set SHORTDESCRIPTION = Fast indexed SEARCH of topics and attachments (eg Word, Excel, PDF and PPT)

Add-on Author: TWiki:Main/MarkusHesse, TWiki:Main.SvenDowideit, TWiki:Main.SopanShewale
Add-on Version: 2012-11-13
Copyright: © 2007-2009 TWiki:Main.DavidGuest
© 2009 Twiki, Inc
© 2009-2012 TWiki:TWiki.TWikiContributor
License: GPL (GNU General Public License)
Change History:  
2012-11-13: TWikibug:Item7020: Categorize TWiki Variable -- TWiki:Main.PeterThoeny
9 Oct 2009: version 1.19, added support to index documents of type .docx, .pptx, .xlsx
Bug:Item6177:Attachments with issues to stringify are added into work area, they are skipped from indexing next time
20 Aug 2008: v 1.18, added Integrated SEARCH, SearchEngineKinoSearchPlugin, restHandlers, updated code and tests -- TWiki:Main.SvenDowideit
6 Aug 2008: v 1.17, Bugs:Item5717: persist use form choices, Bugs:Item5647: cope better with attachment problems -- TWiki:Main.SvenDowideit
4 Jun 2008: v 1.16, Bugs:Item5646: Problem with attachments with capital letter suffix
12 May 2008: v 1.15, Bugs:Item5579, Bugs:Item5580, Bugs:Item5619: Problem with ALLOWWEBVIEW and Forms fixed
23 Apr 2008: v 1.14, Bugs:Item5273, Bugs:Item5546, Bugs:Item5550, Bugs:Item5552: Use current user in search script
27 Jan 2008: v 1.13, Bugs:Item5271: Option "show locked topics" now works
19 Jan 2008: v 1.12, Bugs:Item5270: Enhancement of stringifiers
19 Dec 2007: v 1.11, Additions on stringifiers, modification of output format
17 Nov 2007: v 1.10, PPT stringifier added
11 Nov 2007: v 1.09, Some bugfixing
3 Nov 2007: v 1.08, Some bugfixing
7 Oct 2007: v 1.07, Some bugfixing
6 Oct 2007: v 1.06, Upgrade for 4.1, Release with BuildContrib
29 Sep 2007: v 1.05, Indexing of form fields
16 Sep 2007: v 1.04, Stringifier plugins for doc, xls and html
13 Sep 2007: v 1.03, Indexing of PDF and TXT attachments
08 Sep 2007: v 1.02, Index and update script enhanced
24 Aug 2007: v 1.01, Update script included, Result uses highlighter
14 Aug 2007: Initial version (v1.000)
CPAN Dependencies: CPAN:KinoSearch
  CPAN:File::MMagic
  CPAN:Module::Pluggable
  CPAN:HTML::TreeBuilder
  CPAN:Spreadsheet::ParseExcel
  CPAN:CharsetDetector
  CPAN:Encode
Other Dependencies: pdftotext (part of xpdf-utils)
  antiword, abiword or wvWare
  ppthtml
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOnAppraisal

Related Topics: VarKINOSEARCH, SearchEngineKinoSearchPlugin, VarSEARCH

Topic revision: r0 - 2012-11-14 - TWikiContributor
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
Note: Please contribute updates to this topic on TWiki.org at TWiki:TWiki.SearchEngineKinoSearchAddOn.