Difference: SearchEnginePluceneAddOn (1 vs. 13)

Revision 13, 2006-06-27 - JoanMVigo

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web.

Some time ago I found Plucene, which is a Perl port of the Java library Lucene. This add-on is intended to be a topic/attachment search engine, with Plucene as its backend.

I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this add-on has a release for each major TWiki version, namely Cairo and Dakar.

Usage

Indexing with plucindex

The plucindex script indexes all the public webs. It uses some TWiki::Func code to retrieve the list of available webs and the topic list of each web. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed as well (see below for more details).

For now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly cron job to rebuild the index files, or run the script manually when you take your server down for maintenance. To prevent browser access, the script has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, much as the old mailnotify script did. In addition, a .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether any topics have been updated since the last execution. The topics updated since then are removed from the index and then indexed again (the same goes for their attachments).

This script should be executed by an hourly cron job. Like plucindex, it has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev
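
The update check described above can be sketched in a few lines. This is only an illustration in Python, not the add-on's Perl code. It assumes that each .changes line holds tab-separated topic, user, epoch timestamp and revision fields, and the function name topics_changed_since is hypothetical.

```python
from typing import List


def topics_changed_since(changes_text: str, last_run: int) -> List[str]:
    """Return topics modified after last_run, newest first, without duplicates.

    Assumes each line is "topic<TAB>user<TAB>epoch<TAB>revision"; this layout
    is an assumption about the .changes file, not taken from the add-on.
    """
    seen = set()
    changed = []
    # Walk the log backwards so the newest entry for each topic is seen first.
    for line in reversed(changes_text.splitlines()):
        fields = line.split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        topic, stamp = fields[0], int(fields[2])
        if stamp > last_run and topic not in seen:
            seen.add(topic)
            changed.append(topic)
    return changed


sample = (
    "WebHome\tJoanMVigo\t1151400000\t3\n"
    "PluceneSearch\tJoanMVigo\t1151403600\t1\n"
    "WebHome\tSopanShewale\t1151407200\t4\n"
)
# last_run plays the role of the timestamp stored in the web's .plucupdate file.
print(topics_changed_since(sample, 1151401000))
```

Each topic returned would then be removed from the index and indexed again, and the stored timestamp would be advanced to the time of the current run.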

Attachment file types to be indexed

All PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. The dot before each extension is required. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc
Use whatever extensions you want. By default, Plucene comes with PDF, HTML and TXT file support. However, PDF support needs additional software to be installed (see the installation instructions).

For other file types you may need additional CPAN:Plucene::SearchEngine::Index libraries and third-party tools such as antiword or xlhtml, which provide the required text extraction capabilities. You can find or post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.
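
As a rough illustration of how a value such as the one above could be interpreted, the following Python sketch splits the preference on commas and checks an attachment's file name against the resulting set. The function names and the case-insensitive comparison are assumptions made for this example; the add-on's actual Perl code may behave differently.

```python
import os

# Default taken from the text above: PDF, HTML and TXT attachments.
DEFAULT_EXTENSIONS = ".pdf, .html, .txt"


def parse_extensions(setting: str) -> set:
    """Turn ".pdf, .html, .txt" into {".pdf", ".html", ".txt"}."""
    return {part.strip().lower() for part in setting.split(",") if part.strip()}


def should_index(filename: str, extensions: set) -> bool:
    """True if the file's extension (with its dot) is in the configured set."""
    _, ext = os.path.splitext(filename)
    # Lowercasing is an assumption of this sketch, not documented behaviour.
    return ext.lower() in extensions


extensions = parse_extensions(".pdf, .html, .txt, .doc")
print(should_index("Manual.PDF", extensions))
print(should_index("photo.png", extensions))
```

Because the comparison includes the leading dot, a value such as "pdf" without the dot would never match, which is consistent with the note that the dot is required.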

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved:

  • you can use + for AND and - for AND NOT
  • you can limit the search to the topic body or attachment body by using the prefix text: or by just typing the search string
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name (like author)
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which is used in the PluceneSearch topic to search again only displaying attachments
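
The field prefixes above suggest that every indexed item is stored as a set of named fields. The Python sketch below shows one way to picture that. The fields text, type and attachment come from the list above, while web, topic and name, as well as both function names, are hypothetical and only serve the illustration.

```python
import os


def topic_document(web, topic, body, meta, form_fields):
    """Build the field set for a topic: body text, meta data and form fields."""
    doc = {"text": body}
    doc.update(meta)         # e.g. {"author": "JoanMVigo"}
    doc.update(form_fields)  # e.g. {"TopicClassification": "ItemToDo"}
    doc["web"] = web         # hypothetical field name
    doc["topic"] = topic     # hypothetical field name
    return doc


def attachment_document(web, topic, filename, extracted_text):
    """Build the field set for an attachment, including type and attachment."""
    file_type = os.path.splitext(filename)[1].lstrip(".").lower()
    return {
        "web": web,
        "topic": topic,
        "name": filename,      # hypothetical field name
        "text": extracted_text,
        "type": file_type,     # allows queries such as type:pdf
        "attachment": "yes",   # allows queries such as attachment:yes
    }


doc = attachment_document("Main", "WebHome", "Guide.pdf", "perl tips")
print(doc["type"], doc["attachment"])
```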

Query examples (just type them into the PluceneSearch topic on your site):

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files
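
To make the examples above concrete, here is a simplified Python sketch of how such a query string breaks down into terms. It is not Plucene's query parser. It ignores phrases and grouping, and the Term type, the mode names and the choice to treat unprefixed words as optional are assumptions of this example.

```python
from typing import List, NamedTuple


class Term(NamedTuple):
    field: str  # field to search; "text" when no prefix is given
    value: str  # word to look for
    mode: str   # "must" for +, "must_not" for -, otherwise "may"


def parse_query(query: str, default_field: str = "text") -> List[Term]:
    """Split a query on whitespace and classify each token."""
    terms = []
    for token in query.split():
        mode = "may"
        if token.startswith("+"):
            mode, token = "must", token[1:]
        elif token.startswith("-"):
            mode, token = "must_not", token[1:]
        field, sep, value = token.partition(":")
        if not sep:
            # No "field:" prefix, so the whole token is the search word.
            field, value = default_field, token
        if value:
            terms.append(Term(field, value, mode))
    return terms


for term in parse_query("+perl -type:pdf +attachment:yes"):
    print(term)
```

Read this way, the last example requires perl in the text, requires the attachment field to be yes, and excludes anything whose type is pdf.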

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • skip unwanted webs when indexing (with a new preference PLUCENEINDEXSKIPWEBS)
    • all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown in the search results
  • skip annoying or unindexable attachments when indexing (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO = JohnSmith in their WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

To request further features, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev
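
The two skip preferences above can be pictured with a small Python sketch. It assumes that both values are comma-separated lists and that attachments are named as Web.Topic.FileName, as in the example settings in the installation instructions. The function names are hypothetical and the add-on's Perl code may match entries differently.

```python
def parse_list(setting: str) -> list:
    """Split a comma-separated preference value into trimmed items."""
    return [item.strip() for item in setting.split(",") if item.strip()]


def is_skipped(web, topic, attachment, skip_webs, skip_attachments) -> bool:
    """True if the web, or this particular attachment, is excluded.

    Pass attachment=None to test a topic rather than an attachment.
    """
    if web in skip_webs:
        return True
    if attachment is not None:
        # Attachments are assumed to be listed as Web.Topic.FileName.
        return f"{web}.{topic}.{attachment}" in skip_attachments
    return False


skip_webs = parse_list("Trash, Sandbox")
skip_attachments = parse_list(
    "Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf"
)
print(is_skipped("Sandbox", "TestTopic", None, skip_webs, skip_attachments))
print(is_skipped("Web", "SomeTopic", "AnAttachment.txt", skip_webs, skip_attachments))
```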

Search form

The following form submits text to the plucsearch script. The installation instructions are detailed below.

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • You can install Plucene and its dependencies by running:
    • perl -MCPAN -e "install Plucene"
    • perl -MCPAN -e "install Plucene::SearchEngine"
  • Install third-party text extraction tools, such as xpdf, which provides pdftotext. OPTIONAL: You may wish to install additional CPAN:Plucene::SearchEngine::Index libraries so that this add-on can index further file types. More information at TWiki:Plugins/SearchEnginePluceneAddOnDev#ExtraBackendParsers
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your TWiki installation directory. Content:
    | File: | Description: |
    | bin/plucsearch | script that searches the index files |
    | data/TWiki/PluceneSearch.txt | Plucene search topic |
    | data/TWiki/PluceneSearch.txt,v | Plucene search topic repository |
    | data/TWiki/SearchEnginePluceneAddOn.txt | Add-on topic |
    | data/TWiki/SearchEnginePluceneAddOn.txt,v | Add-on topic repository |
    | templates/plucsearch.pattern.tmpl | template used by the new search script for the pattern skin |
    | plucene/bin/LocalLib.cfg | this file is required and should be modified according to the absolute twiki/lib path of your installation |
    | plucene/bin/plucindex | script that indexes all topics and PDF/HTML/TXT attachments |
    | plucene/bin/plucupdate | script that uses each web's .changes file to update the index |
    | plucene/index/ | directory where the index files are stored |
    | plucene/logs/ | the index and update logs are written here; the administrator should monitor this folder |

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc
      * Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index
      * Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub
      * Set PLUCENESEARCHATTACHMENTSONLY = 1
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
      * Set PLUCENEINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
      * Set PLUCENEDEBUG = 1
  • ATTENTION! Remember to edit the file plucene/bin/LocalLib.cfg and modify twikiLibPath according to your configuration
  • Test if the installation was successful:
    • change the working directory to plucene/bin under your TWiki installation directory
    • run ./plucindex
    • once it has finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • type a query and check the results
  • Finally, create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Changed:
< Add-on Version: 27 Jun 2006 (v2.200 for Dakar, v1.400 for Cairo)
> Add-on Version: 27 Jun 2006 (v2.200 for Dakar, v1.500 for Cairo)
 
Change History:
<-- versions below in reverse order -->
 
Changed:
< 27 Jun 2006: TWikiDakar (v2.200) - Searching issue solved when using template authentication
> 27 Jun 2006: TWikiDakar (v2.200) - Searching issue solved when using template authentication, update index bug solved
Added:
> 27 Jun 2006: TWikiCairo (v1.500) - Update index bug solved
 
21 Mar 2006: TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved
03 Mar 2006: TWikiDakar (v2.000) & TWikiCairo (v1.300)
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: CPAN:Bit::Vector::Minimal, CPAN:IO::Scalar, CPAN:Lingua::GL::Stemmer, CPAN:Lingua::PT::Stemmer, CPAN:Lingua::Stem::Fr, CPAN:Lingua::Stem::It, CPAN:Lingua::Stem::Ru, CPAN:Lingua::Stem::Snowball::Da, CPAN:Lingua::Stem::Snowball::No, CPAN:Lingua::Stem::Snowball::Se, CPAN:Text::German, CPAN:Lingua::Stem::En, CPAN:Tie::Array::Sorted, CPAN:Time::Piece, CPAN:Plucene, CPAN:Plucene::SearchEngine
Other Dependencies: xpdf (pdftotext) and additional third-party tools for text extraction
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

-- TWiki:Main/JoanMVigo - 27 Jun 2006

Revision 12, 2006-06-27 - JoanMVigo

 


Revision 11, 2006-06-27 - JoanMVigo

 


Revision 10, 2006-06-27 - JoanMVigo

 

Plucene Search Engine Add-On

TWiki original search engine is a simple yet powerful tool. However, it can not search within attached documents. That has been discused in many topics in the Codev web:

Time ago I found Plucene, which is a Perl port of the java library Lucene. So this plugin/addon intends to be a topic/attachment search engine, with Plucene as its backend.

I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this plugin have a release for each TWiki major version, namely Cairo and Dakar.

Usage

Indexing with plucindex

The plucindex script indexes all the public webs, and it uses some TWiki::Func code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (see below for more details).

By now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly crontab job to create the index files again, or maybe execute it manually when you take down your server for maintenance tasks. To prevent browser access, it has been placed out of the public bin folder.

Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, in the same way the old mailnotify script worked. In addition, a .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether any topics have been updated since the last execution. The most recently updated topics are removed from the index and then reindexed (the same goes for their attachments).

This script should be run from an hourly crontab job. Like plucindex, it has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev
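
The update logic described above can be sketched roughly as follows. This is only an illustration in Python, not the plucupdate code itself (which is a Perl script). It assumes that each .changes line holds tab-separated fields with the topic name first and an epoch timestamp third, and that .plucupdate holds a single epoch timestamp. The function names are hypothetical.

```python
from pathlib import Path
from typing import List


def read_last_run(web_dir: Path) -> int:
    """Return the timestamp stored in .plucupdate, or 0 if there is none."""
    marker = web_dir / ".plucupdate"
    if not marker.exists():
        return 0
    text = marker.read_text().strip()
    return int(text) if text else 0


def changed_topics(web_dir: Path, last_run: int) -> List[str]:
    """List topics whose .changes entry is newer than last_run, without duplicates."""
    changes = web_dir / ".changes"
    if not changes.exists():
        return []
    topics: List[str] = []
    for line in changes.read_text().splitlines():
        fields = line.split("\t")
        # Assumed layout: topic, user, epoch seconds, revision.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        topic, stamp = fields[0], int(fields[2])
        if stamp > last_run and topic not in topics:
            topics.append(topic)
    return topics


def update_web(web_dir: Path, now: int) -> List[str]:
    """Return the topics to reindex and record `now` as the new last-run time."""
    topics = changed_topics(web_dir, read_last_run(web_dir))
    (web_dir / ".plucupdate").write_text(str(now))
    return topics
```

A caller would remove each returned topic (and its attachments) from the index and then index it again, as the paragraph above describes.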

Attachment file types to be indexed

PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. The dot before each extension is required. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc
or whatever extensions you want. By default, Plucene comes with PDF, HTML and TXT file support. However, PDF support needs additional software to be installed (see the installation instructions).

You may need additional CPAN:Plucene::SearchEngine::Index libraries and additional third party tools, such as antiword or xlhtml, which provide the required text extraction capabilities. You can find or post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.
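
As a rough illustration of how a comma-separated setting such as PLUCENEINDEXEXTENSIONS could be applied to a list of attachment names, here is a minimal Python sketch. It is not the add-on's own code. The function names are hypothetical, and the case-insensitive comparison is an assumption, not something this topic states.

```python
import os
from typing import List, Set

# Default taken from the text above: PDF, HTML and text attachments.
DEFAULT_EXTENSIONS = ".pdf, .html, .txt"


def parse_extensions(setting: str) -> Set[str]:
    """Split a value like '.pdf, .html' into a set of lowercase extensions."""
    return {part.strip().lower() for part in setting.split(",") if part.strip()}


def indexable(attachments: List[str], setting: str = DEFAULT_EXTENSIONS) -> List[str]:
    """Keep only the attachments whose extension (with its dot) is listed."""
    allowed = parse_extensions(setting)
    return [name for name in attachments
            if os.path.splitext(name)[1].lower() in allowed]
```

Because the extension returned by os.path.splitext includes the leading dot, a setting written without dots would match nothing in this sketch, which mirrors the requirement stated above.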

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved:

  • you can use + for "and" and - for "and not"
  • you can limit the search to the topic body or attachment body by using the prefix text:, or by just typing the search string
  • to search on meta data, use the prefix field: where field is the meta data name (like author)
  • to search on a form field, use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which the PluceneSearch topic uses to repeat a search displaying only attachments

Query examples (just type them into your site's PluceneSearch topic)

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files
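
To make the syntax above concrete, the following Python sketch splits a query string into required, excluded and plain terms, each with its field. It is only an illustration of the rules listed above, not the parser Plucene uses. The function name and the "optional" bucket for unprefixed terms are assumptions.

```python
from typing import Dict, List, Tuple

Term = Tuple[str, str]  # (field, value)


def parse_query(query: str) -> Dict[str, List[Term]]:
    """Group whitespace-separated terms by their + / - prefix and field."""
    parsed: Dict[str, List[Term]] = {"required": [], "excluded": [], "optional": []}
    for token in query.split():
        bucket = "optional"
        if token[0] == "+":
            bucket, token = "required", token[1:]
        elif token[0] == "-":
            bucket, token = "excluded", token[1:]
        field, sep, value = token.partition(":")
        if not sep:
            # No field prefix: the term is searched in the text body.
            field, value = "text", token
        if not field or not value:
            continue  # ignore incomplete terms such as "+" or "type:"
        parsed[bucket].append((field, value))
    return parsed
```

With this sketch, the last example above yields perl (in text) and attachment:yes as required terms and type:pdf as an excluded term.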

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • exclude webs that are not useful from the index (with a new preference PLUCENEINDEXSKIPWEBS)
    • all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown in the results
  • exclude annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO = JohnSmith in their WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

To request further features, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Search form

The following form submits text to the plucsearch script. The installation instructions are detailed below.

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • You can install Plucene and its dependencies by running:
    • perl -MCPAN -e "install Plucene"
    • perl -MCPAN -e "install Plucene::SearchEngine"
  • Install third party text extraction tools, such as xpdf, which provides pdftotext. Optional: you may wish to install additional CPAN:Plucene::SearchEngine::Index libraries so that this add-on can index other file types. More information at TWiki:Plugins/SearchEnginePluceneAddOnDev#ExtraBackendParsers
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    | File: | Description: |
    | bin/plucsearch | script that searches the index files |
    | data/TWiki/PluceneSearch.txt | Plucene search topic |
    | data/TWiki/PluceneSearch.txt,v | Plucene search topic repository |
    | data/TWiki/SearchEnginePluceneAddOn.txt | Add-on topic |
    | data/TWiki/SearchEnginePluceneAddOn.txt,v | Add-on topic repository |
    | templates/plucsearch.pattern.tmpl | template used by new search script for the pattern skin |
    | plucene/bin/LocalLib.cfg | this file is required and should be modified according to the absolute twiki/lib path of your installation |
    | plucene/bin/plucindex | script that indexes all topics and PDF/HTML/TXT attachments |
    | plucene/bin/plucupdate | script that uses web's .changes files to update the index |
    | plucene/index/ | directory for index files to be stored |
    | plucene/logs/ | the index and update logs will be written here - admin should monitor this folder |

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc
      * Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index
      * Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub
      * Set PLUCENESEARCHATTACHMENTSONLY = 1
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
      * Set PLUCENEINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
      * Set PLUCENEDEBUG = 1
  • ATTENTION! Remember to edit the file plucene/bin/LocalLib.cfg and set twikiLibPath according to your configuration
  • Test if the installation was successful:
    • change the working directory to plucene/bin under your twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • type a query and check the results
  • Create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Changed:
<
<
Add-on Version: 21 Mar 2006 (v2.100 for Dakar, v1.400 for Cairo)
>
>
Add-on Version: 27 Jun 2006 (v2.200 for Dakar, v1.400 for Cairo)
 
Change History:
<-- versions below in reverse order -->
 
Added:
>
>
27 Jun 2006: TWikiDakar (v2.200) - Searching issue solved when using template authentication
 
21 Mar 2006: TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved
03 Mar 2006: TWikiDakar (v2.000) & TWikiCairo (v1.300)
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: CPAN:Bit::Vector::Minimal, CPAN:IO::Scalar, CPAN:Lingua::GL::Stemmer, CPAN:Lingua::PT::Stemmer, CPAN:Lingua::Stem::Fr, CPAN:Lingua::Stem::It, CPAN:Lingua::Stem::Ru, CPAN:Lingua::Stem::Snowball::Da, CPAN:Lingua::Stem::Snowball::No, CPAN:Lingua::Stem::Snowball::Se, CPAN:Text::German, CPAN:Lingua::Stem::En, CPAN:Tie::Array::Sorted, CPAN:Time::Piece, CPAN:Plucene, CPAN:Plucene::SearchEngine
Other Dependencies: xpdf (pdftotext) and additional 3rd party tools for text extracting
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal
Changed:
<
<
-- TWiki:Main/JoanMVigo - 21 Mar 2006
>
>
-- TWiki:Main/JoanMVigo - 27 Jun 2006
 

Revision 9 - 2006-03-21 - TWikiGuest

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web.

Some time ago I found Plucene, a Perl port of the Java library Lucene. This add-on is intended to be a topic and attachment search engine with Plucene as its backend.

I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this add-on has a release for each major TWiki version, namely Cairo and Dakar.

Usage

Indexing with plucindex

The plucindex script indexes all the public webs. It uses TWiki::Func code to retrieve the list of available webs and the topic list of each web. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed as well (see below for more details).

For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take down your server for maintenance. To prevent browser access, the script has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, in the same way the old mailnotify script worked. In addition, a .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether any topics have been updated since the last execution. The most recently updated topics are removed from the index and then reindexed (the same goes for their attachments).

This script should be run from an hourly crontab job. Like plucindex, it has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Attachment file types to be indexed

PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. The dot before each extension is required. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc
or whatever extensions you want. By default, Plucene comes with PDF, HTML and TXT file support. However, PDF support needs additional software to be installed (see the installation instructions).

You may need additional CPAN:Plucene::SearchEngine::Index libraries and additional third party tools, such as antiword or xlhtml, which provide the required text extraction capabilities. You can find or post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved:

  • you can use + for "and" and - for "and not"
  • you can limit the search to the topic body or attachment body by using the prefix text:, or by just typing the search string
  • to search on meta data, use the prefix field: where field is the meta data name (like author)
  • to search on a form field, use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which the PluceneSearch topic uses to repeat a search displaying only attachments

Query examples (just type them into your site's PluceneSearch topic)

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • exclude webs that are not useful from the index (with a new preference PLUCENEINDEXSKIPWEBS)
    • all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown in the results
  • exclude annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO = JohnSmith in their WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

To request further features, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Search form

The following form submits text to the plucsearch script. The installation instructions are detailed below.

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • You can install Plucene and its dependencies by running:
    • perl -MCPAN -e "install Plucene"
    • perl -MCPAN -e "install Plucene::SearchEngine"
Changed:
<
<
  • Install third party tools, like xpdf which provides pdftotext
>
>
 
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    | File: | Description: |
    | bin/plucsearch | script that searches the index files |
    | data/TWiki/PluceneSearch.txt | Plucene search topic |
    | data/TWiki/PluceneSearch.txt,v | Plucene search topic repository |
    | data/TWiki/SearchEnginePluceneAddOn.txt | Add-on topic |
    | data/TWiki/SearchEnginePluceneAddOn.txt,v | Add-on topic repository |
    | templates/plucsearch.pattern.tmpl | template used by new search script for the pattern skin |
    | plucene/bin/LocalLib.cfg | this file is required and should be modified according to the absolute twiki/lib path of your installation |
    | plucene/bin/plucindex | script that indexes all topics and PDF/HTML/TXT attachments |
    | plucene/bin/plucupdate | script that uses web's .changes files to update the index |
    | plucene/index/ | directory for index files to be stored |
    | plucene/logs/ | the index and update logs will be written here - admin should monitor this folder |

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc
      * Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index
      * Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub
      * Set PLUCENESEARCHATTACHMENTSONLY = 1
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
      * Set PLUCENEINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
      * Set PLUCENEDEBUG = 1
  • ATTENTION! Remember to edit the file plucene/bin/LocalLib.cfg and set twikiLibPath according to your configuration
  • Test if the installation was successful:
    • change the working directory to plucene/bin under your twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • type a query and check the results
  • Create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Changed:
<
<
Add-on Version: 20 Mar 2006 (v2.100 for Dakar, v1.400 for Cairo)
>
>
Add-on Version: 21 Mar 2006 (v2.100 for Dakar, v1.400 for Cairo)
 
Change History:
<-- versions below in reverse order -->
 
Changed:
<
<
20 Mar 2006: TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved
>
>
21 Mar 2006: TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved
 
03 Mar 2006: TWikiDakar (v2.000) & TWikiCairo (v1.300)
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: CPAN:Bit::Vector::Minimal, CPAN:IO::Scalar, CPAN:Lingua::GL::Stemmer, CPAN:Lingua::PT::Stemmer, CPAN:Lingua::Stem::Fr, CPAN:Lingua::Stem::It, CPAN:Lingua::Stem::Ru, CPAN:Lingua::Stem::Snowball::Da, CPAN:Lingua::Stem::Snowball::No, CPAN:Lingua::Stem::Snowball::Se, CPAN:Text::German, CPAN:Lingua::Stem::En, CPAN:Tie::Array::Sorted, CPAN:Time::Piece, CPAN:Plucene, CPAN:Plucene::SearchEngine
Other Dependencies: xpdf (pdftotext) and additional 3rd party tools for text extracting
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal
Changed:
<
<
-- TWiki:Main/JoanMVigo - 20 Mar 2006
>
>
-- TWiki:Main/JoanMVigo - 21 Mar 2006
 

Revision 8 - 2006-03-20 - TWikiGuest

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web.

Some time ago I found Plucene, a Perl port of the Java library Lucene. This add-on is intended to be a topic and attachment search engine with Plucene as its backend.

I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this add-on has a release for each major TWiki version, namely Cairo and Dakar.

Usage

Indexing with plucindex

The plucindex script indexes all the public webs. It uses TWiki::Func code to retrieve the list of available webs and the topic list of each web. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed as well (see below for more details).

For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take down your server for maintenance. To prevent browser access, the script has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, in the same way the old mailnotify script worked. In addition, a .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether any topics have been updated since the last execution. The most recently updated topics are removed from the index and then reindexed (the same goes for their attachments).

This script should be run from an hourly crontab job. Like plucindex, it has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Attachment file types to be indexed

Changed:
<
<
PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:
>
>
PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. The dot before each extension is required. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:
 
   * Plucene settings
Changed:
<
<
    • Set PLUCENEINDEXEXTENSIONS = pdf, html, txt, doc
>
>
    • Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc
 
Changed:
<
<
or whatever extensions you want. Remember that you may need to install additional CPAN:Plucene::SearchEngine::Index libraries and the required third party tools, such as antiword or xlhtml.
>
>
or whatever extensions you want. By default, Plucene comes with PDF, HTML and TXT file support. However, PDF support needs additional software to be installed (see the installation instructions).
 
Added:
>
>
You may need additional CPAN:Plucene::SearchEngine::Index libraries and additional third party tools, such as antiword or xlhtml, which provide the required text extraction capabilities.
 You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved:

  • you can use + for "and" and - for "and not"
  • you can limit the search to the topic body or attachment body by using the prefix text:, or by just typing the search string
  • to search on meta data, use the prefix field: where field is the meta data name (like author)
  • to search on a form field, use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which the PluceneSearch topic uses to repeat a search displaying only attachments

Query examples (just type them into your site's PluceneSearch topic)

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • exclude webs that are not useful from the index (with a new preference PLUCENEINDEXSKIPWEBS)
    • all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown in the results
  • exclude annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
Changed:
<
<
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO=JohnSmith in their WebPreferences.
>
>
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith returns the WebHome topics of the webs that have Set CONTACTINFO = JohnSmith in their WebPreferences.
 
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

To request further features, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Search form

The following form submits text to the plucsearch script. The installation instructions are detailed below.

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

Changed:
<
<
  • Once you have compiled and installed all the requirements
>
>
  • You can install Plucene and its dependencies by running:
Added:
>
>
    • perl -MCPAN -e "install Plucene"
    • perl -MCPAN -e "install Plucene::SearchEngine"
  • Install third party tools, like xpdf which provides pdftotext
 
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    | File: | Description: |
    | bin/plucsearch | script that searches the index files |
    | data/TWiki/PluceneSearch.txt | Plucene search topic |
    | data/TWiki/PluceneSearch.txt,v | Plucene search topic repository |
    | data/TWiki/SearchEnginePluceneAddOn.txt | Add-on topic |
    | data/TWiki/SearchEnginePluceneAddOn.txt,v | Add-on topic repository |
    | templates/plucsearch.pattern.tmpl | template used by new search script for the pattern skin |
    | plucene/bin/LocalLib.cfg | this file is required and should be modified according to the absolute twiki/lib path of your installation |
    | plucene/bin/plucindex | script that indexes all topics and PDF/HTML/TXT attachments |
    | plucene/bin/plucupdate | script that uses web's .changes files to update the index |
    | plucene/index/ | directory for index files to be stored |
    | plucene/logs/ | the index and update logs will be written here - admin should monitor this folder |
Changed:
<
<
>
>
 
   * Plucene settings
Changed:
<
<
    • Set PLUCENEINDEXEXTENSIONS = pdf, htm, html, txt, doc
    • Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index or whatever path your index folder is located
    • Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub or whatever path your pub folder is located
>
>
    • Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc
    • Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index
    • Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub
 
    • Set PLUCENESEARCHATTACHMENTSONLY = 1
    • Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
    • Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
    • Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
    • Set PLUCENEINDEXSKIPATTACHMENTS = AnAttachment.txt, OtherAttachment.pdf
    • Set PLUCENEDEBUG = 1
Changed:
<
<
  • Remember to edit the file LocalLib.cfg and set twikiLibPath according to your configuration
>
>
  • ATTENTION! Remember to edit the file plucene/bin/LocalLib.cfg and set twikiLibPath according to your configuration
 
  • Test if the installation was successful:
    • change the working directory to plucene/bin under your twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • type a query and check the results
  • Create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Changed:
<
<
Add-on Version: 03 Mar 2006 (v2.000 for Dakar, v1.300 for Cairo)
>
>
Add-on Version: 20 Mar 2006 (v2.100 for Dakar, v1.400 for Cairo)
 
Change History:
<-- versions below in reverse order -->
 
Changed:
<
<
03 Mar 2006: TWikiDakar release compatible version (v2.000)
03 Mar 2006: TWikiCairo release compatible version (v1.300)
>
>
20 Mar 2006: TWikiDakar (v2.100) & TWikiCairo (v1.400) - Update index issue solved
03 Mar 2006: TWikiDakar (v2.000) & TWikiCairo (v1.300)
 
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
Changed:
<
<
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
>
>
CPAN Dependencies: CPAN:Bit::Vector::Minimal, CPAN:IO::Scalar, CPAN:Lingua::GL::Stemmer, CPAN:Lingua::PT::Stemmer, CPAN:Lingua::Stem::Fr, CPAN:Lingua::Stem::It, CPAN:Lingua::Stem::Ru, CPAN:Lingua::Stem::Snowball::Da, CPAN:Lingua::Stem::Snowball::No, CPAN:Lingua::Stem::Snowball::Se, CPAN:Text::German, CPAN:Lingua::Stem::En, CPAN:Tie::Array::Sorted, CPAN:Time::Piece, CPAN:Plucene, CPAN:Plucene::SearchEngine
Other Dependencies: xpdf (pdftotext) and additional 3rd party tools for text extracting
 
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal
Changed:
<
<
-- TWiki:Main/JoanMVigo - 02 Mar 2006
>
>
-- TWiki:Main/JoanMVigo - 20 Mar 2006
Added:
>
>
 

Revision 7 - 2006-03-02 - TWikiGuest

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web.

Some time ago I found Plucene, a Perl port of the Java library Lucene. This add-on is intended to be a topic and attachment search engine with Plucene as its backend.

I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this add-on has a release for each major TWiki version, namely Cairo and Dakar.

Usage

Indexing with plucindex

The plucindex script indexes all the public webs. It uses TWiki::Func code to retrieve the list of available webs and the topic list of each web. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed as well (see below for more details).

For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take down your server for maintenance. To prevent browser access, the script has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, in the same way the old mailnotify script worked. In addition, a .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether any topics have been updated since the last execution. The most recently updated topics are removed from the index and then reindexed (the same goes for their attachments).

This script should be run from an hourly crontab job. Like plucindex, it has been placed outside the public bin folder.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Attachment file types to be indexed

PDF, HTML and text attachments are indexed by default. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = pdf, html, txt, doc
or whatever extensions you want. Remember that you may need to install additional CPAN:Plucene::SearchEngine::Index libraries and the required third party tools, such as antiword or xlhtml.

You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.
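
As a rough illustration of how such a comma-separated preference value can be turned into an attachment filter, here is a minimal Python sketch. The function names parse_extensions and should_index are hypothetical, and the case-insensitive matching and tolerance of a leading dot are assumptions of this example; the add-on's Perl code may behave differently.

```python
def parse_extensions(setting):
    """Split a comma-separated preference value into lowercase extensions."""
    return [ext.strip().lstrip(".").lower() for ext in setting.split(",") if ext.strip()]


def should_index(filename, extensions):
    """True when the attachment's extension is in the configured list."""
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in extensions


extensions = parse_extensions("pdf, html, txt, doc")
print(should_index("Manual.PDF", extensions))
print(should_index("photo.jpg", extensions))
```

Note that an extension listed here is only useful if a matching Plucene::SearchEngine::Index library (and any external converter it needs) is installed.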

Searching with plucsearch

The plucsearch script uses the template plucsearch.tmpl (which can easily be adapted to your site skin) or plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved:

  • you can use + for AND and - for AND NOT
  • you can limit the search to the topic body or attachment body by using the prefix text: or by just typing the search string
  • if you want to search using some meta data, use the prefix field: where field is the meta data name (like author)
  • if you want to search using some form field, use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which is used in the PluceneSearch topic to repeat the search displaying only attachments

Query examples (just type them in your site's PluceneSearch topic):

  • text:plucene (searches for plucene in topic/attachment text)
  • plucene (same as above)
  • author:JoanMVigo (searches for topics/attachments authored by this author)
  • TopicClassification:ItemToDo (searches for topics with a form field named TopicClassification with value ItemToDo)
  • +perl -type:pdf +attachment:yes (searches only attachments containing perl in their text, excluding PDF files)
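
To make the reading of these examples concrete, the following Python sketch splits a query string into clauses. It is not Plucene's query parser, only a simplified illustration of the rules listed above; the function name parse_query and the labels required, excluded and optional are hypothetical.

```python
def parse_query(query):
    """Split a query into (modifier, field, term) clauses."""
    clauses = []
    for token in query.split():
        modifier = "optional"
        if token[0] == "+":
            modifier, token = "required", token[1:]
        elif token[0] == "-":
            modifier, token = "excluded", token[1:]
        # A bare term is treated as a search in the default field "text".
        field, sep, term = token.partition(":")
        if not sep:
            field, term = "text", token
        clauses.append((modifier, field, term))
    return clauses


for clause in parse_query("+perl -type:pdf +attachment:yes"):
    print(clause)
```

Read this way, the last example requires perl in the text, requires the attachment:yes marker, and excludes anything whose type field is pdf.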

To suggest searching improvements, please read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • skip unneeded webs when indexing (with a new preference PLUCENEINDEXSKIPWEBS)
Added:
>
>
    • all other webs are indexed; however, if a web has Set NOSEARCHALL = on in its WebPreferences, no topic from that web is shown when displaying results
 
  • skip annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index web variables (with a new preference PLUCENEINDEXVARIABLES). For example, if it is set to CONTACTINFO, a search for CONTACTINFO:JohnSmith will return the WebHome topics of the webs that have Set CONTACTINFO=JohnSmith in their WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.
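
The two skip preferences can be pictured with the short Python sketch below. It assumes that both values are comma-separated lists and that attachments are named as Web.Topic.AttachmentName, as in the sample settings given in the installation instructions; the function names are hypothetical and the add-on's own matching rules may differ.

```python
def parse_list(setting):
    """Split a comma-separated preference value into trimmed items."""
    return [item.strip() for item in setting.split(",") if item.strip()]


def webs_to_index(all_webs, skip_webs_setting):
    """Drop the webs named in PLUCENEINDEXSKIPWEBS, keeping the original order."""
    skip = set(parse_list(skip_webs_setting))
    return [web for web in all_webs if web not in skip]


def skip_attachment(web, topic, attachment, skip_setting):
    """True when Web.Topic.Attachment is listed in PLUCENEINDEXSKIPATTACHMENTS."""
    return f"{web}.{topic}.{attachment}" in parse_list(skip_setting)


print(webs_to_index(["Main", "Sandbox", "Trash", "TWiki"], "Trash, Sandbox"))
```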

To request further features, please read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Added:
>
>

Search form

The following form submits text to the plucsearch script. The installation instructions are detailed below.
 

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • First compile and install all the requirements (see the dependencies listed under Add-On Info below)
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    | File: | Description: |
    | bin/plucsearch | script that searches the index files |
    | data/TWiki/PluceneSearch.txt | Plucene search topic |
    | data/TWiki/PluceneSearch.txt,v | Plucene search topic repository |
    | data/TWiki/SearchEnginePluceneAddOn.txt | Add-on topic |
    | data/TWiki/SearchEnginePluceneAddOn.txt,v | Add-on topic repository |
    | templates/plucsearch.pattern.tmpl | template used by new search script for the pattern skin |
    | plucene/bin/LocalLib.cfg | this file is required and should be modified to match the absolute path of twiki/lib in your installation |
    | plucene/bin/plucindex | script that indexes all topics and PDF/HTML/TXT attachments |
    | plucene/bin/plucupdate | script that uses web's .changes files to update the index |
    | plucene/index/ | directory for index files to be stored |
    | plucene/logs/ | the index and update logs will be written here - admin should monitor this folder |

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = pdf, htm, html, txt, doc
      * Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index _or whatever path your index folder is located_
      * Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub _or whatever path your pub folder is located_
      * Set PLUCENESEARCHATTACHMENTSONLY = 1
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
      * Set PLUCENEINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
      * Set PLUCENEDEBUG = 1
  • Remember to edit the file LocalLib.cfg and modify twikiLibPath according to your configuration
  • Test if the installation was successful:
    • change the working directory to the plucene/bin directory of your twiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • type a query and check the results
  • Finally, create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Add-on Version: 03 Mar 2006 (v2.000 for Dakar, v1.300 for Cairo)
Change History:
<-- versions below in reverse order -->
 
03 Mar 2006: TWikiDakar release compatible version (v2.000)
03 Mar 2006: TWikiCairo release compatible version (v1.300)
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

-- TWiki:Main/JoanMVigo - 02 Mar 2006

Deleted:
<
<
 

Revision 6 2006-03-02 - TWikiGuest

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

Changed:
<
<
Time ago I found Plucene, which is a Perl port of the java library Lucene. So this plugin/addon intends to be a new search engine, with Plucene as its backend.
>
>
Time ago I found Plucene, which is a Perl port of the java library Lucene. So this plugin/addon intends to be a topic/attachment search engine, with Plucene as its backend.
  I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.
Changed:
<
<
Note that this plugin has a release for each TWiki major version, Cairo and Dakar.
>
>
Note that this plugin has a release for each TWiki major version, namely Cairo and Dakar.
 

Usage

Indexing with plucindex

The plucindex script indexes all the public webs, and it uses some TWiki::Func code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (see below for more details).

By now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly crontab job to create the index files again, or maybe execute it manually when you take down your server for maintenance tasks. To prevent browser access, it has been placed out of the public bin folder.

Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses the web's .changes files to know about topic modifications, in a way such old mailnotify worked. Also, a .plucupdate file is used on each web directory storing the last timestamp the script was run on it. So when this script is executed, first checks if there are any topic updates since last execution. The most recent topic updates are removed from the index and then reindexed again (the same goes for attachments).

This script should be executed by an hourly crontab. As before, this script has been placed out of the public bin folder.

Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Attachment file types to be indexed

All the PDF, HTML and text attachments are also indexed by default. If you want to override this setting you can use a TWiki preference PLUCENEINDEXEXTENSIONS. You can copy & paste the next lines in your TWiki.TWikiPreferences topic

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = pdf, html, txt, doc
or whatever extensions you want. Remember that you may need additional CPAN:Plucene::SearchEngine::Index libraries and install required third party tools such as antiword or xlhtml.

You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved

  • you can use + for and and - for and not
  • you can limit to the topic body or attachment body, using the prefix text: or just type the search string
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name (like author)
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which is used in the PluceneSearch topic to search again only displaying attachments

Query examples (just type it in your PluceneSearch site topic)

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files

Please, to suggest searching improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • skip unuseful webs from the index (with a new preference PLUCENEINDEXSKIPWEBS)
  • skip annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index variables for web (with a new preference PLUCENEINDEXVARIABLES). For example, if set to CONTACTINFO, a search for CONTACTINFO:JohnSmith will provide the WebHome topics of the webs which have Set CONTACTINFO=JohnSmith in its WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

Please, to request further features read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucsearch script that searches the index files
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/TWiki/SearchEnginePluceneAddOn.txt Add-on topic
    data/TWiki/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    templates/plucsearch.pattern.tmpl template used by new search script for the pattern skin
Deleted:
<
<
templates/plucsearch.tmpl template used by new search script for the standard skin
 
plucene/bin/LocalLib.cfg this file is required and should be modified to match the absolute path of twiki/lib in your installation
plucene/bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
plucene/bin/plucupdate script that uses web's .changes files to update the index
plucene/index/ directory for index files to be stored
plucene/logs/ the index and update logs will be written here - admin should monitor this folder

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = pdf, htm, html, txt, doc
      * Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index _or whatever path your index folder is located_
      * Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub _or whatever path your pub folder is located_
      * Set PLUCENESEARCHATTACHMENTSONLY = 1
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
Changed:
<
<
>
>
 
    • Set PLUCENEDEBUG = 1
Added:
>
>
  • Remember to edit the file LocalLib.cfg and modify twikiLibPath accordingly to your configuration
 
  • Test if the installation was successful:
    • change the working directory to the plucene/bin twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Just create a new hourly crontab entry for the plucene/bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Add-on Version: 03 Mar 2006 (v2.000 for Dakar, v1.300 for Cairo)
Change History:
<-- versions below in reverse order -->
 
03 Mar 2006: TWikiDakar release compatible version (v2.000)
03 Mar 2006: TWikiCairo release compatible version (v1.300)
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal
Changed:
<
<
Related Topic: TWikiAddOns
>
>
-- TWiki:Main/JoanMVigo - 02 Mar 2006
Deleted:
<
<
-- TWiki:Main/JoanMVigo - 03 Mar 2006
 

Revision 5 2006-02-28 - TWikiGuest

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

Changed:
<
<
>
>
Deleted:
<
<
 
Changed:
<
<
I'm not a Perl guru, however I found Plucene, which is a Perl port of the java library Lucene, so I tried to implement a new search engine, using Plucene as its backend.
>
>
Time ago I found Plucene, which is a Perl port of the java library Lucene. So this plugin/addon intends to be a new search engine, with Plucene as its backend.
 
Added:
>
>
I would like to thank TWiki:Main.SopanShewale for his many suggestions and contributions.

Note that this plugin has a release for each TWiki major version, Cairo and Dakar.

 

Usage

Indexing with plucindex

Changed:
<
<
The plucindex script indexes all the content of your data folder, and it uses some TWiki code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (see below for more details).
>
>
The plucindex script indexes all the public webs, and it uses some TWiki::Func code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (see below for more details).
 
Changed:
<
<
By now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly crontab job to create the index files again, or maybe execute it manually when you take down your server for maintenance tasks. It should not be invoked by browser.
>
>
By now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly crontab job to create the index files again, or maybe execute it manually when you take down your server for maintenance tasks. To prevent browser access, it has been placed out of the public bin folder.
  Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev
Deleted:
<
<

Searching with plucsearch

The plucsearch script uses one of the templates plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with plucsearch script.

However, the query syntax is quite different:

  • you can use and, or
  • if you want to search inside the topic body, you should use the prefix text: or just type the search string
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results

Query examples (just type it in your PluceneSearch site topic)

  • text:plucene
  • plucene
  • author:JoanMVigo
  • TopicClassification:ItemToDo
  • type:pdf and learning

Please, to suggest searching improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

 

Updating with plucupdate

Changed:
<
<
The plucupdate script uses the web's .changes files to know about topic modifications, in a way such mailnotify works. Also, a .plucupdate file is used on each web directory storing the last timestamp the script was run on it. So when this script is executed, first checks if there are any topic updates since last execution. The most recent topic updates are removed from the index and then reindexed again (the same goes for attachments).
>
>
The plucupdate script uses the web's .changes files to know about topic modifications, in a way such old mailnotify worked. Also, a .plucupdate file is used on each web directory storing the last timestamp the script was run on it. So when this script is executed, first checks if there are any topic updates since last execution. The most recent topic updates are removed from the index and then reindexed again (the same goes for attachments).
 
Changed:
<
<
This script should be executed by an hourly crontab. It should not be invoked by browser.
>
>
This script should be executed by an hourly crontab. As before, this script has been placed out of the public bin folder.
  Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Attachment file types to be indexed

All the PDF, HTML and text attachments are also indexed by default. If you want to override this setting you can use a TWiki preference PLUCENEINDEXEXTENSIONS. You can copy & paste the next lines in your TWiki.TWikiPreferences topic

Changed:
<
<
  • Plucene settings
    • Set PLUCENEINDEXEXTENSIONS = .pdf,.html,.txt,.doc
>
>
  • Plucene settings
    • Set PLUCENEINDEXEXTENSIONS = pdf, html, txt, doc
  or whatever extensions you want. Remember that you may need additional CPAN:Plucene::SearchEngine::Index libraries and install required third party tools such as antiword or xlhtml.
Changed:
<
<
You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks to TWiki:Main/SopanShewale for his contributions.
>
>
You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks again to TWiki:Main/SopanShewale for his contributions.
 
Added:
>
>

Searching with plucsearch

The plucsearch script uses a template plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with the plucsearch script.

The query syntax has been improved

  • you can use + for and and - for and not
  • you can limit to the topic body or attachment body, using the prefix text: or just type the search string
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name (like author)
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results (like type:pdf)
  • attachments also have a special field, attachment:yes, which is used in the PluceneSearch topic to search again only displaying attachments

Query examples (just type it in your PluceneSearch site topic)

  • text:plucene searches for plucene in topic/attachment text
  • plucene as above
  • author:JoanMVigo searches for topics/attachments authored by this author
  • TopicClassification:ItemToDo searches for topics with a form field named TopicClassification with value ItemToDo
  • +perl -type:pdf +attachment:yes searches for attachments only with perl as text, excluding PDF files

Please, to suggest searching improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Other features

This new version provides some extra functionality:

  • skip unuseful webs from the index (with a new preference PLUCENEINDEXSKIPWEBS)
  • skip annoying or unindexable attachments from the index (with a new preference PLUCENEINDEXSKIPATTACHMENTS)
  • index variables for web (with a new preference PLUCENEINDEXVARIABLES). For example, if set to CONTACTINFO, a search for CONTACTINFO:JohnSmith will provide the WebHome topics of the webs which have Set CONTACTINFO=JohnSmith in its WebPreferences.
  • when displaying the search results, show an option for displaying only attachments if PLUCENESEARCHATTACHMENTSONLY is enabled. You can set PLUCENESEARCHATTACHMENTSONLYLABEL to a text or an image.

Please, to request further features read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

 

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

Changed:
<
<
  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    bin/plucupdate script that uses web's .changes files to update the index
    bin/plucsearch script that searches the index files
    templates/plucsearch.pattern.tmpl template used by new search script for the pattern skin
    templates/plucsearch.tmpl template used by new search script for the standard skin
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
    data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    index/ directory for index files to be stored
  • ATTENTION! Now the $idxpath variable is loaded with the new TWiki preference PLUCENEINDEXPATH value, so you should add to your TWiki.TWikiPreferences topic the next text
>
>
  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucsearch script that searches the index files
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/TWiki/SearchEnginePluceneAddOn.txt Add-on topic
    data/TWiki/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    templates/plucsearch.pattern.tmpl template used by new search script for the pattern skin
    templates/plucsearch.tmpl template used by new search script for the standard skin
    plucene/bin/LocalLib.cfg this file is required and should be modified to match the absolute path of twiki/lib in your installation
    plucene/bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    plucene/bin/plucupdate script that uses web's .changes files to update the index
    plucene/index/ directory for index files to be stored
Added:
>
>
plucene/logs/ the index and update logs will be written here - admin should monitor this folder

 
Changed:
<
<
  • Plucene settings
    • Set PLUCENEINDEXPATH = /srv/www/personal/index or whatever path your index folder is located
>
>
  • Plucene settings
    • Set PLUCENEINDEXEXTENSIONS = pdf, htm, html, txt, doc
Added:
>
>
    • Set PLUCENEINDEXPATH = /srv/www/twiki/plucene/index or whatever path your index folder is located
    • Set PLUCENEATTACHMENTSPATH = /srv/www/twiki/pub or whatever path your pub folder is located
    • Set PLUCENESEARCHATTACHMENTSONLY = 1
    • Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
    • Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
    • Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
    • Set PLUCENEINDEXSKIPATTACHMENTS = JSCalendarContrib.simple-1.html, JSCalendarContrib.simple-3.html
    • Set PLUCENEDEBUG = 1
 
Changed:
<
<
  • Test if the installation was successful:
    • change the working directory to your bin twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Just create a new hourly crontab entry for the bin/plucupdate script.
>
>
  • Test if the installation was successful:
    • change the working directory to the plucene/bin twiki installation directory
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Just create a new hourly crontab entry for the plucene/bin/plucupdate script.
 

Add-On Info

Changed:
<
<
Add-on Author: TWiki:Main/JoanMVigo
Add-on Version: 26 Nov 2004 (v1.200)
>
>
Add-on Author: TWiki:Main/SopanShewale, TWiki:Main/JoanMVigo
Add-on Version: 03 Mar 2006 (v2.000 for Dakar, v1.300 for Cairo)
 
Change History:
<-- versions below in reverse order -->
 
Added:
>
>
03 Mar 2006: TWikiDakar release compatible version (v2.000)
03 Mar 2006: TWikiCairo release compatible version (v1.300)
 
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

Changed:
<
<
-- TWiki:Main/JoanMVigo - 15 Dec 2004
>
>
-- TWiki:Main/JoanMVigo - 03 Mar 2006
 

Revision 4 2004-12-15 - jmv

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

I'm not a Perl guru, however I found Plucene, which is a Perl port of the java library Lucene, so I tried to implement a new search engine, using Plucene as its backend.

Usage

Indexing with plucindex

Changed:
<
<
The plucindex script indexes all the content of your data folder, and it uses some TWiki code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (only PDF/HTML/TXT).
>
>
The plucindex script indexes all the content of your data folder, and it uses some TWiki code to retrieve the list of available webs and to retrieve their topic list. For each topic, the meta data is inspected and indexed, as the text body. Also, if the topic has attachments, those are indexed (see below for more details).
  By now, you should run this script manually after installation to create the index files used by plucsearch. If you want, you can also schedule a weekly or monthly crontab job to create the index files again, or maybe execute it manually when you take down your server for maintenance tasks. It should not be invoked by browser.

Please, to suggest indexing improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Searching with plucsearch

Changed:
<
<
The plucsearch script uses the plucsearch.tmpl template that can be adapted to your site skin easily. I've also attached a PluceneSearch topic with a form ready to use with plucsearch script.
>
>
The plucsearch script uses one of the templates plucsearch.tmpl (that can be adapted to your site skin easily) or the plucsearch.pattern.tmpl (if you use the pattern skin). There is also a PluceneSearch topic with a form ready to use with plucsearch script.
  However, the query syntax is quite different:
  • you can use and, or
Changed:
<
<
  • if you want to search inside the topic body, you should use the prefix text:
>
>
  • if you want to search inside the topic body, you should use the prefix text: or just type the search string
 
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name
Added:
>
>
  • plucene adds the type field for the indexed attachments, so you can use it to filter your results
  Query examples (just type it in your PluceneSearch site topic)
  • text:plucene
Added:
>
>
  • plucene
 
  • author:JoanMVigo
Added:
>
>
  • TopicClassification:ItemToDo
  • type:pdf and learning
  Please, to suggest searching improvements read/post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses the web's .changes files to know about topic modifications, in a way such mailnotify works. Also, a .plucupdate file is used on each web directory storing the last timestamp the script was run on it. So when this script is executed, first checks if there are any topic updates since last execution. The most recent topic updates are removed from the index and then reindexed again (the same goes for attachments).

This script should be run by an hourly crontab job. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Added:
>
>

Attachment file types to be indexed

By default, all PDF, HTML and text attachments are also indexed. To override this setting, use the TWiki preference PLUCENEINDEXEXTENSIONS. You can copy and paste the following lines into your TWiki.TWikiPreferences topic:

   * Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf,.html,.txt,.doc
or whatever extensions you want. Remember that you may need additional CPAN:Plucene::SearchEngine::Index libraries, and that you must install any required third-party tools such as antiword or xlhtml.

You can find/post additional CPAN:Plucene::SearchEngine::Index libraries for many file types at TWiki:Plugins/SearchEnginePluceneAddOnDev. Thanks to TWiki:Main/SopanShewale for his contributions.
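How a comma-separated extension list such as PLUCENEINDEXEXTENSIONS might be applied to attachment names can be sketched as follows. This is an illustration, not the add-on's code: the names `parse_extensions`, `should_index` and `DEFAULT_EXTENSIONS` are hypothetical, the default value is only a guess at how "PDF, HTML and text" is spelled, and case-insensitive matching is an assumption.

```python
import os

# Assumed spelling of the default (PDF, HTML and text attachments).
DEFAULT_EXTENSIONS = ".pdf,.html,.txt"

def parse_extensions(setting):
    """Turn a comma-separated preference value into a set of lowercase extensions."""
    return {ext.strip().lower() for ext in setting.split(",") if ext.strip()}

def should_index(filename, setting=DEFAULT_EXTENSIONS):
    """True if the attachment's extension appears in the setting (case-insensitive)."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in parse_extensions(setting)

print(should_index("manual.pdf"))
print(should_index("notes.doc", ".pdf,.html,.txt,.doc"))
```

Listing an extension only selects the file for indexing; extracting its text still depends on the matching library and external tool being installed.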

 

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    bin/plucupdate script that uses web's .changes files to update the index
    bin/plucsearch script that searches the index files
Changed:
<
<
templates/plucsearch.pattern.tmpl template used by new search script
>
>
templates/plucsearch.pattern.tmpl template used by new search script for the pattern skin
Added:
>
>
templates/plucsearch.tmpl template used by new search script for the standard skin
 
data/TWiki/PluceneSearch.txt Plucene search topic
data/TWiki/PluceneSearch.txt,v Plucene search topic repository
data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
index/ directory for index files to be stored
Changed:
<
<
  • All three scripts must be edited: change the $idxpath variable to point to the new index folder in your TWiki installation directory.
>
>
  • ATTENTION! The $idxpath variable is now loaded from the new TWiki preference PLUCENEINDEXPATH, so you should add the following text to your TWiki.TWikiPreferences topic:
Added:
>
>
   * Plucene settings
      * Set PLUCENEINDEXPATH = /srv/www/personal/index
or whatever path your index folder is located at.
 
  • Test if the installation was successful:
    • change the working directory to the bin directory of your TWiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Finally, create a new hourly crontab entry for the bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/JoanMVigo
Add-on Version: 26 Nov 2004 (v1.200)
Change History:
<-- versions below in reverse order -->
 
Added:
>
>
15 Dec 2004: Use of TWiki preferences for indexing path & attachment extensions (v1.210)
 
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

Changed:
<
<
-- TWiki:Main/JoanMVigo - 26 Nov 2004
>
>
-- TWiki:Main/JoanMVigo - 15 Dec 2004
 

Revision 3 2004-11-26 - jmv

Changed:
<
<

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

I'm not a Perl guru, but I found Plucene, which is a Perl port of the Java library Lucene, so I tried to implement a new search engine using Plucene as its backend.

Usage

Indexing with plucindex

The plucindex script indexes all the content of your data folder, using some TWiki code to retrieve the list of available webs and their topic lists. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed too (only PDF/HTML/TXT).

For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take your server down for maintenance. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Searching with plucsearch

The plucsearch script uses the plucsearch.tmpl template, which can easily be adapted to your site skin. I've also attached a PluceneSearch topic with a form ready to use with the plucsearch script.

However, the query syntax is quite different:

  • you can use and, or
  • if you want to search inside the topic body, you should use the prefix text:
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name

Query examples (just type them into your PluceneSearch site topic)

  • text:plucene
  • author:JoanMVigo

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, much as mailnotify does. A .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether there have been any topic updates since the last execution. The updated topics are then removed from the index and reindexed (the same goes for attachments).

This script should be run by an hourly crontab job. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    bin/plucupdate script that uses web's .changes files to update the index
    bin/plucsearch script that searches the index files
    templates/plucsearch.tmpl template used by new search script
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
    data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    index/ directory for index files to be stored
  • All three scripts must be edited: change the $idxpath variable to point to the new index folder in your TWiki installation directory.
  • Test if the installation was successful:
    • change the working directory to the bin directory of your TWiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Finally, create a new hourly crontab entry for the bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/JoanMVigo
Add-on Version: 23 Nov 2004 (v1.100)
Change History:
<-- versions below in reverse order -->
 
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

-- TWiki:Main/JoanMVigo - 23 Nov 2004

>
>

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

I'm not a Perl guru, but I found Plucene, which is a Perl port of the Java library Lucene, so I tried to implement a new search engine using Plucene as its backend.

Usage

Indexing with plucindex

The plucindex script indexes all the content of your data folder, using some TWiki code to retrieve the list of available webs and their topic lists. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed too (only PDF/HTML/TXT).

For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take your server down for maintenance. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Searching with plucsearch

The plucsearch script uses the plucsearch.tmpl template, which can easily be adapted to your site skin. I've also attached a PluceneSearch topic with a form ready to use with the plucsearch script.

However, the query syntax is quite different:

  • you can use and, or
  • if you want to search inside the topic body, you should use the prefix text:
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name

Query examples (just type them into your PluceneSearch site topic)

  • text:plucene
  • author:JoanMVigo

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, much as mailnotify does. A .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether there have been any topic updates since the last execution. The updated topics are then removed from the index and reindexed (the same goes for attachments).

This script should be run by an hourly crontab job. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    bin/plucupdate script that uses web's .changes files to update the index
    bin/plucsearch script that searches the index files
    templates/plucsearch.pattern.tmpl template used by new search script
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
    data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    index/ directory for index files to be stored
  • All three scripts must be edited: change the $idxpath variable to point to the new index folder in your TWiki installation directory.
  • Test if the installation was successful:
    • change the working directory to the bin directory of your TWiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
  • Finally, create a new hourly crontab entry for the bin/plucupdate script.

Add-On Info

Add-on Author: TWiki:Main/JoanMVigo
Add-on Version: 26 Nov 2004 (v1.200)
Change History:
<-- versions below in reverse order -->
 
26 Nov 2004: TWikiCairo release compatible version (v1.200)
23 Nov 2004: Incremental version (v1.100)
18 Nov 2004: Initial version (v1.000)
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

Added:
>
>
-- TWiki:Main/JoanMVigo - 26 Nov 2004
 

Revision 2 2004-11-18 - jmv

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

Changed:
<
<
>
>
  I'm not a Perl guru, but I found Plucene, which is a Perl port of the Java library Lucene, so I tried to implement a new search engine using Plucene as its backend.

Usage

Indexing with plucindex

The plucindex script indexes all the content of your data folder, using some TWiki code to retrieve the list of available webs and their topic lists. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed too (only PDF/HTML/TXT).

Changed:
<
<
For now, you should run this script manually each time you want the index files to be updated, or add an hourly or daily crontab job to run it automatically. It should not be invoked from a browser.
>
>
For now, you should run this script manually after installation to create the index files used by plucsearch. You can also schedule a weekly or monthly crontab job to rebuild the index files, or run the script manually when you take your server down for maintenance. It should not be invoked from a browser.
  To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Searching with plucsearch

The plucsearch script uses the plucsearch.tmpl template, which can easily be adapted to your site skin. I've also attached a PluceneSearch topic with a form ready to use with the plucsearch script.

However, the query syntax is quite different:

  • you can use and, or
  • if you want to search inside the topic body, you should use the prefix text:
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name
Deleted:
<
<
 
  • if you want to search using some form field, you should use the prefix field: where field is the form's field name

Query examples (just type them into your PluceneSearch site topic)

  • text:plucene
  • author:JoanMVigo

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Added:
>
>

Updating with plucupdate

The plucupdate script uses each web's .changes file to find out about topic modifications, much as mailnotify does. A .plucupdate file in each web directory stores the timestamp of the last time the script was run on that web. When the script is executed, it first checks whether there have been any topic updates since the last execution. The updated topics are then removed from the index and reindexed (the same goes for attachments).

This script should be run by an hourly crontab job. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

 

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
Added:
>
>
bin/plucupdate script that uses web's .changes files to update the index
 
bin/plucsearch script that searches the index files
templates/plucsearch.tmpl template used by new search script
data/TWiki/PluceneSearch.txt Plucene search topic
data/TWiki/PluceneSearch.txt,v Plucene search topic repository
data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
index/ directory for index files to be stored
Changed:
<
<
  • Edit both scripts and change the $idxpath variable to point to the new index folder in your TWiki installation directory.
>
>
  • All three scripts must be edited: change the $idxpath variable to point to the new index folder in your TWiki installation directory.
 
  • Test if the installation was successful:
    • change the working directory to the bin directory of your TWiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results
Added:
>
>
  • Finally, create a new hourly crontab entry for the bin/plucupdate script.
 

Add-On Info

Add-on Author: TWiki:Main/JoanMVigo
Changed:
<
<
Add-on Version: 18 Nov 2004 (v1.000)
>
>
Add-on Version: 23 Nov 2004 (v1.100)
 
Change History:
<-- versions below in reverse order -->
 
Changed:
<
<
18 Nov 2004: Initial version
>
>
23 Nov 2004: Incremental version (v1.100)
Added:
>
>
18 Nov 2004: Initial version (v1.000)
 
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Changed:
<
<
Other Dependencies: other CPAN packages required by above dependencies
>
>
Other Dependencies: xpdf (pdftotext) and other CPAN packages required by above dependencies
 
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

Changed:
<
<
-- TWiki:Main/JoanMVigo - 18 Nov 2004
>
>
-- TWiki:Main/JoanMVigo - 23 Nov 2004
Deleted:
<
<
 

Revision 1 2004-11-18 - jmv

 

Plucene Search Engine Add-On

TWiki's original search engine is a simple yet powerful tool. However, it cannot search within attached documents. That has been discussed in many topics in the Codev web:

I'm not a Perl guru, but I found Plucene, which is a Perl port of the Java library Lucene, so I tried to implement a new search engine using Plucene as its backend.

Usage

Indexing with plucindex

The plucindex script indexes all the content of your data folder, using some TWiki code to retrieve the list of available webs and their topic lists. For each topic, the meta data is inspected and indexed, as is the text body. If the topic has attachments, those are indexed too (only PDF/HTML/TXT).

For now, you should run this script manually each time you want the index files to be updated, or add an hourly or daily crontab job to run it automatically. It should not be invoked from a browser.

To suggest indexing improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Searching with plucsearch

The plucsearch script uses the plucsearch.tmpl template, which can easily be adapted to your site skin. I've also attached a PluceneSearch topic with a form ready to use with the plucsearch script.

However, the query syntax is quite different:

  • you can use and, or
  • if you want to search inside the topic body, you should use the prefix text:
  • if you want to search using some meta data, you should use the prefix field: where field is the meta data name

  • if you want to search using some form field, you should use the prefix field: where field is the form's field name

Query examples (just type them into your PluceneSearch site topic)

  • text:plucene
  • author:JoanMVigo

To suggest searching improvements, please read or post to TWiki:Plugins/SearchEnginePluceneAddOnDev

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

  • Once you have compiled and installed all the requirements
  • Download the ZIP file from the Add-on Home (see below)
  • Unzip SearchEnginePluceneAddOn.zip in your twiki installation directory. Content:
    File: Description:
    bin/plucindex script that indexes all topics and PDF/HTML/TXT attachments
    bin/plucsearch script that searches the index files
    templates/plucsearch.tmpl template used by new search script
    data/TWiki/PluceneSearch.txt Plucene search topic
    data/TWiki/PluceneSearch.txt,v Plucene search topic repository
    data/Plugins/SearchEnginePluceneAddOn.txt Add-on topic
    data/Plugins/SearchEnginePluceneAddOn.txt,v Add-on topic repository
    index/ directory for index files to be stored
  • Edit both scripts and change the $idxpath variable to point to the new index folder in your TWiki installation directory.
  • Test if the installation was successful:
    • change the working directory to the bin directory of your TWiki installation
    • run ./plucindex
    • once finished, open a browser window and point it to the TWiki/PluceneSearch topic
    • just type a query and check the results

Add-On Info

Add-on Author: TWiki:Main/JoanMVigo
Add-on Version: 18 Nov 2004 (v1.000)
Change History:
<-- versions below in reverse order -->
 
18 Nov 2004: Initial version
CPAN Dependencies: Plucene 1.19, Plucene-SearchEngine-1.1
Other Dependencies: other CPAN packages required by above dependencies
Perl Version: Tested with 5.8.0
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOnAppraisal

Related Topic: TWikiAddOns

-- TWiki:Main/JoanMVigo - 18 Nov 2004

 
Note: Please contribute updates to this topic on TWiki.org at TWiki:TWiki.SearchEnginePluceneAddOn.