documentation of the full-text search in ask23 with nutch

version: nutch-0.8.1/
URL: http://suche.ask23.de

: 1.1. urls
: 1.2. crawl-urlfilter.txt

crawling

initial crawl invocation

  /usr/local/share/nutch/bin/nutch crawl urls -dir ask23 -depth 15 -topN 10000
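
a rough reading of the two limits, sketched in python. it assumes that -depth is the number of generate/fetch rounds and that -topN caps the pages fetched per round; the variable names are illustrative only, and the real number of fetched pages is usually far lower because the url filter and the link structure limit each round.

```python
# Values taken from the crawl invocation above.
depth = 15      # -depth: number of generate/fetch rounds (assumed meaning)
top_n = 10000   # -topN: maximum pages selected per round (assumed meaning)

# Upper bound on pages fetched by one full crawl run.
max_fetched = depth * top_n
print(max_fetched)  # 150000
```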

periodic invocation (driven by a cron job)

...

urls

seed pages for the crawl process

/crawl/urls/urls

 http://ask23.hfbk-hamburg.de/draft/
 http://ask23.hfbk-hamburg.de/cgi-bin/wiki/wiki-ask23.pl
 http://projekte.ask23.de/
 http://www.hfbk.net/
 http://lfbmedien.hfbk.net/
 http://lern.hfbk-hamburg.de/

crawl-urlfilter.txt

/usr/local/share/nutch-0.8.1/conf/crawl-urlfilter.txt

  # The url filter file used by the crawl command.

  # Better for intranet crawling.
  # Be sure to change MY.DOMAIN.NAME to your domain name.

  # Each non-comment, non-blank line contains a regular expression
  # prefixed by '+' or '-'.  The first matching pattern in the file
  # determines whether a URL is included or ignored.  If no pattern
  # matches, the URL is ignored.

  # skip file:, ftp:, & mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
  # not needed so far, so skip dat, rdf, txt, pdf too
  -\.(dat|rdf|txt|pdf)$

  # skip URLs containing certain characters as probable queries, etc.
  #-[?*!@=]

  # don't skip wiki pages
  +(.*?)wiki-ask23.pl?.+
  # skip metadata searches
  -(.*?)ask23-suche1.pl?.+
  -(.*?)ask23-suche.pl?.+
  # skip sessions...
  # ...
  -(.*?)sessionid(.+?)

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/.+?)/.*?\1/.*?\1/

  # skip originals ("orig") from the archive system
  -.*(orig)\..*

  # both hosts serve the same site, which causes trouble when crawling
  -^http://www.hfbk-hamburg.de/.+
  -^http://w3.hfbk-hamburg.de/.+
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*hfbk-hamburg.de/
  +^http://([a-z0-9]*\.)*ask23.de/
  +^http://([a-z0-9]*\.)*hfbk.net/
  +^http://([a-z0-9]*\.)*hfbk.org/
  # +^http://ask23.hfbk-hamburg.de/

  # skip everything else
  -.
  #+.*

 note: recrawling reads a different filter file, regex-urlfilter.txt; crawl-urlfilter.txt is only used by the crawl command
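
the "first matching pattern wins" rule from the file header can be sketched in python. this is not nutch code: python's `re` stands in for the java regex engine, patterns are assumed to be applied unanchored (search, not full match), only a subset of the rules above is included, and the function name `accepts` is made up for this sketch.

```python
import re

# A subset of the rules listed above, in file order: (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(dat|rdf|txt|pdf)$"),
    ("+", r"(.*?)wiki-ask23.pl?.+"),
    ("-", r"(.*?)ask23-suche.pl?.+"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("-", r"^http://www.hfbk-hamburg.de/.+"),
    ("+", r"^http://([a-z0-9]*\.)*hfbk-hamburg.de/"),
    ("-", r"."),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no pattern matched: the URL is ignored


print(accepts("http://ask23.hfbk-hamburg.de/draft/"))     # True
print(accepts("http://www.hfbk-hamburg.de/index.html"))   # False
print(accepts("http://example.org/"))                     # False
```

rule order matters: the wiki rule is listed before the host rules, so wiki urls are accepted as soon as it matches, and the two `-` host rules must come before the general `+` rule for hfbk-hamburg.de to have any effect.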

recrawl

see http://wiki.apache.org/nutch/IntranetRecrawl?highlight=%28recrawl%29

on ask23 (debian) the paths are different:

 /usr/local/share/nutch/bin/recrawl /var/lib/tomcat4/webapps/ROOT /data/db/nutch/crawl/ask23 15 31

frontend

/var/lib/tomcat4/webapps/ROOT/WEB-INF/classes/org/nutch