initial crawl call
/usr/local/share/nutch/bin/nutch crawl urls -dir ask23 -depth 15 -topN 10000
periodic call (driven by a cron job)
...
seed pages for the crawl process
/crawl/urls/urls
http://ask23.hfbk-hamburg.de/draft/
http://ask23.hfbk-hamburg.de/cgi-bin/wiki/wiki-ask23.pl
http://projekte.ask23.de/
http://www.hfbk.net/
http://lfbmedien.hfbk.net/
http://lern.hfbk-hamburg.de/
/usr/local/share/nutch-0.8.1/conf/crawl-urlfilter.txt
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# unless no use, don't parse rdf, pdf, .....
-\.(dat|rdf|txt|pdf)$
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# dont skip wiki pages
+(.*?)wiki-ask23.pl?.+
# skip metadata searches
-(.*?)ask23-suche1.pl?.+
-(.*?)ask23-suche.pl?.+
# skip sessions...
# ...
-(.*?)sessionid(.+?)
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# skip orig from archivsystem
-.*(orig)\..*
# both urls lead to the same site, which makes trouble crawling
-^http://www.hfbk-hamburg.de/.+
-^http://w3.hfbk-hamburg.de/.+
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*hfbk-hamburg.de/
+^http://([a-z0-9]*\.)*ask23.de/
+^http://([a-z0-9]*\.)*hfbk.net/
+^http://([a-z0-9]*\.)*hfbk.org/
# +^http://ask23.hfbk-hamburg.de/
# skip everything else
-.
#+.*
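the first-match rule described in the file header can be tried out offline. The following is only a minimal sketch in Python, not Nutch code: it assumes that each pattern is applied as an unanchored search and that Python's `re` module behaves like the Java regex engine for these patterns. The helper names `parse_rules` and `accepts` are made up for this illustration, and only a subset of the rules is included.

```python
import re

# A subset of the rules from crawl-urlfilter.txt, kept in file order.
RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
-\.(dat|rdf|txt|pdf)$
+(.*?)wiki-ask23.pl?.+
-(.*?)ask23-suche.pl?.+
-.*(/.+?)/.*?\1/.*?\1/
-^http://www.hfbk-hamburg.de/.+
+^http://([a-z0-9]*\.)*hfbk-hamburg.de/
-.
"""


def parse_rules(text):
    """Turn filter-file text into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """The first matching pattern decides; no match means the URL is ignored."""
    for include, pattern in rules:
        if pattern.search(url):  # assumption: unanchored search
            return include
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://ask23.hfbk-hamburg.de/draft/",
        "http://www.hfbk-hamburg.de/index.html",
        "http://ask23.hfbk-hamburg.de/a/b/a/b/a/b/",
    ):
        print(url, accepts(url, rules))
```

under these assumptions the first URL is accepted, the second is rejected by the www rule before the generic hfbk-hamburg.de rule is reached, and the third is rejected by the loop-breaking rule. This shows why the order of the lines in the file matters.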
note: recrawling uses a different config file than the crawl command: regex-urlfilter.txt instead of crawl-urlfilter.txt
see http://wiki.apache.org/nutch/IntranetRecrawl?highlight=%28recrawl%29
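since crawl and recrawl read their rules from different files, the two rule sets can drift apart unnoticed. A minimal sketch for comparing them, assuming both files use the '+'/'-' line format shown above. The function names `active_rules` and `rule_diff` are hypothetical, and the file contents are passed in as text so that no path has to be assumed.

```python
def active_rules(text):
    """Return the non-comment, non-blank rule lines of a filter file, in order."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def rule_diff(crawl_text, recrawl_text):
    """Return (rules only in the crawl file, rules only in the recrawl file)."""
    crawl = active_rules(crawl_text)
    recrawl = active_rules(recrawl_text)
    only_crawl = [rule for rule in crawl if rule not in recrawl]
    only_recrawl = [rule for rule in recrawl if rule not in crawl]
    return only_crawl, only_recrawl
```

this only compares which rules are present. Because the first matching pattern decides, two files with the same rules in a different order can still behave differently, so the order should be checked as well (for example with `active_rules(a) == active_rules(b)`).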
on ask23 (Debian) the paths are different:
/usr/local/share/nutch/bin/recrawl /var/lib/tomcat4/webapps/ROOT /data/db/nutch/crawl/ask23 15 31
/var/lib/tomcat4/webapps/ROOT/WEB-INF/classes/org/nutch