Building your own Google
I recently (two days ago) started experimenting with building my own search engine. The inspiration came from playing around with Google’s Co-op Custom Search Engine contraption.
The software I’m using to build my very own Google is Nutch, which is open source web-search software.
Nutch builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.
The tutorial I used to build my search engine (and it’s a great tutorial) can be found here. It guides you through the whole process of building the search engine: setting up the environment for Nutch (basically Java and Tomcat on an Apache server), installing Nutch, crawling and indexing your first site, searching your site, and even using regex URL filters to tell the search engine which URLs on a page to crawl and which to ignore.
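To give a feel for how regex URL filters decide what gets crawled, here is a small Python sketch. It is not Nutch's code (Nutch itself is written in Java), only an illustration of the idea as I understand it: each rule is a regular expression prefixed with + (crawl) or - (ignore), the first rule that matches a URL wins, and a URL that matches no rule is ignored. The rules, the example.com domain, and the function names parse_rules and should_crawl are all made up for this example.

```python
import re

# Hypothetical rules in a "+regex / -regex" style.
# example.com is a placeholder domain, not taken from the tutorial.
RULES_TEXT = r"""
# skip images, stylesheets and scripts
-\.(gif|jpg|png|css|js)$
# accept anything on example.com or its subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def should_crawl(url, rules):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule is ignored.
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.example.com/about.html",
        "http://www.example.com/logo.png",
        "http://other.org/",
    ):
        print(url, "->", "crawl" if should_crawl(url, rules) else "ignore")
```

Rule order matters in a scheme like this: the image rule sits above the domain rule, so a .png on example.com is ignored even though the domain itself is accepted.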
I’ve yet to be successful at crawling my site (or any site for that matter), but I will keep you posted on further developments.
In the meantime and for more information about Nutch, please see the Nutch wiki.


























