June 25, 2007 at 10:02 pm
· Filed under Open Source Software
OCR (optical character recognition) is a type of software designed to translate images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text. Put another way, OCR translates pictures of characters into a standard encoding scheme that represents them (e.g. ASCII or Unicode).
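To make the "pictures of characters into an encoding scheme" idea concrete, here is a toy sketch. It is not how Tesseract or any real OCR engine works: the 3x5 bitmaps, the three-letter alphabet and the function names are all made up for illustration, and the matching is a simple "closest template wins" comparison.

```python
# Toy illustration only: real OCR engines are far more elaborate than this.
# Each glyph is a hypothetical 3x5 bitmap, written as rows of '#' (ink) and '.' (blank).
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def pixel_distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(TEMPLATES[ch], glyph))


def image_to_text(glyphs):
    """Translate a sequence of glyph bitmaps into a machine-editable string."""
    return "".join(recognize(g) for g in glyphs)


# A "scanned" word; the first glyph has one stray pixel of noise.
scanned = [
    ("###", "##.", ".#.", ".#.", ".#."),  # a smudged T
    ("###", ".#.", ".#.", ".#.", "###"),  # I
    ("#..", "#..", "#..", "#..", "###"),  # L
    ("#..", "#..", "#..", "#..", "###"),  # L
]

text = image_to_text(scanned)
code_points = [ord(ch) for ch in text]  # the Unicode (here also ASCII) values
print(text, code_points)
```

The output of `image_to_text` is an ordinary string, so each recognised character ends up as a code point that any editor or search index can handle.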
Why am I blogging about OCR? Well, because Google has its finger in this open source pie.
OCR History
Tesseract is an OCR engine originally developed at HP Labs between 1985 and 1995. HP then decided to abandon OCR research and, for ten years, the software's development was frozen. In 2005, HP made Tesseract open source (Apache License), and Google, together with a research institute, has continued the development of the program.
Why is it important for Google to be involved in OCR?
OCR is useful for Google Book Search, and it could also be useful for Picasa or Image Search alongside an object recognition engine. And, if Google improves the software, it could be launched as a successful alternative to commercial applications.
Watch this space…
June 17, 2007 at 10:10 pm
· Filed under Open Source Software
If you recall, a few weeks back I posted about building your own Google. Well, I finally did it: my very own Google-style search engine is up and running.
It was not an easy job at all, and it was very frustrating at times, but very rewarding indeed!
There are still some issues that need to be ironed out. For instance, the cache links (among a few others) give an error message when you click on them. All in all, though, these are minor issues compared to the hurdles I jumped over to get this engine going.
I managed to spider the cnn.com website (only a few pages, as an experiment) and feed the results of the crawl into my search engine. Try searching for weather on CNN using my search engine and check out the results.
I will be experimenting further with Nutch, including deeper and multiple crawls, as well as fixing the odd bug or two that currently exists.
I will also start reading up on the Nutch technology to better understand it and to get a better feel for its potential.
Hopefully in the next few months I will begin creating new websites that will cater to vertical search and see where that takes me.
I’ll keep you all posted. In the meantime, if you have any ideas or questions, please feel free to post a comment or two.
May 29, 2007 at 5:50 pm
· Filed under Open Source Software
I’ve recently (two days ago) started experimenting with building my own search engine. The inspiration came from my experimenting with Google’s Co-op Custom Search Engine contraption.
The software I’m using to build my very own Google is Nutch, an open source web search package.
Nutch builds on Lucene Java, adding web-specific pieces such as a crawler, a link-graph database, and parsers for HTML and other document formats.
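For a feel of what a crawler and a link-graph database do, here is a minimal sketch. It is not Nutch code: the `FAKE_WEB` dictionary stands in for real fetching and HTML parsing, and the example.com URLs, the function names and the depth limit are assumptions made purely for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed, max_depth):
    """Breadth-first crawl from seed, returning {page: outlinks} for fetched pages."""
    link_graph = {}
    queue = deque([(seed, 0)])
    seen = {seed}
    while queue:
        url, depth = queue.popleft()
        outlinks = FAKE_WEB.get(url, [])  # stand-in for fetching and parsing HTML
        link_graph[url] = outlinks
        if depth == max_depth:
            continue  # record the page but do not follow its links
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return link_graph


def inlink_counts(link_graph):
    """Count, for every linked-to URL, how many fetched pages point at it."""
    counts = {}
    for outlinks in link_graph.values():
        for target in outlinks:
            counts[target] = counts.get(target, 0) + 1
    return counts


graph = crawl("http://example.com/", max_depth=1)
print(sorted(graph))
print(inlink_counts(graph))
```

With `max_depth=1` the crawl fetches the seed page and the two pages it links to, but stops before following links any further, which is roughly what a shallow experimental crawl looks like.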
The tutorial I used to build my search engine (and it's a great tutorial) can be found here. It guides you through the whole process of building the search engine: setting up the environment for Nutch (basically Java and Tomcat on an Apache server), installing Nutch, crawling and indexing your first site, searching your site, and even using Regex URL Filters to tell the search engine which URLs on a page to crawl and which to ignore.
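The Regex URL Filter idea is easier to see in code. The sketch below imitates a list of `+regex` (crawl) and `-regex` (ignore) rules in plain Python. It is not Nutch itself: the rule patterns, the example.com domain and the function names are my own assumptions, and I am assuming that the first matching rule decides and that a URL matching no rule is ignored.

```python
import re

# Illustrative rules in the '+regex' / '-regex' style; the patterns and the
# example.com domain are assumptions, not copied from any real configuration.
RULES_TEXT = r"""
# skip common image and stylesheet suffixes
-\.(gif|jpg|png|css)$
# skip URLs that look like queries
-[?*!@=]
# accept anything on example.com
+^http://([a-z0-9]*\.)*example\.com/
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def should_crawl(url, rules):
    """First matching rule decides; a URL that matches no rule is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)
for url in [
    "http://www.example.com/news/index.html",
    "http://www.example.com/logo.gif",
    "http://www.example.com/search?q=weather",
    "http://other.org/page.html",
]:
    print(url, should_crawl(url, rules))
```

Rule order matters in this sketch: the exclusions come first so that an image or query URL on the accepted domain is rejected before the domain rule gets a chance to accept it.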
I’ve yet to be successful at crawling my site (or any site for that matter) but will keep you posted on further developments.
In the meantime and for more information about Nutch, please see the Nutch wiki.