Gonzo Search Engine
A Google-like search engine
in Java
Autumn 2002
CSE 454, Advanced Internet Systems
Taught by Dr. Daniel S. Weld
Final report, architecture diagrams, and API documentation
Gonzo is a search engine implemented entirely in Java. Its architecture is based on the search engine described in the "Google" paper [1]. The implementation consists of the following modules (PDF diagram):
- a repository for storing the pages fetched by the crawler
- an indexer which indexes the words in all indexed pages
- a sorter which sorts the words for easy retrieval
- a PageRank calculator which computes the importance of each page from the link structure
- a server which is used to receive query requests and respond with results
- a web interface as the front end for the search engine
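To make the PageRank module more concrete, here is a minimal sketch of the power-iteration idea from the Google paper, which scores pages by the links pointing to them. It is not Gonzo's actual code: the class name PageRankSketch, the adjacency-array input, and the handling of pages with no outgoing links (their rank is spread evenly over all pages) are assumptions made for illustration. The scores are normalized so that they sum to 1.

```java
import java.util.Arrays;

// Illustrative sketch only; the names and the input format are hypothetical.
final class PageRankSketch {
    /**
     * links[i] lists the indices of the pages that page i links to.
     * Returns one score per page; the scores sum to 1.
     */
    static double[] compute(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no out-links

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    // Each page splits its rank evenly among its out-links.
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }

            // Apply the damping factor; dangling rank is spread over all pages
            // (an assumption of this sketch).
            for (int i = 0; i < n; i++) {
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, PageRankSketch.compute(new int[][] {{1, 2}, {2}, {0}}, 0.85, 50) scores a three-page graph in which page 1 ends up with the lowest score, since only half of page 0's rank flows to it.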
The only downside to using Java for this search engine is its memory footprint. We store a dictionary of words that can contain hundreds of thousands of strings. Unfortunately, each String object in Java takes up about 50 bytes, so the indexer would often run out of memory while indexing. To resolve this problem, I made some compromises to reduce the footprint: I implemented CompactString, which stores 1-byte characters instead of the 2-byte Unicode characters used by String, and which offers fewer methods.
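The idea behind CompactString can be sketched as follows. This is a hypothetical reconstruction, not the original class: the method set, the choice of ISO-8859-1 as the 1-byte encoding, and the use of java.nio.charset.StandardCharsets (which requires a newer Java than was available in 2002) are all assumptions. Characters outside the 1-byte range are replaced with '?', which is the kind of compromise the approach implies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of a 1-byte-per-character string; not the original source.
final class CompactString implements Comparable<CompactString> {
    private final byte[] bytes; // one byte per character (ISO-8859-1 assumed)

    CompactString(String s) {
        // getBytes replaces characters that cannot be encoded with '?'.
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    int length() {
        return bytes.length;
    }

    char charAt(int i) {
        return (char) (bytes[i] & 0xFF); // mask to treat the byte as unsigned
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CompactString
                && Arrays.equals(bytes, ((CompactString) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes); // consistent with equals, so usable as a map key
    }

    @Override
    public int compareTo(CompactString other) {
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int diff = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return bytes.length - other.bytes.length;
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Because equals, hashCode, and compareTo are defined, such a class could serve as a key in a hash-based or sorted dictionary of words, which is the use the indexer needs.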