i a n  l i
Gonzo Search Engine
A Google-like search engine in Java

Autumn 2002
CSE 454, Advanced Internet Systems
Taught by Dr. Daniel S. Weld

Final report, architecture diagrams, and API documentation

Gonzo is a search engine implemented entirely in Java. Its architecture is based on the search engine described in the "Google" paper [1]. The implementation consists of the following modules (PDF diagram):

  • a repository for storing the pages fetched by the crawler
  • an indexer which indexes the words in all stored pages
  • a sorter which sorts the words for easy retrieval
  • a PageRank calculator which computes the importance of pages
  • a server which is used to receive query requests and respond with results
  • a web interface as the front end for the search engine
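To make the indexer and sorter stages concrete, here is a minimal sketch of a word-to-pages index. It is an illustration only, not Gonzo's actual code: the class name `TinyIndex` and its methods are hypothetical, and a `TreeMap` stands in for the separate sorter module by keeping the words in sorted order as they are added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the indexer and sorter stages; not Gonzo's actual classes.
final class TinyIndex {
    // TreeMap keeps the words sorted, standing in for a separate sorting pass.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Indexer step: record that each word in the page text occurs in page docId.
    void addPage(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    // Query step: ids of the pages containing the word, in ascending order.
    List<Integer> lookup(String word) {
        Set<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // All indexed words in sorted order.
    List<String> words() {
        return new ArrayList<>(postings.keySet());
    }
}
```

In this sketch a query server would call `lookup` for each query term and then order the resulting pages, for example by their PageRank score.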

The only downside to using Java for this search engine is its memory footprint. We store a dictionary of words which can contain hundreds of thousands of strings. Unfortunately, each String object in Java takes up about 50 bytes, so the indexer would often run out of memory while indexing. To resolve this problem, I made some compromises to reduce the footprint: I implemented CompactString, which stores 1-byte characters instead of Java's 2-byte Unicode characters. CompactString also provides fewer methods than String.
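The CompactString source is not included in this report, so the following is only a sketch of how such a class might look. It assumes the text fits in ISO-8859-1 (one byte per character); the method set shown here is a guess at a useful minimum, not the original API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; the real CompactString may differ in every detail.
final class CompactString implements Comparable<CompactString> {
    // One byte per character, encoded as ISO-8859-1.
    private final byte[] bytes;

    CompactString(String s) {
        // Assumption: characters outside ISO-8859-1 are acceptable losses;
        // the encoder replaces them with '?'.
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    int length() {
        return bytes.length;
    }

    char charAt(int i) {
        return (char) (bytes[i] & 0xFF); // mask to treat the byte as unsigned
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CompactString
                && Arrays.equals(bytes, ((CompactString) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    // Lexicographic comparison on unsigned byte values, so words can be sorted.
    @Override
    public int compareTo(CompactString other) {
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int d = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return bytes.length - other.bytes.length;
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Because it defines `equals`, `hashCode`, and `compareTo`, a class like this can serve as a key in a hash or sorted dictionary. The trade-off is that any character outside the 1-byte range is lost.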

[1] Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Stanford University, 1999.