Content

DelcySearch

DelcySearch

Massive web search engine using Hadoop, Inverted Index and PageRank

DelcySearch is a large-scale web search engine prototype designed to process and retrieve information from massive collections of web pages.

The system implements fundamental technologies used in modern search engines such as:

  • Distributed data processing with Hadoop
  • Inverted Index for efficient document retrieval
  • PageRank for ranking the importance of web pages
  • Large-scale web crawling and indexing pipelines

DelcySearch demonstrates how large search platforms work internally by combining distributed computing, graph algorithms, and information retrieval techniques.

🎥 System Demo

🔎 Core Features

  • 🌐 Massive web document indexing
  • Fast keyword-based search
  • 📊 Page ranking using PageRank algorithm
  • 🗂️ Distributed inverted index construction
  • 🧠 Efficient query processing
  • ☁️ Distributed processing with Hadoop MapReduce
  • 📈 Scalable architecture for large datasets

⚙️ How the Search Engine Works

The system follows a pipeline similar to modern search engines:

Web Pages Dataset


Data Processing (Hadoop)


Inverted Index Generation


PageRank Calculation


Search Query Engine


Ranked Results to the User

Processing Steps

1️⃣ Large datasets of web pages are collected
2️⃣ Hadoop processes the documents in parallel
3️⃣ An Inverted Index is generated for fast word lookup
4️⃣ The PageRank algorithm computes page importance
5️⃣ User queries return ranked results based on relevance

🏗️ Project Structure

DelcySearch/
├── crawler/
│   └── scripts for collecting web pages

├── hadoop_jobs/
│   ├── inverted_index/
│   └── pagerank/

├── search_engine/
│   ├── query_processor/
│   └── ranking/

├── datasets/
│   └── sample web datasets

└── README.md

🧰 Technologies Used

Data Processing

  • Hadoop
  • MapReduce
  • Distributed Computing

Algorithms

  • Inverted Index
  • PageRank

Backend

  • Python
  • Java

Data

  • Large Web Datasets
  • Text Processing

🚀 Getting Started

1️⃣ Clone the repository

git clone https://github.com/checho1402/DelcySearch.git
cd DelcySearch

2️⃣ Prepare Hadoop environment

Make sure Hadoop is installed and configured.

3️⃣ Run the Inverted Index job

hadoop jar inverted_index.jar

4️⃣ Run the PageRank job

hadoop jar pagerank.jar

5️⃣ Execute search queries

Run the query processor to retrieve ranked results.

🧠 Concepts Implemented

DelcySearch demonstrates several core concepts in Information Retrieval and Distributed Systems:

  • Distributed processing of large text corpora
  • Graph-based ranking algorithms
  • Search indexing strategies
  • Parallel data processing pipelines

These concepts are fundamental for large-scale search engines.

📧 Contact

Made with ❤️ by Sergio Ramos | 2024