DelcySearch

A large-scale web search engine built with Hadoop, an inverted index, and PageRank

DelcySearch is a large-scale web search engine prototype designed to process and retrieve information from massive collections of web pages.

The system implements fundamental technologies used in modern search engines, such as:

Distributed data processing with Hadoop
Inverted Index for efficient document retrieval
PageRank for ranking the importance of web pages
Large-scale web crawling and indexing pipelines

DelcySearch demonstrates how large search platforms work internally by combining distributed computing, graph algorithms, and information retrieval techniques.

🔎 Core Features

🌐 Massive web document indexing
⚡ Fast keyword-based search
📊 Page ranking using PageRank algorithm
🗂️ Distributed inverted index construction
🧠 Efficient query processing
☁️ Distributed processing with Hadoop MapReduce
📈 Scalable architecture for large datasets

⚙️ How the Search Engine Works

The system follows a pipeline similar to modern search engines:

Web Pages Dataset
        │
        ▼
Data Processing (Hadoop)
        │
        ▼
Inverted Index Generation
        │
        ▼
PageRank Calculation
        │
        ▼
Search Query Engine
        │
        ▼
Ranked Results to the User

Processing Steps

1️⃣ Large datasets of web pages are collected
2️⃣ Hadoop processes the documents in parallel
3️⃣ An Inverted Index is generated for fast word lookup
4️⃣ The PageRank algorithm computes page importance
5️⃣ The search query engine looks up the query terms in the index and returns results ranked by relevance
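
To make step 3 concrete, here is a minimal single-process sketch of how an inverted index can be built in MapReduce style. It is not the repository's Hadoop job: the function names (tokenize, map_phase, reduce_phase, build_inverted_index), the tokenization rule, and the sample documents are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per token occurrence."""
    for term in tokenize(text):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Reducer: turn the doc ids seen for one term into {doc_id: term frequency}."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, dict(sorted(counts.items()))


def build_inverted_index(docs):
    """Simulate map -> shuffle -> reduce in a single process."""
    grouped = defaultdict(list)  # shuffle step: group emitted values by term
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample documents, not taken from the datasets/ folder.
docs = {
    "page1": "Hadoop processes web pages in parallel",
    "page2": "PageRank ranks web pages",
}
index = build_inverted_index(docs)
print(index["web"])     # {'page1': 1, 'page2': 1}
print(index["hadoop"])  # {'page1': 1}
```

On a real cluster, Hadoop would run the mapper and reducer on different machines and perform the grouping itself; the dictionary used here only stands in for that shuffle step.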

🏗️ Project Structure

DelcySearch/
├── crawler/
│   └── scripts for collecting web pages
│
├── hadoop_jobs/
│   ├── inverted_index/
│   └── pagerank/
│
├── search_engine/
│   ├── query_processor/
│   └── ranking/
│
├── datasets/
│   └── sample web datasets
│
└── README.md

🧰 Technologies Used

Data Processing

Hadoop
MapReduce
Distributed Computing

Algorithms

Inverted Index
PageRank

Backend

Python
Java

Data

Large Web Datasets
Text Processing

🚀 Getting Started

1️⃣ Clone the repository

git clone https://github.com/checho1402/DelcySearch.git
cd DelcySearch

2️⃣ Prepare Hadoop environment

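Make sure Hadoop is installed and configured, and that the hadoop command is available on your PATH. Running hadoop version should print the installed release.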

3️⃣ Run the Inverted Index job

hadoop jar inverted_index.jar

4️⃣ Run the PageRank job

hadoop jar pagerank.jar

5️⃣ Execute search queries

Run the query processor (under search_engine/query_processor/) to retrieve ranked results.
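
As a rough illustration of what a query processor could do with the outputs of the two jobs, the sketch below intersects the posting lists of all query terms and orders the matches by PageRank score. The search function, the data layout, and the ranking rule (PageRank only, with ties broken by document id) are assumptions; the project's actual relevance formula is not documented here.

```python
def search(query, index, pagerank, top_k=10):
    """Return up to top_k doc ids containing every query term, best score first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Sort posting sets by size so the smallest one drives the intersection.
    postings = sorted((set(index.get(term, ())) for term in terms), key=len)
    matches = set.intersection(*postings)
    # Order by PageRank score (descending), breaking ties by doc id.
    ranked = sorted(matches, key=lambda doc: (-pagerank.get(doc, 0.0), doc))
    return ranked[:top_k]


# Hypothetical toy data standing in for the inverted index and PageRank outputs.
index = {
    "web": {"page1": 1, "page2": 1, "page3": 2},
    "search": {"page2": 3, "page3": 1},
}
pagerank = {"page1": 0.2, "page2": 0.3, "page3": 0.5}

print(search("web search", index, pagerank))  # ['page3', 'page2']
print(search("web hadoop", index, pagerank))  # []
```

Requiring every term to match keeps the example short; a fuller ranking function would typically also weigh term frequency alongside the PageRank score.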

🧠 Concepts Implemented

DelcySearch demonstrates several core concepts in Information Retrieval and Distributed Systems:

Distributed processing of large text corpora
Graph-based ranking algorithms
Search indexing strategies
Parallel data processing pipelines

These concepts are fundamental for large-scale search engines.
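
For the graph-based ranking concept, the following is a minimal sketch of PageRank computed by repeated iteration over a small link graph. It runs in one process rather than as a MapReduce job, and the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the example graph are conventional or illustrative choices, not values taken from this repository.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [outgoing links]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Pages with no outgoing links spread their rank over all pages.
                dangling += damping * ranks[page]
        for page in pages:
            new_ranks[page] += dangling / n
        ranks = new_ranks
    return ranks


# Hypothetical four-page link graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # C
```

In a MapReduce setting, each pass of the outer loop would usually be one job: mappers emit rank shares along outgoing links and reducers sum the shares arriving at each page.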

📧 Contact

Email: slrv.ramosv@gmail.com
LinkedIn: sergioramosvillena
Phone: +51 932416666

Made with ❤️ by Sergio Ramos | 2024