
🕷️ Web Crawler

A command-line web crawler built in Node.js that recursively traverses hyperlinks and collects data from web pages.

About

Built as a hands-on project for learning how HTTP and the web work under the hood. The crawler recursively follows links starting from a given URL, respects robots.txt rules, and handles concurrent page fetching using async patterns.

Features

  • 🔗 Recursive link traversal with configurable depth limit
  • ⚡ Concurrent page fetching via p-limit for controlled parallelism
  • 🔁 Visited URL tracking to prevent duplicate requests and infinite loops
  • 🤖 robots.txt compliance via robots-parser
  • 🧪 Test suite powered by Jest
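Put together, the features above suggest a crawl loop like the following sketch. It is illustrative, not the repo's actual code: the inline `pLimitSketch` stands in for the p-limit package (which returns a `limit` function of the same shape), and `fetchPage` is a hypothetical callback that fetches a page and returns the absolute URLs found on it.

```javascript
// Minimal crawl-loop sketch: a visited set prevents duplicate requests and
// infinite loops, maxDepth bounds the recursion, and a small inline limiter
// caps how many fetches run in parallel. The real project uses p-limit, whose
// pLimit(n) returns a function with the same shape as pLimitSketch(n).
function pLimitSketch(concurrency) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= concurrency || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}

async function crawl(startURL, fetchPage, { maxDepth = 2, concurrency = 5 } = {}) {
  const visited = new Set();
  const limit = pLimitSketch(concurrency);

  async function visit(url, depth) {
    if (depth > maxDepth || visited.has(url)) return;
    visited.add(url);
    // Only the network fetch goes through the limiter, so recursion
    // into child links can never deadlock the queue.
    const links = await limit(() => fetchPage(url));
    await Promise.all(links.map((link) => visit(link, depth + 1)));
  }

  await visit(startURL, 0);
  return visited;
}
```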

Project Structure

webcrawler/
├── src/           # Core crawler logic
├── tests/         # Jest test suites
├── main.js        # Entry point
├── package.json
└── .nvmrc         # Node version: 18.7.0

Getting Started

Prerequisites

  • Node.js 18.7.0 (use nvm for version management)
nvm use   # automatically picks up .nvmrc

Install

git clone https://github.com/Lazzar19/webcrawler.git
cd webcrawler
npm install

Run

node main.js <url>

Tests

npm test

Dependencies

| Package | Purpose |
| --- | --- |
| jsdom | HTML parsing and link extraction |
| p-limit | Concurrency limiter for async requests |
| robots-parser | Parses and respects robots.txt rules |
| jest | Testing framework (dev dependency) |

Author

Lazar Nikolic · GitHub · LinkedIn
