# webcrawler

A command-line web crawler built in Node.js that recursively traverses hyperlinks and collects data from web pages.

Built as a hands-on project for learning how HTTP and the web work under the hood. The crawler recursively follows links starting from a given URL, respects `robots.txt` rules, and handles concurrent page fetching using async patterns.
## Features

- 🔗 Recursive link traversal with a configurable depth limit
- ⚡ Concurrent page fetching via `p-limit` for controlled parallelism
- 🔁 Visited-URL tracking to prevent duplicate requests and infinite loops
- 🤖 `robots.txt` compliance via `robots-parser`
- 🧪 Test suite powered by Jest
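The core ideas above (depth limit, visited set, bounded concurrency) can be sketched in a few lines. This is a simplified illustration, not the project's actual code: `fetchPage` and `extractLinks` are hypothetical stand-ins for the logic in `src/`, and the inline limiter mimics what `p-limit` provides so the sketch stays dependency-free.

```javascript
// Minimal concurrency limiter, standing in for p-limit:
// at most `max` wrapped promises run at once.
function createLimiter(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}

// Recursive crawl with a depth limit and a visited set.
// fetchPage(url) and extractLinks(html, url) are hypothetical helpers.
async function crawl(startUrl, maxDepth, fetchPage, extractLinks) {
  const visited = new Set();      // prevents duplicate requests and loops
  const limit = createLimiter(5); // at most 5 requests in flight

  async function visit(url, depth) {
    if (depth > maxDepth || visited.has(url)) return;
    visited.add(url); // mark before fetching so concurrent branches skip it
    const html = await limit(() => fetchPage(url));
    const links = extractLinks(html, url);
    await Promise.all(links.map((link) => visit(link, depth + 1)));
  }

  await visit(startUrl, 0);
  return visited;
}
```

Marking a URL as visited *before* the fetch resolves is what keeps concurrent branches from requesting the same page twice.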
## Project Structure

```
webcrawler/
├── src/           # Core crawler logic
├── tests/         # Jest test suites
├── main.js        # Entry point
├── package.json
└── .nvmrc         # Node version: 18.7.0
```
## Prerequisites

- Node.js `18.7.0` (use nvm for version management)

## Getting Started

```bash
git clone https://github.com/Lazzar19/webcrawler.git
cd webcrawler
nvm use      # automatically picks up .nvmrc
npm install
```

## Usage

```bash
node main.js <url>
```

Run the test suite with:

```bash
npm test
```

## Dependencies

| Package | Purpose |
|---|---|
| `jsdom` | HTML parsing and link extraction |
| `p-limit` | Concurrency limiter for async requests |
| `robots-parser` | Parses and enforces `robots.txt` rules |
| `jest` | Testing framework (dev dependency) |
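To illustrate what the first and third dependencies do, here is a dependency-free approximation of both jobs. These functions are illustrative stand-ins, not the project's actual API: `jsdom` parses HTML far more robustly than a regex, and `robots-parser` also handles `Allow` rules, wildcards, and per-agent groups.

```javascript
// Pull <a href> values out of HTML and resolve them against the page URL
// using the built-in WHATWG URL class (a rough stand-in for jsdom).
function extractLinks(html, baseUrl) {
  const links = new Set();
  const hrefPattern = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.add(new URL(match[1], baseUrl).href);
    } catch {
      // Skip malformed hrefs instead of aborting the crawl.
    }
  }
  return [...links];
}

// Check a path against the Disallow rules of the "User-agent: *" group
// (a rough stand-in for robots-parser's isAllowed check).
function isAllowedByRobots(robotsTxt, path) {
  let applies = false;
  const disallowed = [];
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim();
    if (/^user-agent:/i.test(line)) {
      applies = line.slice('user-agent:'.length).trim() === '*';
    } else if (applies && /^disallow:/i.test(line)) {
      const rule = line.slice('disallow:'.length).trim();
      if (rule) disallowed.push(rule);
    }
  }
  return !disallowed.some((rule) => path.startsWith(rule));
}
```

Resolving each `href` against the page's own URL is what turns relative links like `/about` into absolute URLs the crawler can queue.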