Tharga.Crawler

A customizable web crawler framework for .NET, written in C#. It is built from four pluggable components: a downloader, a page processor, a scheduler, and a URI service. With the defaults you can crawl a site in a couple of lines; replace any component to support headless browsers, persistent queues, custom link extraction, or non-HTML content. Targets .NET 8, .NET 9, and .NET 10.

Package

Package	What it does
Tharga.Crawler	`Crawler` + `ICrawler` + `ICrawlerProvider`, the built-in `HttpClientDownloader` / `BasicPageProcessor` / `MemoryScheduler`, plus interfaces (`IDownloader`, `IPageProcessor`, `IScheduler`, `IUriService`) for replacing any component.

Quick start

dotnet add package Tharga.Crawler

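// Assumption: the types live in a namespace named after the package.
using Tharga.Crawler;

// Create a crawler with the built-in components and crawl from the start URI.
var crawler = new Crawler();
var result = await crawler.StartAsync(new Uri("https://example.com/"));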

See Getting started for the full setup walkthrough.

What's in the box

Crawler — orchestrates the crawl, runs page processors in parallel, surfaces events (CrawlerCompleteEvent, PageCompleteEvent, PageFailedEvent). See Getting started.
HttpClientDownloader — built-in IDownloader using HttpClient. Follows 301/302/303/307/308 redirects and tracks the full redirect chain. Configurable via DownloadOptions (retry, timeout, user agent). See Configuration.
BasicPageProcessor — built-in IPageProcessor. Uses HtmlAgilityPack to extract <a href="..."> links and stays within the original domain.
MemoryScheduler — built-in IScheduler. In-memory breadth-first queue with retry tracking. Configurable via SchedulerOptions (max queue size). See Configuration.
UriService — default IUriService that enforces same-domain crawls. Override ShouldEnqueueAsync / MutateUriAsync to filter or rewrite URIs. See Custom services.
Pluggable everything — every component is behind an interface. Swap IDownloader for a headless browser, IScheduler for a database-backed queue, IPageProcessor for non-HTML content. See Custom services.
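
To make the same-domain rule and the breadth-first queue described above more concrete, here is a minimal standalone sketch of the idea. It does not use Tharga.Crawler at all: `IsSameDomain`, `StripFragment`, `TryEnqueue`, and `maxQueueSize` are hypothetical names chosen for illustration, and the library's actual `UriService` and `MemoryScheduler` may apply different rules (for example around subdomains or retries).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Tharga.Crawler API): true when the candidate
// has the same host as the seed, compared case-insensitively.
static bool IsSameDomain(Uri seed, Uri candidate) =>
    string.Equals(seed.Host, candidate.Host, StringComparison.OrdinalIgnoreCase);

// Hypothetical helper: drops the "#fragment" so the queued address is the page itself.
static Uri StripFragment(Uri uri) => new Uri(uri.GetLeftPart(UriPartial.Query));

var seed = new Uri("https://example.com/");
const int maxQueueSize = 100;        // assumed cap, standing in for a "max queue size" option

var queue = new Queue<Uri>();        // first in, first out gives breadth-first order
var seen = new HashSet<Uri>();       // every URI that has ever been scheduled

// Returns true only when the URI was actually added to the queue.
bool TryEnqueue(Uri candidate)
{
    var normalized = StripFragment(candidate);
    if (!IsSameDomain(seed, normalized)) return false; // off-site link
    if (queue.Count >= maxQueueSize) return false;     // queue is full
    if (!seen.Add(normalized)) return false;           // already scheduled
    queue.Enqueue(normalized);
    return true;
}

TryEnqueue(seed);
TryEnqueue(new Uri(seed, "/about#team"));           // accepted, stored without the fragment
TryEnqueue(new Uri("https://other.example.org/"));  // rejected: different host
TryEnqueue(new Uri(seed, "/about"));                // rejected: already scheduled

// Drain the queue in the order the URIs were discovered.
var visited = new List<string>();
while (queue.TryDequeue(out var next))
{
    visited.Add(next.AbsoluteUri);
}

foreach (var address in visited)
{
    Console.WriteLine(address);
}
// Prints:
// https://example.com/
// https://example.com/about
```

In a real crawl, each dequeued page would be downloaded and its links fed back through `TryEnqueue`. This is the point where a custom filter or URI rewrite would plug in.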

Repo

github.com/Tharga/Crawler — source, issues, releases.