Tharga.Crawler
A customizable web crawler framework for .NET, written in C#. Four pluggable components — downloader, page processor, scheduler, URI service — let you crawl a site with one line or replace any piece for headless browsers, persistent queues, custom link extraction, or non-HTML content. Targets .NET 8, .NET 9, and .NET 10.
Package
| Package | What it does |
|---|---|
| Tharga.Crawler | Crawler + ICrawler + ICrawlerProvider, the built-in HttpClientDownloader / BasicPageProcessor / MemoryScheduler, plus interfaces (IDownloader, IPageProcessor, IScheduler, IUriService) for replacing any component. |
Quick start
dotnet add package Tharga.Crawler
var crawler = new Crawler();
var result = await crawler.StartAsync(new Uri("https://example.com/"));
See Getting started for the full setup walkthrough.
What's in the box
- `Crawler`: orchestrates the crawl, runs page processors in parallel, and surfaces events (`CrawlerCompleteEvent`, `PageCompleteEvent`, `PageFailedEvent`). See Getting started.
- `HttpClientDownloader`: built-in `IDownloader` using `HttpClient`. Follows 301/302/303/307/308 redirects and tracks the full redirect chain. Configurable via `DownloadOptions` (retry, timeout, user agent). See Configuration.
- `BasicPageProcessor`: built-in `IPageProcessor`. Uses HtmlAgilityPack to extract `<a href="...">` links and stays within the original domain.
- `MemoryScheduler`: built-in `IScheduler`. In-memory breadth-first queue with retry tracking. Configurable via `SchedulerOptions` (max queue size). See Configuration.
- `UriService`: default `IUriService` that enforces same-domain crawls. Override `ShouldEnqueueAsync` / `MutateUriAsync` to filter or rewrite URIs. See Custom services.
- Pluggable everything: every component is behind an interface. Swap `IDownloader` for a headless browser, `IScheduler` for a database-backed queue, `IPageProcessor` for non-HTML content. See Custom services.
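
To make the "filter or rewrite URIs" idea concrete, here is a minimal standalone sketch of the decisions a same-domain filter and a URI rewrite step might make. It is not the library's implementation: `SameDomainSketch`, `ShouldEnqueue`, and `StripFragment` are hypothetical names, the exact-host comparison is an assumption about what "same domain" means, and the real `ShouldEnqueueAsync` / `MutateUriAsync` signatures are documented under Custom services rather than shown here.

```csharp
using System;

// Hypothetical helper, NOT part of Tharga.Crawler. It only illustrates the
// kind of logic a custom URI service could apply before a link is queued.
public static class SameDomainSketch
{
    // Returns true when the candidate is an absolute http(s) URI whose host
    // matches the root host exactly (assumption: subdomains count as foreign).
    public static bool ShouldEnqueue(Uri root, Uri candidate)
    {
        if (!candidate.IsAbsoluteUri) return false;

        var isWeb = candidate.Scheme == Uri.UriSchemeHttp
                 || candidate.Scheme == Uri.UriSchemeHttps;
        if (!isWeb) return false;

        return string.Equals(root.Host, candidate.Host,
                             StringComparison.OrdinalIgnoreCase);
    }

    // Rewrites an absolute URI by dropping its fragment, so that
    // "/page#a" and "/page#b" are treated as the same page.
    public static Uri StripFragment(Uri uri)
    {
        return new Uri(uri.GetLeftPart(UriPartial.Query));
    }
}
```

With this sketch, a link to `https://example.com/about` found while crawling `https://example.com/` would be kept, while `https://other.example.org/` and `mailto:` links would be skipped. Whether subdomains such as `www.example.com` should count as the same site is a policy choice you would make in your own override.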
Repo
github.com/Tharga/Crawler — source, issues, releases.