Nuxt HN | Ask HN: Scaling a targeted web crawler beyond 500M pages/day

I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").

Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. For product prices it's the latter. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.

The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.

For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its throughput.
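For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The 300-domain figure is only an assumption standing in for "a few hundred", and an even split across domains is unrealistic, but it shows why per-domain request rate, rather than DNS, is the number that matters for a targeted crawl.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rates(pages_per_day: int, num_domains: int) -> tuple[float, float]:
    """Return (overall pages/s, pages/s per domain) assuming an even split."""
    total = pages_per_day / SECONDS_PER_DAY
    per_domain = total / num_domains
    return total, per_domain


# 500M pages/day is from the post; 300 domains is a hypothetical value.
total, per_domain = required_rates(500_000_000, 300)
print(f"{total:,.0f} pages/s overall, {per_domain:.1f} pages/s per domain")
# prints "5,787 pages/s overall, 19.3 pages/s per domain"
```

In practice the load is skewed toward a handful of large retailers, so the busiest domains would see well above that average.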

Does anyone here have experience in this space, or know of articles/blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.

4 comments

faangguyindia 20 hours ago
If you want to access data from websites that block it, you gotta use a headless browser with a residential proxy network like Bright Data (formerly Luminati).
- nicbou 17 hours ago
  Our industry's understanding of consent is terrifying
  - jeong_jeong 13 hours ago
    It’s called hacker news, bro
4lx87 1 day ago
I'm curious, how do you deal with Cloudflare and similar anti-bot systems? Just keep shopping the job around to different proxies?
- faangguyindia 8 hours ago
  it's fairly simple, you use browser profiles and you visit multiple websites like a normal guy, using a residential proxy network,
  and Cloudflare cannot detect you this way.
  the older your browser profile is, the less often Cloudflare bans it.
- fragmede 13 hours ago
  Cloudflare reads this forum. By answering your question here, they burn that workaround. Why would someone do that? (No one bring up Warframe)
fragmede 13 hours ago
Have you already incorporated Common Crawl into your index?