Ghosts in the Machine.
How skilled developers scrape the web at scale — without leaving a trace.

You followed a tutorial. You wrote the script. You ran it — and for a glorious thirty seconds, data poured in. Then: 403 Forbidden. Silence. The website kicked you out.
Sound familiar? You’re not alone. Basic scrapers work great in tutorials, but the real web is a different animal. Websites have rate limiters, bot detectors, CAPTCHAs, and pages that don’t even load without JavaScript. Getting blocked isn’t a sign you did something wrong — it’s a sign your scraper was too obvious.
Websites can’t stop you from reading public data. They can only make it harder for scripts that don’t look human. The solution is to look more human.
In this post, we’ll cover three techniques that will take your scraper from “blocked on first run” to “quietly collecting data like a professional.” But first — the rules of the road.
The permission slip: robots.txt
Before you scrape anything
Every well-behaved website publishes a file called robots.txt. It lives at the root of the domain — you can see it right now by visiting any major site and adding /robots.txt to the URL (try https://github.com/robots.txt).
Think of it as a sticky note on the front door: “Bots, here’s what you can and can’t touch.” A typical file looks like this:
```
User-agent: *
Disallow: /private/
Disallow: /accounts/
Allow: /blog/
Crawl-delay: 5
```

Translation: all bots are welcome to scrape /blog/, but stay away from private and account pages — and wait 5 seconds between requests.
Python’s standard library makes checking this trivially easy:
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

target = "https://example.com/blog/post-42"
if rp.can_fetch("*", target):
    print("✓ Allowed — safe to scrape")
else:
    print("✗ Disallowed — skipping this URL")
```

⚠ Legal heads-up: Scraping pages that robots.txt explicitly disallows has been the basis for real legal action in multiple countries. Beyond the ethics, it’s a practical risk. Before building anything serious, also check: does the site have a public API? Using it is always faster, more stable, and completely above board.
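The same parser can also read the Crawl-delay line programmatically via `crawl_delay()`, so your scraper can honour the site's requested pacing instead of guessing. A small sketch — here feeding `parse()` the example file from above directly, which is handy for testing without a network call:

```python
from urllib.robotparser import RobotFileParser

# parse() accepts an iterable of lines, so we can feed it
# the example robots.txt from earlier without fetching anything
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
Disallow: /accounts/
Allow: /blog/
Crawl-delay: 5
""".splitlines())

# crawl_delay() returns the delay for a user agent, or None if the file sets none
delay = rp.crawl_delay("*")  # → 5
print(f"Site asks for a {delay}s pause between requests")
```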
Technique 1 — Slow down, look human
Rate limiting & browser headers
Here’s the irony: the most effective anti-blocking technique costs you nothing. A basic scraper gets banned not because it’s clever, but because it’s fast. While a human clicks a link every few seconds, a script can fire 50 requests per second. That pattern is unmistakeable.
Two small fixes solve most blocking problems immediately:
1. Add random delays between requests using random.uniform(). A fixed 2-second pause is easy for bot detectors to spot — a randomised 2–6 second range looks much more human.
2. Set realistic browser headers. By default, Python’s requests library announces itself as python-requests/2.x — an immediate red flag. Send a real Chrome User-Agent string and a couple of supporting headers instead.
```python
import requests, time, random

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

urls = ["https://example.com/page-1", "https://example.com/page-2"]
for url in urls:
    response = requests.get(url, headers=headers)
    print(response.status_code, url)
    time.sleep(random.uniform(2, 6))  # human-ish pause
```

✓ Quick win: Just adding proper headers and random delays solves the vast majority of basic scraping blocks. Many sites aren’t running sophisticated bot detection — they’re simply looking for the most obvious tells. Start here before reaching for anything more complex.
Technique 2 — Rotate your identity
Proxy pools
Even a perfectly-paced, well-headered scraper will eventually trigger rate limits if you’re hitting the same site from the same IP address all day. Websites keep logs. After enough requests from one address, that address gets flagged — or quietly throttled.
The solution is proxy rotation: instead of every request coming from your IP, they come from a constantly-changing pool of IP addresses.
What’s a proxy, exactly? A proxy is a middleman server. Your request goes to the proxy, the proxy forwards it to the website, and the website sees the proxy’s IP — not yours. Rotate through many proxies and no single IP ever builds up a suspicious request count.
```python
import requests, random

proxy_pool = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def scrape(url):
    proxy = random.choice(proxy_pool)
    try:
        r = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        return r
    except Exception as e:
        print(f"Proxy failed ({e}), try again")
        return None
```

⚠ Free proxies — a word of caution: Free public proxy lists are tempting, but most are already flagged by major sites, and many are slow or unreliable. For anything beyond casual experimentation, a paid service like Bright Data, Oxylabs, or ScraperAPI manages clean residential IPs for you — worth the cost if your data actually matters.
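Note that `scrape()` gives up after a single failed proxy. In practice you usually want to rotate to a fresh proxy and retry a few times before giving up — a minimal sketch of that pattern, assuming the same hypothetical `proxy_pool`:

```python
import requests, random

proxy_pool = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def scrape_with_retry(url, attempts=3):
    """Try up to `attempts` different proxies before giving up."""
    tried = []
    for _ in range(attempts):
        # Prefer proxies we haven't tried yet on this call
        candidates = [p for p in proxy_pool if p not in tried] or proxy_pool
        proxy = random.choice(candidates)
        tried.append(proxy)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
        except requests.RequestException as e:
            print(f"Proxy {proxy} failed ({e}), rotating…")
    return None  # every attempt failed
```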
Technique 3 — Headless browsers
When the page needs JavaScript to exist
Here’s a curveball the modern web throws at scrapers: many pages don’t actually contain their data in the HTML you download. They send you a near-empty skeleton, then use JavaScript to fetch and render the real content. If you scrape the raw HTML, you get nothing useful.
The fix is a headless browser — a real browser engine (like Chrome) that runs invisibly, executes all the JavaScript, waits for everything to load, and then lets you read the fully-rendered page.
The go-to tool for this in Python is Playwright:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/js-rendered-page")

    # Wait until the content we want actually appears
    page.wait_for_selector(".product-grid")

    products = page.query_selector_all(".product-card")
    for item in products:
        name = item.query_selector(".name").inner_text()
        price = item.query_selector(".price").inner_text()
        print(f"{name} → {price}")

    browser.close()
```

Playwright vs Selenium: Selenium is the classic choice and still perfectly usable. Playwright is the modern upgrade: faster startup, cleaner API, better async support, and built-in waiting for elements. For new projects, Playwright is the easier path. That said — only reach for a headless browser when a site actually requires JavaScript. For static HTML pages, plain requests + BeautifulSoup is far faster and lighter.
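For comparison, the lightweight static-HTML path looks like this — a sketch using requests + BeautifulSoup, assuming the page embeds its data directly in the HTML (the URL and the `.product-card` selectors are hypothetical, mirroring the Playwright example):

```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 Chrome/120.0.0.0 Safari/537.36"}

# Fetch the raw HTML — no browser engine, no JavaScript execution
resp = requests.get("https://example.com/static-page",
                    headers=headers, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Same CSS-selector logic as the Playwright example, but on static HTML
for item in soup.select(".product-card"):
    name = item.select_one(".name")
    price = item.select_one(".price")
    if name and price:  # skip cards missing either field
        print(name.get_text(strip=True), "→", price.get_text(strip=True))
```

If this returns an empty page or missing elements, that's your signal the content is JavaScript-rendered and you need the headless-browser approach instead.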
The full picture — all three techniques together
Here’s what a production-ready, well-behaved scraper looks like when you combine everything:
```python
import requests, time, random
from urllib.robotparser import RobotFileParser

# Step 1 — check what we're allowed to scrape
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

headers = {
    "User-Agent": "Mozilla/5.0 Chrome/120.0.0.0 Safari/537.36"
}
proxies = ["http://p1:8080", "http://p2:8080", "http://p3:8080"]
urls = ["https://example.com/data/1", "https://example.com/data/2"]

for url in urls:
    # Step 2 — skip if robots.txt says no
    if not rp.can_fetch("*", url):
        print(f"Skipping (not allowed): {url}")
        continue

    # Step 3 — pick a random proxy
    proxy = random.choice(proxies)

    # Step 4 — make the request
    resp = requests.get(url, headers=headers,
                        proxies={"https": proxy}, timeout=10)
    print(f"{resp.status_code} {url}")

    # Step 5 — breathe
    time.sleep(random.uniform(3, 8))
```

Quick reference — the four tools
| Tool | What it does | When to use it |
| --- | --- | --- |
| robots.txt | Checks what you’re allowed to scrape | Always, before anything else |
| Human headers + delays | Masks bot-like behaviour | Every project, no exceptions |
| Proxy rotation | Spreads requests across many IPs | High-volume or long-running scrapes |
| Playwright | Runs a real browser to load JS | Only when the site requires JavaScript |
Key takeaway
Scrape like a guest, not a bulldozer.
Most scraping blocks are preventable. Check robots.txt, slow down your requests, randomise your delays, and rotate your headers and proxies. That combination handles the overwhelming majority of real-world blocking scenarios.
Reach for Playwright only when a site actually needs JavaScript — it’s powerful but slower, so don’t use it by default. And always ask: is there a public API? If there is, use it. It’s faster, cleaner, and there’s no ethical grey area.
Scraping is one of the most practical data skills you can build. Treat websites with respect, and they’ll keep being useful to you.

