Master Python Web Scraping in 2026
Python is the go-to language for web scraping, offering powerful libraries like Requests, BeautifulSoup, and Scrapy. Combined with residential proxies, you can scrape most sites at scale while keeping blocks to a minimum.
This guide covers everything from basic scraping to production-ready systems handling millions of pages. We'll show you practical code examples, anti-blocking strategies, and how to integrate residential proxies.
Production-Ready Code
Copy-paste examples with error handling, retries, and proxy rotation.
Bypass Anti-Bot Systems
Get past Cloudflare, reCAPTCHA, and other major bot-detection systems far more reliably with residential IPs.
Scale to Millions
Architecture patterns for scraping 1M+ pages without bans.
Setup: Install Python Libraries
Install the essential scraping libraries:
# Core scraping libraries
pip install requests beautifulsoup4 lxml
# For advanced scraping
pip install scrapy playwright
# For data processing
pip install pandas
# Verify installation
python -c "import requests; print('Requests:', requests.__version__)"Method 1: Requests + BeautifulSoup (Beginner-Friendly)
Best for simple scraping tasks. Easy to learn, perfect for static HTML pages, APIs, and small-scale projects.
Basic Scraping with Residential Proxies
import requests
from bs4 import BeautifulSoup
import time
import random
# netdash residential proxy configuration
PROXY = "http://username:password@gate.netdash.io:8080"
proxies = {
"http": PROXY,
"https": PROXY
}
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
}
def scrape_page(url):
"""
Scrape a single page with residential proxy
"""
try:
# Add human-like delay
time.sleep(random.uniform(2, 5))
response = requests.get(
url,
proxies=proxies,
headers=headers,
timeout=15
)
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'lxml')
# Extract data (example: product details)
title = soup.find('h1', class_='product-title')
price = soup.find('span', class_='price')
description = soup.find('div', class_='description')
return {
'title': title.text.strip() if title else None,
'price': price.text.strip() if price else None,
'description': description.text.strip() if description else None,
'url': url
}
else:
print(f"Failed: {response.status_code}")
return None
except Exception as e:
print(f"Error scraping {url}: {e}")
return None
# Example: Scrape multiple product pages
product_urls = [
"https://example.com/product/1",
"https://example.com/product/2",
"https://example.com/product/3",
# ... add more URLs
]
products = []
for url in product_urls:
data = scrape_page(url)
if data:
products.append(data)
print(f"✓ Scraped: {data['title']}")
print(f"\nTotal products scraped: {len(products)}")Advanced: With Retry Logic & Error Handling
import requests
from bs4 import BeautifulSoup
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
class ProductScraper:
def __init__(self, proxy_url):
self.proxies = {
"http": proxy_url,
"https": proxy_url
}
# Setup session with retry strategy
self.session = requests.Session()
retry_strategy = Retry(
total=3,
backoff_factor=2,
status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
self.session.mount("http://", adapter)
self.session.mount("https://", adapter)
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})
def scrape(self, url, max_retries=3):
"""
Scrape with automatic retries and error handling
"""
for attempt in range(max_retries):
try:
# Random delay (2-5 seconds)
time.sleep(random.uniform(2, 5))
response = self.session.get(
url,
proxies=self.proxies,
timeout=15
)
if response.status_code == 200:
return self._parse_response(response)
elif response.status_code == 403:
print(f"⚠️ Blocked (attempt {attempt + 1}/{max_retries})")
continue
else:
print(f"Status {response.status_code}")
continue
except requests.exceptions.Timeout:
print(f"Timeout (attempt {attempt + 1}/{max_retries})")
continue
except requests.exceptions.RequestException as e:
print(f"Error: {e}")
continue
return None
def _parse_response(self, response):
"""
Parse HTML and extract data
"""
soup = BeautifulSoup(response.text, 'lxml')
# Your parsing logic here; guard against missing elements
title = soup.find('h1', class_='title')
price = soup.find('span', class_='price')
data = {
'title': title.text.strip() if title else None,
'price': price.text.strip() if price else None,
# ... extract more fields
}
return data
# Usage
scraper = ProductScraper(
proxy_url="http://username:password@gate.netdash.io:8080"
)
urls = ["https://example.com/product/1", "..."]
results = [scraper.scrape(url) for url in urls]
print(f"Scraped {len([r for r in results if r])} / {len(urls)} pages")Method 2: Scrapy (Production-Scale Scraping)
Scrapy is Python's most powerful scraping framework. It has built-in asynchronous requests and proxy middleware, and it scales to millions of pages. Perfect for large projects.
# Scrapy settings with netdash residential proxies
# Enable proxy middleware
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Configure residential proxy
# (HttpProxyMiddleware does not read this constant by itself;
# the spider below applies it per request via request.meta['proxy'])
PROXY = 'http://username:password@gate.netdash.io:8080'
# Respectful scraping settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 2 # 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True
# User-Agent rotation
# (USER_AGENT_LIST is not a built-in Scrapy setting; a custom downloader
# middleware has to pick from this list on each request)
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
# Retry settings
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Auto-throttle (adjust speed based on response)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
import scrapy
from scrapy.http import Request
class ProductSpider(scrapy.Spider):
name = 'products'
start_urls = ['https://example.com/products?page=1']
# Custom settings for this spider
custom_settings = {
'CONCURRENT_REQUESTS': 16,
'DOWNLOAD_DELAY': 2,
}
def start_requests(self):
"""
Add proxy to all requests
"""
proxy = 'http://username:password@gate.netdash.io:8080'
for url in self.start_urls:
yield Request(
url,
callback=self.parse,
meta={'proxy': proxy},
dont_filter=True
)
def parse(self, response):
"""
Parse product listing page
"""
# Extract product links
product_links = response.css('a.product-link::attr(href)').getall()
for link in product_links:
yield response.follow(
link,
callback=self.parse_product,
meta={'proxy': response.meta['proxy']}
)
# Follow pagination
next_page = response.css('a.next-page::attr(href)').get()
if next_page:
yield response.follow(
next_page,
callback=self.parse,
meta={'proxy': response.meta['proxy']}
)
def parse_product(self, response):
"""
Parse individual product page
"""
yield {
'title': response.css('h1.product-title::text').get(),
'price': response.css('span.price::text').get(),
'description': response.css('div.description::text').get(),
'rating': response.css('span.rating::text').get(),
'reviews': response.css('span.review-count::text').get(),
'url': response.url,
}
# Run spider:
# scrapy crawl products -o products.json
Best Practices for Python Web Scraping
Common Challenges & Solutions
Challenge: JavaScript-Rendered Content
Problem: Many modern sites use React/Vue/Angular. The Requests library only sees the initial HTML shell, not the content rendered by JavaScript.
Solutions:
- Use Playwright or Selenium (slower, but executes JavaScript; see the sketch after this list)
- Reverse-engineer the API (find the XHR requests in DevTools)
- Use Scrapy-Playwright for production-scale JS scraping
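If you take the Playwright route, here is a minimal sketch of rendering a JavaScript-heavy page through a residential proxy. It reuses the netdash gateway from the examples above; the target URL and CSS selectors are placeholders, and you need to run "playwright install chromium" once after installing the package.
from playwright.sync_api import sync_playwright

# Reuses the netdash gateway from earlier examples; URL and selectors are placeholders
PROXY_SERVER = "http://gate.netdash.io:8080"
PROXY_USER = "username"
PROXY_PASS = "password"

def scrape_js_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": PROXY_SERVER, "username": PROXY_USER, "password": PROXY_PASS}
        )
        page = browser.new_page()
        # Wait until network activity settles so JS-injected content is present
        page.goto(url, wait_until="networkidle", timeout=30000)
        title = page.text_content("h1.product-title")
        price = page.text_content("span.price")
        browser.close()
        return {"title": title, "price": price, "url": url}

print(scrape_js_page("https://example.com/product/1"))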
Challenge: Getting Blocked / CAPTCHAs
Problem: Site shows CAPTCHAs or returns 403 errors.
Solutions:
- Switch to residential proxies (95%+ success vs 20% for datacenter)
- Slow down the request rate (add 3-5 second delays)
- Fix your headers (send a realistic User-Agent and Accept-Language; a rotation sketch follows this list)
- Use netdash sticky sessions for multi-step flows
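As a quick illustration of the header fixes, this sketch picks a realistic User-Agent at random on every request. The strings below are examples only; pair the function with the proxies dict from Method 1.
import random
import requests

# Example User-Agent strings; swap in current browser versions for real use
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def get_with_rotating_headers(url, proxies=None):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return requests.get(url, headers=headers, proxies=proxies, timeout=15)

# Usage: response = get_with_rotating_headers("https://example.com", proxies=proxies)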
Challenge: Rate Limits & IP Bans
Problem: Your IP gets banned after 100-1000 requests.
Solutions:
- Use rotating residential proxies (new IP on every request)
- Implement exponential backoff on errors (see the sketch after this list)
- Distribute requests across time (don't scrape 10K pages in one hour)
- Monitor response codes and adjust speed dynamically
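A minimal sketch of exponential backoff, assuming the Requests setup from Method 1: the wait doubles after each failed attempt, with a little jitter, which eases off the target once you start seeing 403 or 429 responses.
import random
import time
import requests

def fetch_with_backoff(url, proxies=None, headers=None, max_attempts=5):
    delay = 2  # initial wait in seconds
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, proxies=proxies, headers=headers, timeout=15)
            if response.status_code == 200:
                return response
            print(f"Status {response.status_code} (attempt {attempt}/{max_attempts})")
        except requests.exceptions.RequestException as e:
            print(f"Request failed (attempt {attempt}/{max_attempts}): {e}")
        # Exponential backoff with jitter: ~2s, 4s, 8s, 16s ...
        time.sleep(delay + random.uniform(0, 1))
        delay *= 2
    return None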
Scale Your Python Scraping with netdash
67M+ residential IPs, automatic rotation, 99.9% uptime. Perfect for Python scrapers using Requests, Scrapy, or Playwright. Starting at $1.00/GB.
Frequently Asked Questions
What's the best Python library for web scraping with proxies?
For beginners: Requests + BeautifulSoup (simple, easy to learn). For production: Scrapy (built-in proxy rotation, async, scales to millions of pages). For JavaScript-heavy sites: Playwright or Selenium with proxy support. All work excellently with netdash residential proxies.
Do I need proxies for web scraping in Python?
Yes, for any serious scraping. Without proxies, you'll get IP banned after 10-100 requests on most sites. Residential proxies let you scale to millions of pages with 95%+ success rates. They're essential for e-commerce scraping, social media automation, and any protected site.
How do I avoid getting blocked while scraping with Python?
Use residential proxies (95%+ success vs 20% for datacenter), add random delays (2-5 seconds), rotate User-Agent headers, respect robots.txt, handle cookies properly, and implement retry logic. Start slow (10-100 requests) before scaling to thousands.
Can Python scrape JavaScript-rendered sites?
Standard requests library can't execute JavaScript. Use Playwright, Selenium, or Pyppeteer for JavaScript-heavy sites (SPAs, React apps). These tools support proxies and can scrape dynamic content. For API-based SPAs, reverse-engineer the API calls—it's faster than browser automation.
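To illustrate the API approach: open DevTools, watch the Network tab for XHR/fetch calls while the page loads, then request that JSON endpoint directly through your proxy. The endpoint and response keys below are purely hypothetical; real ones vary by site and may need extra headers or tokens.
import requests

proxies = {
    "http": "http://username:password@gate.netdash.io:8080",
    "https": "http://username:password@gate.netdash.io:8080",
}

# Hypothetical JSON endpoint discovered in the browser's Network tab
api_url = "https://example.com/api/products?page=1"

response = requests.get(
    api_url,
    proxies=proxies,
    headers={
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    },
    timeout=15,
)
response.raise_for_status()

# "products", "title", and "price" are placeholder keys
for item in response.json().get("products", []):
    print(item.get("title"), item.get("price"))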