Python Tutorial · 13 min read · Feb 8, 2026

Web Scraping with Python & Proxies

Complete tutorial for scraping websites with Python, requests, BeautifulSoup, Scrapy, and residential proxies. Scale to millions of pages without getting blocked.

Master Python Web Scraping in 2026

Python is the best language for web scraping, offering powerful libraries like Requests, BeautifulSoup, and Scrapy. Combined with residential proxies, you can scrape most sites at scale without getting blocked.

This guide covers everything from basic scraping to production-ready systems handling millions of pages. We'll show you practical code examples, anti-blocking strategies, and how to integrate residential proxies.

Production-Ready Code

Copy-paste examples with error handling, retries, and proxy rotation.

Bypass Anti-Bot Systems

Bypass Cloudflare, reCAPTCHA, and other major bot-detection systems with residential IPs.

Scale to Millions

Architecture patterns for scraping 1M+ pages without bans.

Setup: Install Python Libraries

Install the essential scraping libraries:

terminal
# Core scraping libraries
pip install requests beautifulsoup4 lxml

# For advanced scraping
pip install scrapy playwright

# For data processing
pip install pandas

# Verify installation
python -c "import requests; print('Requests:', requests.__version__)"

Method 1: Requests + BeautifulSoup (Beginner-Friendly)

Best for simple scraping tasks. Easy to learn, perfect for static HTML pages, APIs, and small-scale projects.

Basic Scraping with Residential Proxies

basic_scraper.py
import requests
from bs4 import BeautifulSoup
import time
import random

# netdash residential proxy configuration
PROXY = "http://username:password@gate.netdash.io:8080"

proxies = {
    "http": PROXY,
    "https": PROXY
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}

def scrape_page(url):
    """
    Scrape a single page with residential proxy
    """
    try:
        # Add human-like delay
        time.sleep(random.uniform(2, 5))
        
        response = requests.get(
            url,
            proxies=proxies,
            headers=headers,
            timeout=15
        )
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'lxml')
            
            # Extract data (example: product details)
            title = soup.find('h1', class_='product-title')
            price = soup.find('span', class_='price')
            description = soup.find('div', class_='description')
            
            return {
                'title': title.text.strip() if title else None,
                'price': price.text.strip() if price else None,
                'description': description.text.strip() if description else None,
                'url': url
            }
        else:
            print(f"Failed: {response.status_code}")
            return None
            
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Example: Scrape multiple product pages
product_urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    # ... add more URLs
]

products = []
for url in product_urls:
    data = scrape_page(url)
    if data:
        products.append(data)
        print(f"✓ Scraped: {data['title']}")

print(f"\nTotal products scraped: {len(products)}")

Advanced: With Retry Logic & Error Handling

advanced_scraper.py
import requests
from bs4 import BeautifulSoup
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class ProductScraper:
    def __init__(self, proxy_url):
        self.proxies = {
            "http": proxy_url,
            "https": proxy_url
        }
        
        # Setup session with retry strategy
        self.session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=2,
            status_forcelist=[429, 500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)
        
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
    
    def scrape(self, url, max_retries=3):
        """
        Scrape with automatic retries and error handling
        """
        for attempt in range(max_retries):
            try:
                # Random delay (2-5 seconds)
                time.sleep(random.uniform(2, 5))
                
                response = self.session.get(
                    url,
                    proxies=self.proxies,
                    timeout=15
                )
                
                if response.status_code == 200:
                    return self._parse_response(response)
                elif response.status_code == 403:
                    print(f"⚠️  Blocked (attempt {attempt + 1}/{max_retries})")
                    continue
                else:
                    print(f"Status {response.status_code}")
                    continue
                    
            except requests.exceptions.Timeout:
                print(f"Timeout (attempt {attempt + 1}/{max_retries})")
                continue
            except requests.exceptions.RequestException as e:
                print(f"Error: {e}")
                continue
        
        return None
    
    def _parse_response(self, response):
        """
        Parse HTML and extract data
        """
        soup = BeautifulSoup(response.text, 'lxml')
        
        # Your parsing logic here (guard against missing elements)
        title = soup.find('h1', class_='title')
        price = soup.find('span', class_='price')
        
        data = {
            'title': title.text.strip() if title else None,
            'price': price.text.strip() if price else None,
            # ... extract more fields
        }
        
        return data

# Usage
scraper = ProductScraper(
    proxy_url="http://username:password@gate.netdash.io:8080"
)

urls = ["https://example.com/product/1", "..."]
results = [scraper.scrape(url) for url in urls]

print(f"Scraped {len([r for r in results if r])} / {len(urls)} pages")

Method 2: Scrapy (Production-Scale Scraping)

Scrapy is Python's most powerful scraping framework: built-in asynchronous requests, proxy middleware, and the ability to scale to millions of pages. Perfect for large projects.

settings.py
# Scrapy settings with netdash residential proxies

# Enable proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Residential proxy endpoint (attached per-request via meta['proxy'] in the spider below)
PROXY = 'http://username:password@gate.netdash.io:8080'

# Respectful scraping settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 2  # 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True

# User-Agent pool (rotating it requires a downloader middleware; see the sketch below)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

# Retry settings
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Auto-throttle (adjust speed based on response)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
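
The USER_AGENT_LIST setting above is only a data source; Scrapy does not rotate it on its own. A minimal sketch of a downloader middleware that picks a random User-Agent per request (the myproject module path is an assumption for whatever your Scrapy project is called):

middlewares.py
import random

class RandomUserAgentMiddleware:
    """Assign a random User-Agent from USER_AGENT_LIST to every outgoing request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the pool defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)

Enable it by adding 'myproject.middlewares.RandomUserAgentMiddleware': 400 to DOWNLOADER_MIDDLEWARES in settings.py.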
product_spider.py
import scrapy
from scrapy.http import Request

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']
    
    # Custom settings for this spider
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 2,
    }
    
    def start_requests(self):
        """
        Add proxy to all requests
        """
        proxy = 'http://username:password@gate.netdash.io:8080'
        
        for url in self.start_urls:
            yield Request(
                url,
                callback=self.parse,
                meta={'proxy': proxy},
                dont_filter=True
            )
    
    def parse(self, response):
        """
        Parse product listing page
        """
        # Extract product links
        product_links = response.css('a.product-link::attr(href)').getall()
        
        for link in product_links:
            yield response.follow(
                link,
                callback=self.parse_product,
                meta={'proxy': response.meta['proxy']}
            )
        
        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse,
                meta={'proxy': response.meta['proxy']}
            )
    
    def parse_product(self, response):
        """
        Parse individual product page
        """
        yield {
            'title': response.css('h1.product-title::text').get(),
            'price': response.css('span.price::text').get(),
            'description': response.css('div.description::text').get(),
            'rating': response.css('span.rating::text').get(),
            'reviews': response.css('span.review-count::text').get(),
            'url': response.url,
        }

# Run spider:
# scrapy crawl products -o products.json

Best Practices for Python Web Scraping

  • Always use residential proxies for protected sites. Datacenter IPs get banned instantly; residential proxies give you 95%+ success rates.
  • Add random delays (2-5 seconds). Mimic human browsing; never scrape at maximum speed, even with proxies.
  • Rotate User-Agent headers. Cycle through realistic browser User-Agents; don't use the same one for millions of requests.
  • Handle errors gracefully. Implement retry logic with exponential backoff; don't crash on timeouts or 403s.
  • Parse HTML with lxml (fastest). BeautifulSoup with the lxml parser is about 10x faster than html.parser. Use: BeautifulSoup(html, 'lxml')
  • Store data incrementally. Don't wait until the end to save data; write to a database or file every 100-1000 pages (the sketch after this list shows incremental saving together with User-Agent rotation).
  • Respect robots.txt (legally safer). Check robots.txt and respect disallowed paths; it reduces legal risk.
  • Monitor your success rates. Track how many requests succeed; if it drops below 90%, slow down or improve your headers.
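
Two of the practices above, User-Agent rotation and incremental saving, are easy to combine in one loop. A minimal sketch, assuming hypothetical example.com URLs and the same netdash gateway as the earlier examples, that picks a random User-Agent per request and appends each result to a JSON Lines file as soon as it is scraped:

incremental_scraper.py
import json
import random
import time

import requests

PROXY = "http://username:password@gate.netdash.io:8080"
PROXIES = {"http": PROXY, "https": PROXY}

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def scrape_to_file(urls, out_path="products.jsonl"):
    saved = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            time.sleep(random.uniform(2, 5))  # human-like delay
            headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate UA per request
            try:
                resp = requests.get(url, proxies=PROXIES, headers=headers, timeout=15)
            except requests.RequestException as e:
                print(f"Error: {e}")
                continue
            if resp.status_code != 200:
                print(f"Status {resp.status_code} for {url}")
                continue
            # Write each record immediately so a crash never loses earlier pages
            out.write(json.dumps({"url": url, "html_length": len(resp.text)}) + "\n")
            out.flush()
            saved += 1
    print(f"Saved {saved} / {len(urls)} pages to {out_path}")

Appending one JSON object per line means a crash at page 900 still leaves pages 1-899 safely on disk.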

Common Challenges & Solutions

Challenge: JavaScript-Rendered Content

Problem: Many modern sites use React/Vue/Angular. The requests library only sees the initial HTML shell, before JavaScript fills in the content.

Solutions:

  • Use Playwright or Selenium (slower but handles JS); a Playwright sketch follows this list
  • Reverse-engineer the API (find XHR requests in DevTools)
  • Use Scrapy-Playwright for production-scale JS scraping
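
A minimal Playwright sketch, assuming the same netdash gateway credentials as above and a hypothetical JavaScript-rendered listing page: it loads the page in headless Chromium through the proxy, waits for client-side rendering to finish, then hands the rendered HTML to BeautifulSoup.

playwright_scraper.py
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# One-time setup: pip install playwright && playwright install chromium

with sync_playwright() as p:
    # Route the headless browser through the residential proxy
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gate.netdash.io:8080",
            "username": "username",
            "password": "password",
        },
    )
    page = browser.new_page()
    # Wait until network activity settles so client-side rendering finishes
    page.goto("https://example.com/products", wait_until="networkidle", timeout=30000)
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
titles = [el.get_text(strip=True) for el in soup.select("h2.product-title")]
print(f"Found {len(titles)} products")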

Challenge: Getting Blocked / CAPTCHAs

Problem: Site shows CAPTCHAs or returns 403 errors.

Solutions:

  • Switch to residential proxies (95%+ success vs ~20% for datacenter)
  • Slow down the request rate (add 3-5 second delays)
  • Fix your headers (add a realistic User-Agent and Accept-Language)
  • Use netdash sticky sessions for multi-step flows (sketched below)
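
For multi-step flows (login, pagination behind a session, checkout), every request should leave from the same IP. Many residential providers expose sticky sessions by encoding a session ID in the proxy username; the username format and the example.com login flow below are hypothetical placeholders, so check the netdash dashboard for the exact syntax. A minimal sketch:

sticky_session.py
import requests

# Hypothetical sticky-session username format; confirm the real syntax with your provider
STICKY_PROXY = "http://username-session-abc123:password@gate.netdash.io:8080"

session = requests.Session()
session.proxies = {"http": STICKY_PROXY, "https": STICKY_PROXY}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# Both requests reuse the same residential IP and the same cookie jar
login = session.post("https://example.com/login", data={"user": "me", "pass": "secret"})
account = session.get("https://example.com/account")
print(login.status_code, account.status_code)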

Challenge: Rate Limits & IP Bans

Problem: Your IP gets banned after 100-1000 requests.

Solutions:

  • Use rotating residential proxies (new IP on every request)
  • Implement exponential backoff on errors (see the sketch after this list)
  • Distribute requests across time (don't scrape 10K pages in one hour)
  • Monitor response codes and adjust speed dynamically
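
A minimal exponential-backoff sketch, reusing the proxies dict from the earlier examples: the wait time doubles after each failed attempt, with a little random jitter so retries don't line up.

backoff.py
import random
import time

import requests

def get_with_backoff(url, proxies, max_attempts=5):
    """Retry on bans and server errors, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, proxies=proxies, timeout=15)
            if resp.status_code not in (403, 429, 500, 502, 503, 504):
                return resp
            print(f"Status {resp.status_code}, backing off (attempt {attempt + 1})")
        except requests.RequestException as e:
            print(f"Request failed: {e} (attempt {attempt + 1})")
        # Wait 2, 4, 8, 16... seconds plus jitter before the next try
        time.sleep(2 ** (attempt + 1) + random.uniform(0, 1))
    return None

# Usage: resp = get_with_backoff("https://example.com/product/1", proxies)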

Scale Your Python Scraping with netdash

67M+ residential IPs, automatic rotation, 99.9% uptime. Perfect for Python scrapers using Requests, Scrapy, or Playwright. Starting at $1.00/GB.

Frequently Asked Questions

What's the best Python library for web scraping with proxies?

For beginners: Requests + BeautifulSoup (simple, easy to learn). For production: Scrapy (built-in proxy rotation, async, scales to millions of pages). For JavaScript-heavy sites: Playwright or Selenium with proxy support. All work excellently with netdash residential proxies.

Do I need proxies for web scraping in Python?

Yes, for any serious scraping. Without proxies, you'll get IP banned after 10-100 requests on most sites. Residential proxies let you scale to millions of pages with 95%+ success rates. They're essential for e-commerce scraping, social media automation, and any protected site.

How do I avoid getting blocked while scraping with Python?

Use residential proxies (95%+ success vs 20% for datacenter), add random delays (2-5 seconds), rotate User-Agent headers, respect robots.txt, handle cookies properly, and implement retry logic. Start slow (10-100 requests) before scaling to thousands.

Can Python scrape JavaScript-rendered sites?

Standard requests library can't execute JavaScript. Use Playwright, Selenium, or Pyppeteer for JavaScript-heavy sites (SPAs, React apps). These tools support proxies and can scrape dynamic content. For API-based SPAs, reverse-engineer the API calls—it's faster than browser automation.
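
To illustrate the API approach: once you find the JSON endpoint in the DevTools Network tab, you can call it directly with requests through the proxy. The endpoint, query parameters, and response shape below are hypothetical placeholders for whatever the target site actually uses.

api_scraper.py
import requests

PROXY = "http://username:password@gate.netdash.io:8080"
proxies = {"http": PROXY, "https": PROXY}

# Hypothetical JSON endpoint discovered in the browser's Network tab
resp = requests.get(
    "https://example.com/api/products",
    params={"page": 1, "per_page": 50},
    proxies=proxies,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "application/json",
    },
    timeout=15,
)
resp.raise_for_status()

# Iterate over the structured JSON instead of parsing rendered HTML
for product in resp.json().get("products", []):
    print(product.get("title"), product.get("price"))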