Issue #1022
Web scraping used to mean writing brittle scripts that broke whenever a site changed. Now AI agents can browse the web like humans do. They read pages, click buttons, fill forms, and extract data without you writing CSS selectors for every element.
This guide covers tools that make AI-powered browser automation possible. Some are low-level SDKs. Others handle everything from navigation to data extraction. Pick the right one based on how much control you need.
The Foundation: Playwright and Puppeteer
Before AI enters the picture, you need something to control the browser. That’s where Playwright and Puppeteer come in.
Puppeteer is Google’s Node.js library for Chrome automation. It talks directly to Chrome through the DevTools Protocol. Simple and fast for Chrome-only tasks.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
})();
Playwright does the same thing but works across Chromium, Firefox, and WebKit (the engine behind Safari). Microsoft built it with better auto-waiting, so you spend less time writing waitForSelector calls.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    page.screenshot(path='screenshot.png')
    browser.close()
Neither tool has AI built in. They’re the hands. AI provides the brain.
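The division of labor is worth seeing concretely. Here is a toy sketch of the loop every AI browser agent runs: a decision function (standing in for an LLM call) reads the page state and picks the next action, while a page object (standing in for Playwright or Puppeteer) executes it. Both `decide` and `FakePage` are hypothetical names for illustration, not part of any library.

```python
def decide(goal: str, page_text: str) -> dict:
    """Stand-in for an LLM call: pick the next action from the page state."""
    if "Accept cookies" in page_text:
        return {"action": "click", "target": "Accept cookies"}
    if goal.lower() in page_text.lower():
        return {"action": "extract", "target": goal}
    return {"action": "done"}

class FakePage:
    """Stand-in for a real browser page object (Playwright/Puppeteer)."""
    def __init__(self):
        self.text = "Accept cookies | MacBook Pro $1,999"
        self.clicked = []

    def click(self, target: str):
        self.clicked.append(target)
        # Clicking the banner removes it from the visible page text
        self.text = self.text.replace(f"{target} | ", "")

page = FakePage()
goal = "MacBook Pro"
actions = []
for _ in range(5):  # real agents cap the loop too
    step = decide(goal, page.text)
    actions.append(step["action"])
    if step["action"] == "click":
        page.click(step["target"])
    else:
        break

print(actions)  # the brain dismissed the banner, then found the product
```

The tools in the rest of this guide differ mainly in where they draw this line: some hand you the loop, others run it for you.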
Browser-Use: Let AI Control the Browser
Browser-Use is a Python library that connects language models to browser automation. You describe a task in plain English, and the AI figures out what to click and type.
Install it with Python 3.11 or higher:
uv init && uv add browser-use && uv sync
Here’s a basic example. The agent finds a product’s price without you writing any selectors:
from browser_use import Agent, Browser, ChatBrowserUse
import asyncio

async def main():
    browser = Browser()
    agent = Agent(
        task="Find the price of the MacBook Pro on Apple's website",
        llm=ChatBrowserUse(),
        browser=browser,
    )
    result = await agent.run()
    print(result)

asyncio.run(main())
The library works with multiple LLM providers. Use ChatBrowserUse() for their optimized model, or swap in Claude, Gemini, or local models through Ollama. The AI sees the page, identifies interactive elements, and decides what actions to take.
Browser-Use shines when the task is clear but the path isn’t. Search for a product, compare prices across tabs, fill out a multi-step form. You describe the goal, and the agent handles navigation.
Browserbase: Cloud Browsers for Scale
Running browsers locally works for development. Production needs something else. Browserbase provides cloud browser infrastructure that handles scaling, proxies, and captchas.
Connect your existing Playwright code to Browserbase with minimal changes:
from playwright.sync_api import sync_playwright
from browserbase import Browserbase
import os

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

with sync_playwright() as playwright:
    session = bb.sessions.create()
    browser = playwright.chromium.connect_over_cdp(session.connect_url)
    page = browser.contexts[0].pages[0]
    page.goto('https://example.com')
    print(page.title())
    browser.close()
The same code that runs locally now runs on Browserbase’s servers. They handle proxy rotation, fingerprint randomization, and captcha solving. Your scripts become more reliable without extra code.
Browserbase also offers Stagehand, their AI automation framework. It sits between raw Playwright and full AI agents:
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "BROWSERBASE",
  model: {
    modelName: "google/gemini-3-flash-preview",
    apiKey: process.env.MODEL_API_KEY,
  },
});

await stagehand.init();
const page = stagehand.context.pages()[0];
await page.goto("https://news.ycombinator.com");

// AI handles the clicking
await stagehand.act("click on the comments link for the top story");

// Extract structured data
const data = await stagehand.extract("extract the title and points of the top story");
console.log(data);

await stagehand.close();
Use code when you know exactly what to do. Use natural language when the page structure might vary.
Firecrawl: Web Data for LLMs
Firecrawl takes a different approach. Instead of controlling a browser, it converts websites into clean data that AI can process.
Send a URL, get back markdown or structured JSON:
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
doc = app.scrape("https://example.com", formats=["markdown"])
print(doc["markdown"])
The real power is structured extraction. Define a schema, and Firecrawl pulls exactly the data you need:
from pydantic import BaseModel
from firecrawl import Firecrawl

class CompanyInfo(BaseModel):
    name: str
    description: str
    is_hiring: bool

app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.scrape(
    "https://stripe.com",
    formats=[{"type": "json", "schema": CompanyInfo.model_json_schema()}]
)
print(result)
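A useful follow-up is to round-trip the extracted JSON back through the same Pydantic model, so malformed or missing fields fail loudly before the data reaches the rest of your pipeline. The sample dict below is illustrative, not real Firecrawl output:

```python
from pydantic import BaseModel

class CompanyInfo(BaseModel):
    name: str
    description: str
    is_hiring: bool

# Pretend this dict came back from the scrape's JSON format
raw = {"name": "Stripe", "description": "Payments infrastructure", "is_hiring": True}

# Raises pydantic.ValidationError if a field is missing or the wrong type
info = CompanyInfo.model_validate(raw)
print(info.name, info.is_hiring)
```

One model now does double duty: it generates the extraction schema going out and validates the payload coming back.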
Firecrawl also has an agent mode. Describe what data you want, and it searches and navigates to find it:
result = app.agent(prompt="Find the pricing for OpenAI's GPT-4 API")
print(result.data)
No URLs required. The AI figures out where to look.
agent-browser: CLI-First Browser Automation for AI Agents
Every tool covered so far works as a library you import into your code. agent-browser from Vercel Labs takes the opposite approach. It’s a CLI tool — a Rust daemon that persists between calls so your agent drives the browser through shell commands, not function calls.
Install it once and it’s available to any script, any language, any agent:
npm install -g agent-browser
agent-browser install # downloads Chrome for Testing
Or via Homebrew on macOS:
brew install agent-browser
agent-browser install
The Accessibility Tree, Not the DOM
Raw HTML is noise. Pixel coordinates break when layouts change. agent-browser’s snapshot command returns the accessibility tree — a structured view of what’s actually on the page, with stable @ref identifiers the AI can act on:
agent-browser open https://news.ycombinator.com
agent-browser snapshot # returns tree with refs like @e1, @e2...
agent-browser click @e5 # click by ref, no selectors needed
agent-browser get text @e5
This is how screen readers interpret pages. It’s semantic, stable, and immune to CSS refactors.
When refs aren’t reliable across page loads, semantic finders resolve elements by meaning instead:
agent-browser find role button click --name "Submit"
agent-browser find text "Sign In" click
agent-browser find label "Email" fill "test@example.com"
Batch Mode for Multi-Step Workflows
The biggest cost in shell-based automation is process startup. agent-browser solves this by running as a persistent daemon. You can also pipeline an entire sequence as a JSON array in a single call:
echo '[
["open", "https://example.com"],
["snapshot"],
["fill", "#search", "query"],
["press", "Enter"],
["screenshot", "result.png"]
]' | agent-browser batch --json
One invocation. No repeated startup overhead. The --bail flag stops on the first error if you need strict sequencing.
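If your orchestration layer is Python rather than shell, the same batch payload can be built with the standard library instead of a heredoc. The step names mirror the shell example above; `json.dumps` produces the array that `agent-browser batch --json` reads on stdin:

```python
import json

# Each step is a command array, matching the shell batch example
steps = [
    ["open", "https://example.com"],
    ["snapshot"],
    ["fill", "#search", "query"],
    ["press", "Enter"],
    ["screenshot", "result.png"],
]

payload = json.dumps(steps)
print(payload)
# To execute (requires the CLI to be installed):
# subprocess.run(["agent-browser", "batch", "--json"], input=payload, text=True)
```

Generating the payload programmatically also makes it easy for an agent to assemble steps dynamically before firing them off in one invocation.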
When to Use It
agent-browser fits pipelines where the AI is deciding what to do next and issuing commands one at a time — the pattern you get with Claude’s computer use, MCP tool calls, or any agent that treats the browser as a tool rather than an environment. The CLI interface means the orchestration layer can be written in anything: Python, Go, a shell script, or another agent.
It’s more hands-on than Browser-Use, which manages the agent loop for you. But that control is exactly what you want when the AI is making the decisions and you need fine-grained access to the page state between each step.
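As a sketch of what that orchestration layer can look like in Python: compose `agent-browser` invocations as argv lists and hand them to `subprocess`. The `cmd` and `run` helpers are hypothetical names for illustration; actually executing them requires the CLI to be installed:

```python
import subprocess

def cmd(*args: str) -> list[str]:
    """Build an agent-browser invocation as an argv list."""
    return ["agent-browser", *args]

def run(*args: str) -> str:
    """Execute one command and return its stdout (requires the CLI)."""
    result = subprocess.run(cmd(*args), capture_output=True, text=True, check=True)
    return result.stdout

# Commands composed (but not executed) here, mirroring the examples above:
open_cmd = cmd("open", "https://news.ycombinator.com")
find_cmd = cmd("find", "role", "button", "click", "--name", "Submit")
print(open_cmd)
```

Because each call is just an argv list, the same pattern ports directly to Go's `exec.Command`, a shell script, or whatever language the rest of your agent lives in.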