PageAgentExt Architecture

This document describes the architecture of the Chrome extension version of PageAgent, including environment definitions, communication protocols, and extension considerations.

Environment Definitions

The extension operates across three isolated JavaScript contexts:

1. Background (Service Worker)

File: src/entrypoints/background.ts

Responsibilities:

  • Hosts the headless PageAgentCore instance
  • Manages agent lifecycle (create, execute, stop, dispose)
  • Stores LLM configuration in chrome.storage.local
  • Receives commands from SidePanel via messaging
  • Broadcasts events to SidePanel for UI updates
  • Uses RemotePageController to proxy DOM operations to ContentScript

Key Components:

  • PageAgentCore - The AI agent (from @page-agent/core)
  • RemotePageController - Proxy that forwards calls to ContentScript
  • Command handlers for agent:execute, agent:stop, agent:configure

2. Content Script

File: src/entrypoints/content.ts

Responsibilities:

  • Runs in the context of web pages
  • Hosts the real PageController instance
  • Performs actual DOM operations (click, input, scroll, etc.)
  • Responds to RPC messages from Background
  • Manages visual mask overlay during automation

Key Components:

  • PageController - DOM controller (from @page-agent/page-controller)
  • RPC handlers for all PageController methods
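
As a rough illustration of how the RPC handlers might map onto PageController, the sketch below builds a handler table keyed by message name. `PageControllerLike`, `ActionResult`'s fields, `createRpcHandlers`, and the stub controller are hypothetical stand-ins; the real PageController from @page-agent/page-controller and the actual handler registration are not shown in this document.

```typescript
// Hypothetical result shape and controller subset, for illustration only.
interface ActionResult { success: boolean; message: string }

interface PageControllerLike {
  getCurrentUrl(): string
  clickElement(index: number): ActionResult
  inputText(index: number, text: string): ActionResult
}

// One handler per RPC message name; each handler only delegates to the controller.
function createRpcHandlers(controller: PageControllerLike) {
  return {
    'rpc:getCurrentUrl': () => controller.getCurrentUrl(),
    'rpc:clickElement': (index: number) => controller.clickElement(index),
    'rpc:inputText': (data: { index: number; text: string }) =>
      controller.inputText(data.index, data.text),
  }
}

// Stub controller so the sketch runs outside a browser.
const stubController: PageControllerLike = {
  getCurrentUrl: () => 'https://example.com/',
  clickElement: (index) => ({ success: true, message: `clicked ${index}` }),
  inputText: (index, text) => ({ success: true, message: `typed "${text}" into ${index}` }),
}

const handlers = createRpcHandlers(stubController)
const clickResult = handlers['rpc:clickElement'](2)
```

Keeping the handlers as thin delegates means the Content Script holds no agent logic of its own; everything it does is driven by messages from Background.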

3. Side Panel (React UI)

Files: src/entrypoints/sidepanel/

Responsibilities:

  • Provides user interface for controlling the agent
  • Displays task input and execution history
  • Shows real-time agent activity (thinking, executing, etc.)
  • Manages LLM configuration settings
  • Sends commands to Background and receives event updates

Key Components:

  • App.tsx - Main React component with chat-style UI
  • ConfigPanel - Settings form for LLM configuration
  • Event subscription for real-time updates

Communication Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Side Panel                              │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │  Task Input  │  │ Event Stream │  │   History Display     │  │
│  └──────┬───────┘  └──────▲───────┘  └───────────────────────┘  │
└─────────┼─────────────────┼─────────────────────────────────────┘
          │ Commands        │ Events
          ▼                 │
┌─────────────────────────────────────────────────────────────────┐
│                        Background                               │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    PageAgentCore                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐  │   │
│  │  │     LLM     │  │    Tools    │  │ RemotePageCtrl   │  │   │
│  │  └─────────────┘  └─────────────┘  └────────┬─────────┘  │   │
│  └───────────────────────────────────────────────┼───────────┘   │
└───────────────────────────────────────────────────┼──────────────┘
                                                    │ RPC
                                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Content Script                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    PageController                         │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐  │   │
│  │  │  DOM Tree   │  │   Actions   │  │      Mask        │  │   │
│  │  └─────────────┘  └─────────────┘  └──────────────────┘  │   │
│  └───────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                            ┌───────────────┐
                            │   Web Page    │
                            │     DOM       │
                            └───────────────┘

Message Protocol

All cross-context communication uses @webext-core/messaging for type safety.

Protocol Definition

File: src/messaging/protocol.ts

1. RPC Protocol (Background → ContentScript)

Used by RemotePageController to call PageController methods.

interface PageControllerRPCProtocol {
  // State queries
  'rpc:getCurrentUrl': () => string
  'rpc:getLastUpdateTime': () => number
  'rpc:getBrowserState': () => BrowserState

  // DOM operations
  'rpc:updateTree': () => string
  'rpc:cleanUpHighlights': () => void

  // Element actions
  'rpc:clickElement': (index: number) => ActionResult
  'rpc:inputText': (data: { index: number; text: string }) => ActionResult
  'rpc:selectOption': (data: { index: number; optionText: string }) => ActionResult
  'rpc:scroll': (options: ScrollOptions) => ActionResult
  'rpc:scrollHorizontally': (options: ScrollHorizontallyOptions) => ActionResult
  'rpc:executeJavascript': (script: string) => ActionResult

  // Mask operations
  'rpc:showMask': () => void
  'rpc:hideMask': () => void

  // Lifecycle
  'rpc:dispose': () => void
}
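
The proxy role of RemotePageController can be sketched as follows, using a trimmed copy of the protocol map. The `Send` transport type, the class name `RemotePageControllerSketch`, the `ActionResult` fields, and the recording fake are assumptions made for illustration; the real class would send through @webext-core/messaging and would also need to target a specific tab.

```typescript
// Trimmed copy of the protocol map; ActionResult's fields are assumed.
interface ActionResult { success: boolean; message: string }

interface RpcProtocol {
  'rpc:getCurrentUrl': () => string
  'rpc:clickElement': (index: number) => ActionResult
  'rpc:inputText': (data: { index: number; text: string }) => ActionResult
}

// Hypothetical transport: sends one typed message and resolves with the reply.
type Send = <K extends keyof RpcProtocol>(
  name: K,
  data?: Parameters<RpcProtocol[K]>[0],
) => Promise<ReturnType<RpcProtocol[K]>>

// Same method names as PageController; every call becomes one message.
class RemotePageControllerSketch {
  private readonly send: Send
  constructor(send: Send) { this.send = send }

  getCurrentUrl() { return this.send('rpc:getCurrentUrl') }
  clickElement(index: number) { return this.send('rpc:clickElement', index) }
  inputText(index: number, text: string) { return this.send('rpc:inputText', { index, text }) }
}

// Fake transport that records outgoing messages; replies are not modelled here.
const sent: Array<{ name: string; data: unknown }> = []
const fakeSend: Send = (name, data) => {
  sent.push({ name, data })
  return Promise.resolve(undefined as never)
}

const remote = new RemotePageControllerSketch(fakeSend)
void remote.clickElement(3)
void remote.inputText(0, 'hello')
```

Because the message name fixes both the payload type and the reply type, a call such as `send('rpc:clickElement', 'three')` would be rejected at compile time.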

2. Command Protocol (SidePanel → Background)

Used by SidePanel UI to control the agent.

interface AgentCommandProtocol {
  'agent:execute': (task: string) => void
  'agent:stop': () => void
  'agent:getState': () => AgentState
  'agent:configure': (config: LLMConfig) => void
}
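
A minimal sketch of how Background might back these commands with in-memory state is shown below. The `'running'` status value, the fields of `LLMConfig` and `AgentState`, and the `createBackground` helper are assumptions; the real handlers also create and dispose a PageAgentCore instance, which is omitted here.

```typescript
// Hypothetical shapes; the real AgentState and LLMConfig are defined elsewhere.
type AgentStatus = 'idle' | 'running' | 'completed' | 'error'
interface LLMConfig { baseURL: string; apiKey: string; model: string }
interface AgentState { status: AgentStatus; task: string | null }

// Stand-in for Background's mutable state plus one handler per command name.
function createBackground() {
  let config: LLMConfig | null = null
  const state: AgentState = { status: 'idle', task: null }

  return {
    'agent:configure': (next: LLMConfig): void => { config = next },
    'agent:execute': (task: string): void => {
      // Refuse to start without credentials instead of failing mid-task.
      if (config === null) throw new Error('LLM is not configured')
      state.status = 'running'
      state.task = task
    },
    'agent:stop': (): void => { state.status = 'idle' },
    // Return a copy so callers cannot mutate Background's state.
    'agent:getState': (): AgentState => ({ ...state }),
  }
}

const background = createBackground()
background['agent:configure']({ baseURL: 'https://llm.example/v1', apiKey: 'sk-test', model: 'demo-model' })
background['agent:execute']('Fill in the signup form')
const snapshot = background['agent:getState']()
```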

3. Event Protocol (Background → SidePanel)

Used by Background to push updates to SidePanel.

interface AgentEventProtocol {
  'event:status': (status: AgentStatus) => void
  'event:history': (history: HistoricalEvent[]) => void
  'event:activity': (activity: AgentActivity) => void
  'event:stateSnapshot': (state: AgentState) => void
}
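
On the receiving side, the SidePanel could fold these events into its UI state with a pure reducer, sketched below. The field names of `HistoricalEvent` and `AgentActivity`, the `'running'` status, and the rule that clears the activity indicator are assumptions; `event:stateSnapshot` is left out because the shape of AgentState is not given here.

```typescript
// Hypothetical payload shapes for illustration.
type AgentStatus = 'idle' | 'running' | 'completed' | 'error'
interface HistoricalEvent { type: string; content: string }
interface AgentActivity { type: string }

interface PanelState {
  status: AgentStatus
  history: HistoricalEvent[]
  activity: AgentActivity | null
}

type PanelEvent =
  | { name: 'event:status'; data: AgentStatus }
  | { name: 'event:history'; data: HistoricalEvent[] }
  | { name: 'event:activity'; data: AgentActivity }

// Pure reducer: returns a new state object for each incoming event.
function applyEvent(state: PanelState, event: PanelEvent): PanelState {
  switch (event.name) {
    case 'event:status':
      // Clear the transient activity indicator once the agent is no longer running.
      return {
        ...state,
        status: event.data,
        activity: event.data === 'running' ? state.activity : null,
      }
    case 'event:history':
      return { ...state, history: event.data }
    case 'event:activity':
      return { ...state, activity: event.data }
  }
}

let panel: PanelState = { status: 'idle', history: [], activity: null }
panel = applyEvent(panel, { name: 'event:status', data: 'running' })
panel = applyEvent(panel, { name: 'event:activity', data: { type: 'thinking' } })
const whileRunning = panel
panel = applyEvent(panel, { name: 'event:history', data: [{ type: 'step', content: 'clicked 2' }] })
panel = applyEvent(panel, { name: 'event:status', data: 'completed' })
```

A reducer of this kind fits React's `useReducer`, with the event subscription dispatching each message as it arrives.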

Communication Flow

Task Execution Flow

1. User enters task in SidePanel
   └─> SidePanel sends 'agent:execute' command

2. Background receives command
   ├─> Creates PageAgentCore with RemotePageController
   └─> Starts task execution

3. Agent executes step loop:
   ├─> LLM generates next action
   ├─> Agent calls RemotePageController method
   │   └─> RPC message sent to ContentScript
   ├─> ContentScript executes on real PageController
   │   └─> RPC response returned
   ├─> Agent updates history
   └─> Background broadcasts events to SidePanel

4. SidePanel receives events
   └─> Updates UI (status, history, activity)

5. Task completes or user stops
   └─> Agent is disposed; status changes to idle, completed, or error
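
The step loop in steps 3 to 5 can be approximated by the sketch below. It is deliberately synchronous for brevity, whereas the real LLM and RPC calls are asynchronous. The `Action` union, the `StepDeps` interface, and the `maxSteps` budget are assumptions and are not taken from PageAgentCore.

```typescript
// Hypothetical action shape chosen by the LLM.
type Action = { kind: 'click'; index: number } | { kind: 'done'; summary: string }
type FinalStatus = 'completed' | 'error'

interface StepDeps {
  nextAction(history: string[]): Action        // stands in for the LLM call
  click(index: number): string                 // stands in for RemotePageController
  emit(event: string, payload: unknown): void  // stands in for event broadcasting
}

function runTask(deps: StepDeps, maxSteps: number): FinalStatus {
  const history: string[] = []
  deps.emit('event:status', 'running')
  for (let step = 0; step < maxSteps; step++) {
    const action = deps.nextAction(history)
    if (action.kind === 'done') {
      deps.emit('event:status', 'completed')
      return 'completed'
    }
    // Execute on the page, record the outcome, then notify the UI.
    history.push(deps.click(action.index))
    deps.emit('event:history', [...history])
  }
  // Step budget exhausted without the LLM declaring the task done.
  deps.emit('event:status', 'error')
  return 'error'
}

// Scripted run: one click, then done.
const script: Action[] = [{ kind: 'click', index: 4 }, { kind: 'done', summary: 'ok' }]
const events: string[] = []
let cursor = 0
const finalStatus = runTask({
  nextAction: () => script[cursor++],
  click: (index) => `clicked ${index}`,
  emit: (event) => { events.push(event) },
}, 10)
```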

Configuration Flow

1. User opens Settings in SidePanel
2. User enters API credentials
3. SidePanel sends 'agent:configure' command
4. Background saves config to chrome.storage.local
5. Next agent creation uses new config
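
Steps 4 and 5 might look like the following sketch. `StorageLike` mirrors the promise-based `get`/`set` form of chrome.storage.local, but the in-memory implementation, the `llmConfig` key, and the `LLMConfig` fields are assumptions made so the example runs outside the extension.

```typescript
// Hypothetical config fields.
interface LLMConfig { baseURL: string; apiKey: string; model: string }

// Minimal async key-value interface, shaped like the promise form of chrome.storage.local.
interface StorageLike {
  get(key: string): Promise<Record<string, unknown>>
  set(items: Record<string, unknown>): Promise<void>
}

// In-memory stand-in so the sketch runs without the chrome.* APIs.
function createMemoryStorage(): StorageLike & { data: Record<string, unknown> } {
  const data: Record<string, unknown> = {}
  return {
    data,
    get: (key: string): Promise<Record<string, unknown>> => {
      const found: Record<string, unknown> = {}
      if (key in data) found[key] = data[key]
      return Promise.resolve(found)
    },
    set: (items: Record<string, unknown>): Promise<void> => {
      Object.assign(data, items)
      return Promise.resolve()
    },
  }
}

const CONFIG_KEY = 'llmConfig' // hypothetical storage key

// Step 4: persist the config received via 'agent:configure'.
function saveConfig(storage: StorageLike, config: LLMConfig): Promise<void> {
  return storage.set({ [CONFIG_KEY]: config })
}

// Step 5: read the latest config each time an agent is created.
async function loadConfig(storage: StorageLike): Promise<LLMConfig | null> {
  const items = await storage.get(CONFIG_KEY)
  return (items[CONFIG_KEY] as LLMConfig | undefined) ?? null
}

const memory = createMemoryStorage()
void saveConfig(memory, { baseURL: 'https://llm.example/v1', apiKey: 'sk-test', model: 'demo-model' })
```

Reading the config at agent creation, rather than caching it in a module variable, also survives a service worker restart between configuration and execution.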

File Structure

packages/extension/src/
├── agent/
│   └── RemotePageController.ts    # Proxy for PageController
├── entrypoints/
│   ├── background.ts              # Service worker
│   ├── content.ts                 # Content script
│   └── sidepanel/
│       ├── index.html
│       ├── main.tsx
│       └── App.tsx                # Main UI component
├── messaging/
│   ├── protocol.ts                # Message type definitions
│   ├── rpc.ts                     # RPC client for PageController
│   ├── events.ts                  # Event broadcasting utilities
│   └── index.ts                   # Module exports
├── components/ui/                 # shadcn components
├── lib/utils.ts                   # Utility functions
└── assets/index.css               # Tailwind styles

Extension Considerations

Current Limitations (v1)

  1. Single page control only - Agent controls the active tab where SidePanel was opened
  2. No cross-tab navigation - Cannot follow links that open in new tabs
  3. Session-based - Agent state is not persisted across extension restarts

Future Extension Points

Multi-tab Control

To support controlling multiple tabs:

  1. Add tabId parameter to RPC messages
  2. Track tab-to-controller mapping in Background
  3. Allow SidePanel to switch between controlled tabs
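
One possible shape for the tab-to-controller mapping in steps 2 and 3 is sketched below. `TabController` and `TabControllerRegistry` are hypothetical names; in the extension, the per-tab object would presumably be a RemotePageController bound to a tabId.

```typescript
// Hypothetical per-tab proxy.
interface TabController { tabId: number; dispose(): void }

class TabControllerRegistry {
  private readonly controllers = new Map<number, TabController>()
  private activeTabId: number | null = null

  // Step 2: keep one controller per tab, created lazily.
  getOrCreate(tabId: number, create: (tabId: number) => TabController): TabController {
    let controller = this.controllers.get(tabId)
    if (controller === undefined) {
      controller = create(tabId)
      this.controllers.set(tabId, controller)
    }
    return controller
  }

  // Step 3: let the SidePanel choose which controlled tab the agent targets.
  switchTo(tabId: number): void {
    if (!this.controllers.has(tabId)) throw new Error(`Tab ${tabId} is not controlled`)
    this.activeTabId = tabId
  }

  get active(): TabController | null {
    return this.activeTabId === null ? null : this.controllers.get(this.activeTabId) ?? null
  }

  // Drop the mapping when a tab closes so stale controllers are not reused.
  remove(tabId: number): void {
    this.controllers.get(tabId)?.dispose()
    this.controllers.delete(tabId)
    if (this.activeTabId === tabId) this.activeTabId = null
  }
}

const disposed: number[] = []
const makeController = (tabId: number): TabController => ({
  tabId,
  dispose: () => { disposed.push(tabId) },
})

const registry = new TabControllerRegistry()
const first = registry.getOrCreate(7, makeController)
const again = registry.getOrCreate(7, makeController)
registry.getOrCreate(9, makeController)
registry.switchTo(9)
```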

Persistent Sessions

To persist agent sessions:

  1. Store session state in chrome.storage.local
  2. Restore agent on extension startup
  3. Handle service worker restarts gracefully
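
A hedged sketch of the serialize and restore steps follows. The `SessionSnapshot` shape, the version field, and the choice to restore an interrupted `'running'` task as `'idle'` are all assumptions; writing the string to chrome.storage.local is left out so the example stays self-contained.

```typescript
// Hypothetical persisted shape; the real agent state may hold more than this.
interface SessionSnapshot {
  version: 1
  task: string
  history: string[]
  status: 'running' | 'idle' | 'completed' | 'error'
}

function serializeSession(
  task: string,
  history: string[],
  status: SessionSnapshot['status'],
): string {
  const snapshot: SessionSnapshot = { version: 1, task, history, status }
  return JSON.stringify(snapshot)
}

// Returns null for missing, corrupt, or incompatible data so startup never throws.
function restoreSession(raw: string | undefined): SessionSnapshot | null {
  if (raw === undefined) return null
  try {
    const parsed = JSON.parse(raw) as Partial<SessionSnapshot>
    if (parsed.version !== 1 || typeof parsed.task !== 'string' || !Array.isArray(parsed.history)) {
      return null
    }
    // A task that was running when the worker stopped cannot simply continue; mark it idle.
    const status = parsed.status === 'running' || parsed.status === undefined ? 'idle' : parsed.status
    return { version: 1, task: parsed.task, history: parsed.history, status }
  } catch {
    return null
  }
}

const saved = serializeSession('Book a flight', ['clicked 2'], 'running')
const restored = restoreSession(saved)
```

Versioning the snapshot lets a later release reject or migrate data written by an older one instead of misreading it.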

Cross-tab Navigation

To follow links in new tabs:

  1. Listen to chrome.tabs.onCreated events
  2. Inject content script into new tabs
  3. Transfer control to new tab when navigation occurs
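
The decision in step 3 can be isolated as a pure function, sketched below. `TabLike` is a hand-written subset of `chrome.tabs.Tab` (whose `openerTabId` identifies the tab that opened it), and `nextControlledTab` is a hypothetical helper. The comments show how it might be wired to the real events; that wiring is an assumption, not the extension's code.

```typescript
// Subset of chrome.tabs.Tab used here.
interface TabLike { id?: number; openerTabId?: number }

// Decide which tab the agent should control after a new tab is created.
// Control moves only when the new tab was opened by the currently controlled tab.
function nextControlledTab(controlledTabId: number, created: TabLike): number {
  const openedByControlledTab = created.openerTabId === controlledTabId
  return openedByControlledTab && created.id !== undefined ? created.id : controlledTabId
}

// Possible wiring in Background (not executed here):
//   chrome.tabs.onCreated.addListener((tab) => {
//     controlledTabId = nextControlledTab(controlledTabId, tab)
//   })
// Injecting the content script into the new tab would be a separate step,
// for example through chrome.scripting.executeScript with the new tab as target.

const followed = nextControlledTab(5, { id: 8, openerTabId: 5 })
const ignored = nextControlledTab(5, { id: 9, openerTabId: 3 })
```

Filtering on the opener avoids hijacking tabs that the user opens independently while a task is running.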

Screenshot/Vision Support

To add visual context for the agent:

  1. Use chrome.tabs.captureVisibleTab for screenshots
  2. Send images to vision-capable LLM models
  3. Add screenshot tool to agent toolkit
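
`chrome.tabs.captureVisibleTab` resolves to a data URL, so step 2 usually starts by splitting that string into a MIME type and a base64 payload. The helper below is a sketch of that step only; `ImagePart` and `parseImageDataUrl` are hypothetical names, and how the payload is attached to an LLM request depends on the provider and is not shown.

```typescript
interface ImagePart { mimeType: string; base64: string }

// Split a data URL such as "data:image/png;base64,AAAA" into MIME type and payload.
function parseImageDataUrl(dataUrl: string): ImagePart {
  const prefix = 'data:'
  const marker = ';base64,'
  const markerAt = dataUrl.indexOf(marker)
  if (!dataUrl.startsWith(prefix) || markerAt === -1) {
    throw new Error('Expected a base64-encoded data URL')
  }
  const mimeType = dataUrl.slice(prefix.length, markerAt)
  if (!mimeType.startsWith('image/')) {
    throw new Error(`Expected an image, got "${mimeType}"`)
  }
  return { mimeType, base64: dataUrl.slice(markerAt + marker.length) }
}

// In the extension, the input would come from something like:
//   const dataUrl = await chrome.tabs.captureVisibleTab({ format: 'png' })
const part = parseImageDataUrl('data:image/png;base64,AAAA')
```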

Security Considerations

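  1. API Key Storage - Keys are stored in chrome.storage.local, which web pages cannot read. The storage area is not encrypted and is exposed to content scripts by default, so it should not be treated as a secure vault
  2. Content Script Isolation - Runs in an isolated world: its JavaScript variables and functions are not accessible to page scripts, although both share the same DOM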
  3. Message Validation - Only trusted extension contexts can send/receive messages
  4. Permission Scope - Request minimal permissions needed for functionality
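
Message validation (item 3) can be made explicit by checking the sender before handling a message, as in the sketch below. `SenderLike` is a hand-written subset of `chrome.runtime.MessageSender`, and `isTrustedSender` is a hypothetical helper. The assumption that the side panel sends messages without a `tab` property should be verified against the actual extension.

```typescript
// Subset of chrome.runtime.MessageSender relevant to validation.
interface SenderLike { id?: string; url?: string; tab?: { id?: number } }

// Accept a message only if it comes from this extension. With requireExtensionPage,
// also reject senders that run inside a tab, such as content scripts.
function isTrustedSender(
  sender: SenderLike,
  ownExtensionId: string,
  requireExtensionPage = false,
): boolean {
  if (sender.id !== ownExtensionId) return false
  if (requireExtensionPage && sender.tab !== undefined) return false
  return true
}

// In Background, ownExtensionId would be chrome.runtime.id.
const OWN_ID = 'abcdefghijklmnopabcdefghijklmnop' // placeholder extension ID
const fromSidePanel = isTrustedSender({ id: OWN_ID }, OWN_ID, true)
const fromContentScript = isTrustedSender({ id: OWN_ID, tab: { id: 3 } }, OWN_ID, true)
const fromOtherExtension = isTrustedSender({ id: 'someotherextensionid' }, OWN_ID)
```

Agent commands such as agent:configure carry credentials, so restricting them to extension pages is a reasonable default, while RPC replies from content scripts only need the extension ID check.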

Development

# Install dependencies
npm install

# Start development server
npm run dev

# Build for production
npm run build

# Package extension
npm run zip