# PageAgentExt Architecture

This document describes the architecture of the Chrome extension version of PageAgent, including environment definitions, communication protocols, and extension considerations.

## Environment Definitions

The extension operates across three isolated JavaScript contexts:

### 1. Background (Service Worker)

**File:** `src/entrypoints/background.ts`

**Responsibilities:**

- Hosts the headless `PageAgentCore` instance
- Manages agent lifecycle (create, execute, stop, dispose)
- Stores LLM configuration in `chrome.storage.local`
- Receives commands from SidePanel via messaging
- Broadcasts events to SidePanel for UI updates
- Uses `RemotePageController` to proxy DOM operations to ContentScript

**Key Components:**

- `PageAgentCore` - The AI agent (from `@page-agent/core`)
- `RemotePageController` - Proxy that forwards calls to ContentScript
- Command handlers for `agent:execute`, `agent:stop`, `agent:configure`

### 2. Content Script

**File:** `src/entrypoints/content.ts`

**Responsibilities:**

- Runs in the context of web pages
- Hosts the real `PageController` instance (lazy-initialized)
- Performs actual DOM operations (click, input, scroll, etc.)
- Responds to RPC messages from Background
- Manages visual mask overlay during automation

**Key Components:**

- `PageController` - DOM controller (from `@page-agent/page-controller`)
- RPC handlers for all PageController methods

**Lifecycle:** PageController is created lazily on first RPC call and disposed between tasks. This ensures clean state for each task and enables future multi-page support.

### 3. Side Panel (React UI)

**Files:** `src/entrypoints/sidepanel/`

**Responsibilities:**

- Provides user interface for controlling the agent
- Displays task input and execution history
- Shows real-time agent activity (thinking, executing, etc.)
- Manages LLM configuration settings
- Sends commands to Background and receives event updates

**Key Components:**

- `App.tsx` - Main React component with chat-style UI
- `ConfigPanel` - Settings form for LLM configuration
- Event subscription for real-time updates

## Communication Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                           Side Panel                            │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │  Task Input  │  │ Event Stream │  │    History Display    │  │
│  └──────┬───────┘  └──────▲───────┘  └───────────────────────┘  │
└─────────┼─────────────────┼─────────────────────────────────────┘
          │ Commands        │ Events
          ▼                 │
┌─────────────────────────────────────────────────────────────────┐
│                           Background                            │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                      PageAgentCore                       │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐  │   │
│  │  │     LLM     │  │    Tools    │  │  RemotePageCtrl  │  │   │
│  │  └─────────────┘  └─────────────┘  └────────┬─────────┘  │   │
│  └─────────────────────────────────────────────┼────────────┘   │
└────────────────────────────────────────────────┼────────────────┘
                                                 │ RPC
                                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Content Script                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                      PageController                      │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐  │   │
│  │  │  DOM Tree   │  │   Actions   │  │       Mask       │  │   │
│  │  └─────────────┘  └─────────────┘  └──────────────────┘  │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                         ┌───────────────┐
                         │   Web Page    │
                         │      DOM      │
                         └───────────────┘
```

## Message Protocol

All cross-context communication uses `@webext-core/messaging` for type safety.

### Protocol Definition

**File:** `src/messaging/protocol.ts`

### 1. RPC Protocol (Background → ContentScript)

Used by `RemotePageController` to call `PageController` methods.

```typescript
interface PageControllerRPCProtocol {
  // State queries
  'rpc:getCurrentUrl': () => string
  'rpc:getLastUpdateTime': () => number
  'rpc:getBrowserState': () => BrowserState

  // DOM operations
  'rpc:updateTree': () => string
  'rpc:cleanUpHighlights': () => void

  // Element actions
  'rpc:clickElement': (index: number) => ActionResult
  'rpc:inputText': (data: { index: number; text: string }) => ActionResult
  'rpc:selectOption': (data: { index: number; optionText: string }) => ActionResult
  'rpc:scroll': (options: ScrollOptions) => ActionResult
  'rpc:scrollHorizontally': (options: ScrollHorizontallyOptions) => ActionResult
  'rpc:executeJavascript': (script: string) => ActionResult

  // Mask operations
  'rpc:showMask': () => void
  'rpc:hideMask': () => void

  // Lifecycle
  'rpc:dispose': () => void
}
```

### 2. Command Protocol (SidePanel → Background)

Used by SidePanel UI to control the agent.

```typescript
interface AgentCommandProtocol {
  'agent:execute': (task: string) => void
  'agent:stop': () => void
  'agent:getState': () => AgentState
  'agent:configure': (config: LLMConfig) => void
}
```

### 3. Event Protocol (Background → SidePanel)

Used by Background to push updates to SidePanel.

```typescript
interface AgentEventProtocol {
  'event:status': (status: AgentStatus) => void
  'event:history': (history: HistoricalEvent[]) => void
  'event:activity': (activity: AgentActivity) => void
  'event:stateSnapshot': (state: AgentState) => void
}
```

## Communication Flow

### Task Execution Flow

```
1. User enters task in SidePanel
   └─> SidePanel sends 'agent:execute' command

2. Background receives command
   ├─> Creates PageAgentCore with RemotePageController
   └─> Starts task execution

3. Agent executes step loop:
   ├─> LLM generates next action
   ├─> Agent calls RemotePageController method
   │   └─> RPC message sent to ContentScript
   ├─> ContentScript executes on real PageController
   │   └─> RPC response returned
   ├─> Agent updates history
   └─> Background broadcasts events to SidePanel

4. SidePanel receives events
   └─> Updates UI (status, history, activity)

5. Task completes or user stops
   └─> Agent disposes, status changes to idle/completed/error
```

### Configuration Flow

```
1. User opens Settings in SidePanel
2. User enters API credentials
3. SidePanel sends 'agent:configure' command
4. Background saves config to chrome.storage.local
5. Next agent creation uses new config
```

## File Structure

```
packages/extension/src/
├── agent/
│   └── RemotePageController.ts    # Proxy for PageController
├── entrypoints/
│   ├── background.ts              # Service worker
│   ├── content.ts                 # Content script
│   └── sidepanel/
│       ├── index.html
│       ├── main.tsx
│       └── App.tsx                # Main UI component
├── messaging/
│   ├── protocol.ts                # Message type definitions
│   ├── rpc.ts                     # RPC client for PageController
│   ├── events.ts                  # Event broadcasting utilities
│   └── index.ts                   # Module exports
├── components/ui/                 # shadcn components
├── lib/utils.ts                   # Utility functions
└── assets/index.css               # Tailwind styles
```

## Design Decisions

### Tab ID Binding

**Problem:** When a task completes while the page is not focused (user switched tabs), RPC messages like `hideMask` or `dispose` would be sent to the wrong tab because `chrome.tabs.query({ active: true })` returns the currently active tab, not the original target tab.

**Solution:** `RemotePageController` captures the target tab ID at construction time and binds it to its RPC client. All subsequent RPC calls use this fixed tab ID regardless of which tab is currently active.

```
Task starts → RemotePageController created → tabId captured (e.g., 123)
User switches to another tab (456 is now active)
Task completes → hideMask RPC sent to tab 123 (correct!)
```

### Lazy PageController Lifecycle

**Problem:** PageController was created once when content script loaded and persisted until page unload. If the mask was disposed mid-task, subsequent tasks couldn't show it again.


**Solution:** PageController is now lazy-initialized on first RPC call and fully disposed between tasks. Each task gets a fresh PageController instance with its own mask.

```
Task 1: showMask → creates PageController + Mask → execute → hideMask → dispose → null
Task 2: showMask → creates new PageController + Mask → ...
```

This also prepares for future multi-page workflows where PageController may need to be recreated when navigating between pages.

## Extension Considerations

### Current Limitations (v1)

1. **Single page control only** - Agent controls the active tab where SidePanel was opened
2. **No cross-tab navigation** - Cannot follow links that open in new tabs
3. **Session-based** - Agent state is not persisted across extension restarts

### Future Extension Points

#### Multi-tab Control

To support controlling multiple tabs:

1. Add `tabId` parameter to RPC messages
2. Track tab-to-controller mapping in Background
3. Allow SidePanel to switch between controlled tabs

#### Persistent Sessions

To persist agent sessions:

1. Store session state in `chrome.storage.local`
2. Restore agent on extension startup
3. Handle service worker restarts gracefully

#### Cross-tab Navigation

To follow links in new tabs:

1. Listen to `chrome.tabs.onCreated` events
2. Inject content script into new tabs
3. Transfer control to new tab when navigation occurs

#### Screenshot/Vision Support

To add visual context for the agent:

1. Use `chrome.tabs.captureVisibleTab` for screenshots
2. Send images to vision-capable LLM models
3. Add screenshot tool to agent toolkit

## Security Considerations

1. **API Key Storage** - Keys stored in `chrome.storage.local` (extension-only access)
2. **Content Script Isolation** - Runs in isolated world, not accessible to page scripts
3. **Message Validation** - Only trusted extension contexts can send/receive messages
4. **Permission Scope** - Request minimal permissions needed for functionality

## Development

```bash
# Install dependencies
npm install

# Start development server
npm run dev

# Build for production
npm run build

# Package extension
npm run zip
```
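
## Appendix: Design Decision Sketch

The two decisions described under Design Decisions (Tab ID Binding and Lazy PageController Lifecycle) can be illustrated with a small standalone simulation. This is a minimal sketch, not the extension's actual code: `SendToTab`, `TabBoundController`, `ContentScriptHost`, and `FakePageController` are hypothetical names introduced here, and the transport is an injected function rather than real extension messaging. Only the message names (`rpc:showMask`, `rpc:hideMask`, `rpc:dispose`) come from the protocol above.

```typescript
// Hypothetical transport signature. In the extension this role is played by
// tab-targeted messaging; it is injected here so the sketch runs standalone.
type SendToTab = (tabId: number, message: string) => Promise<void>

// Illustrative stand-in for the real PageController.
class FakePageController {
  maskVisible = false
  showMask(): void { this.maskVisible = true }
  hideMask(): void { this.maskVisible = false }
  dispose(): void { this.maskVisible = false }
}

// Content-script side: the controller is created on the first RPC call and
// dropped on 'rpc:dispose', so the next task starts from a fresh instance.
class ContentScriptHost {
  createdCount = 0
  private controller: FakePageController | null = null

  handle(message: string): void {
    if (message === 'rpc:dispose') {
      this.controller?.dispose()
      this.controller = null
      return
    }
    let controller = this.controller
    if (controller === null) {
      controller = new FakePageController()
      this.controller = controller
      this.createdCount += 1
    }
    if (message === 'rpc:showMask') controller.showMask()
    if (message === 'rpc:hideMask') controller.hideMask()
  }
}

// Background side: the tab ID is captured once at construction and reused for
// every call, so the currently active tab is never consulted again.
class TabBoundController {
  private readonly tabId: number
  private readonly send: SendToTab

  constructor(tabId: number, send: SendToTab) {
    this.tabId = tabId
    this.send = send
  }

  showMask(): Promise<void> { return this.send(this.tabId, 'rpc:showMask') }
  hideMask(): Promise<void> { return this.send(this.tabId, 'rpc:hideMask') }
  dispose(): Promise<void> { return this.send(this.tabId, 'rpc:dispose') }
}

// Simulated browser with two tabs, each running its own content script.
const tabs = new Map<number, ContentScriptHost>([
  [123, new ContentScriptHost()],
  [456, new ContentScriptHost()],
])
const sentTo: number[] = []
const fakeSend: SendToTab = async (tabId, message) => {
  sentTo.push(tabId)
  tabs.get(tabId)?.handle(message)
}

// Task 1 starts on tab 123; the user switches to tab 456 mid-task.
let activeTabId = 123
const task1 = new TabBoundController(activeTabId, fakeSend)
void task1.showMask()
activeTabId = 456 // task1 keeps targeting 123
void task1.hideMask()
void task1.dispose()

// Task 2 on the same page triggers a fresh controller in the content script.
const task2 = new TabBoundController(123, fakeSend)
void task2.showMask()
```

In this simulation every message reaches tab 123 even after the active tab changes, and the content script for tab 123 constructs its controller twice (once per task), while tab 456 never constructs one. Injecting the transport also keeps the binding logic testable without a browser.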