# PageAgentExt Architecture
This document describes the architecture of the Chrome extension version of PageAgent: its execution environments, its communication protocols, and considerations for extending it.
## Environment Definitions
The extension operates across three isolated JavaScript contexts:
### 1. Background (Service Worker)
**File:** `src/entrypoints/background.ts`

**Responsibilities:**
- Hosts the headless `PageAgentCore` instance
- Manages agent lifecycle (create, execute, stop, dispose)
- Stores LLM configuration in `chrome.storage.local`
- Receives commands from SidePanel via messaging
- Broadcasts events to SidePanel for UI updates
- Uses `RemotePageController` to proxy DOM operations to ContentScript

**Key Components:**
- `PageAgentCore` - The AI agent (from `@page-agent/core`)
- `RemotePageController` - Proxy that forwards calls to ContentScript
- Command handlers for `agent:execute`, `agent:stop`, `agent:getState`, `agent:configure`
### 2. Content Script
**File:** `src/entrypoints/content.ts`

**Responsibilities:**
- Runs in the context of web pages
- Hosts the real `PageController` instance
- Performs actual DOM operations (click, input, scroll, etc.)
- Responds to RPC messages from Background
- Manages visual mask overlay during automation

**Key Components:**
- `PageController` - DOM controller (from `@page-agent/page-controller`)
- RPC handlers for all PageController methods
### 3. Side Panel (React UI)
**Files:** `src/entrypoints/sidepanel/`

**Responsibilities:**
- Provides user interface for controlling the agent
- Displays task input and execution history
- Shows real-time agent activity (thinking, executing, etc.)
- Manages LLM configuration settings
- Sends commands to Background and receives event updates

**Key Components:**
- `App.tsx` - Main React component with chat-style UI
- `ConfigPanel` - Settings form for LLM configuration
- Event subscription for real-time updates
## Communication Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                           Side Panel                            │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Task Input   │  │ Event Stream │  │ History Display       │  │
│  └──────┬───────┘  └──────▲───────┘  └───────────────────────┘  │
└─────────┼─────────────────┼─────────────────────────────────────┘
          │ Commands        │ Events
          ▼                 │
┌─────────────────────────────────────────────────────────────────┐
│                           Background                            │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       PageAgentCore                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐   │  │
│  │  │     LLM     │  │    Tools    │  │  RemotePageCtrl  │   │  │
│  │  └─────────────┘  └─────────────┘  └────────┬─────────┘   │  │
│  └─────────────────────────────────────────────┼─────────────┘  │
└────────────────────────────────────────────────┼────────────────┘
                                                 │ RPC
┌─────────────────────────────────────────────────────────────────┐
│                         Content Script                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      PageController                       │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────┐   │  │
│  │  │  DOM Tree   │  │   Actions   │  │       Mask       │   │  │
│  │  └─────────────┘  └─────────────┘  └──────────────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                         ┌───────────────┐
                         │   Web Page    │
                         │      DOM      │
                         └───────────────┘
```
## Message Protocol
All cross-context communication uses `@webext-core/messaging` for type safety.
### Protocol Definition
**File:** `src/messaging/protocol.ts`
### 1. RPC Protocol (Background → ContentScript)
Used by `RemotePageController` to call `PageController` methods.
```typescript
interface PageControllerRPCProtocol {
  // State queries
  'rpc:getCurrentUrl': () => string
  'rpc:getLastUpdateTime': () => number
  'rpc:getBrowserState': () => BrowserState

  // DOM operations
  'rpc:updateTree': () => string
  'rpc:cleanUpHighlights': () => void

  // Element actions
  'rpc:clickElement': (index: number) => ActionResult
  'rpc:inputText': (data: { index: number; text: string }) => ActionResult
  'rpc:selectOption': (data: { index: number; optionText: string }) => ActionResult
  'rpc:scroll': (options: ScrollOptions) => ActionResult
  'rpc:scrollHorizontally': (options: ScrollHorizontallyOptions) => ActionResult
  'rpc:executeJavascript': (script: string) => ActionResult

  // Mask operations
  'rpc:showMask': () => void
  'rpc:hideMask': () => void

  // Lifecycle
  'rpc:dispose': () => void
}
```
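The proxy pattern behind this protocol can be illustrated with a minimal, self-contained sketch. It is not the actual `RemotePageController` implementation: `RemotePageControllerSketch`, `FakePageController`, `Transport`, and `createLoopbackTransport` are hypothetical names, the `ActionResult` shape is assumed, and the loopback transport stands in for the real messaging layer so the example runs outside a browser.

```typescript
// Hypothetical stand-in for the ActionResult type referenced by the protocol.
interface ActionResult {
  success: boolean
  message: string
}

// One RPC request: a protocol message name plus its arguments.
interface RpcRequest {
  name: string
  args: unknown[]
}

// Transport abstraction; in the extension this role is played by the messaging layer.
type Transport = (request: RpcRequest) => Promise<unknown>

// Background side: exposes PageController-like methods and forwards each one as an RPC.
class RemotePageControllerSketch {
  private send: Transport

  constructor(send: Transport) {
    this.send = send
  }

  getCurrentUrl(): Promise<string> {
    return this.send({ name: 'rpc:getCurrentUrl', args: [] }) as Promise<string>
  }

  clickElement(index: number): Promise<ActionResult> {
    return this.send({ name: 'rpc:clickElement', args: [index] }) as Promise<ActionResult>
  }
}

// ContentScript side: a fake controller that records calls instead of touching the DOM.
class FakePageController {
  clicked: number[] = []

  getCurrentUrl(): string {
    return 'https://example.com/'
  }

  clickElement(index: number): ActionResult {
    this.clicked.push(index)
    return { success: true, message: `clicked element ${index}` }
  }
}

// Loopback transport: dispatches requests to the fake controller in the same process.
function createLoopbackTransport(controller: FakePageController): Transport {
  return async ({ name, args }) => {
    switch (name) {
      case 'rpc:getCurrentUrl':
        return controller.getCurrentUrl()
      case 'rpc:clickElement':
        return controller.clickElement(args[0] as number)
      default:
        throw new Error(`Unknown RPC message: ${name}`)
    }
  }
}
```

Note that every method on the proxy returns a `Promise`, even when the underlying `PageController` method is synchronous, because each call crosses a context boundary.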
### 2. Command Protocol (SidePanel → Background)
Used by SidePanel UI to control the agent.
```typescript
interface AgentCommandProtocol {
  'agent:execute': (task: string) => void
  'agent:stop': () => void
  'agent:getState': () => AgentState
  'agent:configure': (config: LLMConfig) => void
}
```
### 3. Event Protocol (Background → SidePanel)
Used by Background to push updates to SidePanel.
```typescript
interface AgentEventProtocol {
  'event:status': (status: AgentStatus) => void
  'event:history': (history: HistoricalEvent[]) => void
  'event:activity': (activity: AgentActivity) => void
  'event:stateSnapshot': (state: AgentState) => void
}
```
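The push model of the event protocol can be sketched as a small typed event bus. This is an in-process illustration only: `AgentEventBus` is a hypothetical name, the payload shapes of `AgentStatus` and `AgentActivity` are assumptions, and the real extension sends these events across contexts rather than calling listeners directly.

```typescript
// Hypothetical payload shapes; the real AgentStatus and AgentActivity types may differ.
type AgentStatus = 'idle' | 'running' | 'completed' | 'error'

interface AgentActivity {
  kind: string
  detail: string
}

// Event name -> payload type, mirroring two entries of the event protocol.
interface AgentEventMap {
  'event:status': AgentStatus
  'event:activity': AgentActivity
}

type Listener<T> = (payload: T) => void

class AgentEventBus {
  private listeners = new Map<string, Set<Listener<unknown>>>()

  // Subscribe to one event; the returned function removes the listener again.
  on<K extends keyof AgentEventMap>(name: K, listener: Listener<AgentEventMap[K]>): () => void {
    const entry = listener as Listener<unknown>
    const set = this.listeners.get(name) ?? new Set<Listener<unknown>>()
    set.add(entry)
    this.listeners.set(name, set)
    return () => {
      set.delete(entry)
    }
  }

  // Deliver a payload to every listener currently subscribed to the event.
  broadcast<K extends keyof AgentEventMap>(name: K, payload: AgentEventMap[K]): void {
    this.listeners.get(name)?.forEach((listener) => listener(payload))
  }
}
```

Keying both `on` and `broadcast` on the same event map means a listener for `'event:status'` is type-checked against the status payload, which is the same guarantee the typed protocol is meant to give across contexts.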
## Communication Flow
### Task Execution Flow
```
1. User enters task in SidePanel
   └─> SidePanel sends 'agent:execute' command
2. Background receives command
   ├─> Creates PageAgentCore with RemotePageController
   └─> Starts task execution
3. Agent executes step loop:
   ├─> LLM generates next action
   ├─> Agent calls RemotePageController method
   │   └─> RPC message sent to ContentScript
   ├─> ContentScript executes on real PageController
   │   └─> RPC response returned
   ├─> Agent updates history
   └─> Background broadcasts events to SidePanel
4. SidePanel receives events
   └─> Updates UI (status, history, activity)
5. Task completes or user stops
   └─> Agent disposes, status changes to idle/completed/error
```
```
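Step 3 of the flow above can be sketched as a plain async loop. This is a simplified illustration, not the `PageAgentCore` implementation: `runTask`, `LoopDeps`, `NextStep`, and `StopSignal` are hypothetical names, the `'running'` status and the `maxSteps` budget are assumptions, and the LLM call, RPC call, and event broadcast are injected as functions so the loop can run without a browser.

```typescript
// Hypothetical status values; the flow above only names idle, completed and error.
type Status = 'idle' | 'running' | 'completed' | 'error'

interface NextStep {
  action: string // for example "click 3"
  done: boolean // true when the model decides the task is finished
}

interface LoopDeps {
  nextStep: (history: string[]) => Promise<NextStep> // stands in for the LLM call
  perform: (action: string) => Promise<string> // stands in for a RemotePageController RPC
  emit: (status: Status, history: string[]) => void // stands in for the event broadcast
}

interface StopSignal {
  stopped: boolean
}

async function runTask(
  task: string,
  deps: LoopDeps,
  signal: StopSignal,
  maxSteps = 20,
): Promise<Status> {
  const history: string[] = [`task: ${task}`]
  // Emit the final status once and hand it back to the caller.
  const finish = (status: Status): Status => {
    deps.emit(status, history)
    return status
  }

  deps.emit('running', history)
  try {
    for (let step = 0; step < maxSteps; step++) {
      if (signal.stopped) return finish('idle') // user pressed stop
      const next = await deps.nextStep(history)
      if (next.done) return finish('completed')
      const result = await deps.perform(next.action)
      history.push(`${next.action} -> ${result}`)
      deps.emit('running', history) // lets the UI refresh after every step
    }
    return finish('error') // step budget exhausted
  } catch {
    return finish('error') // LLM or RPC failure
  }
}
```

Checking the stop flag at the top of each iteration means a stop request takes effect between steps, never in the middle of a DOM action.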
### Configuration Flow
```
1. User opens Settings in SidePanel
2. User enters API credentials
3. SidePanel sends 'agent:configure' command
4. Background saves config to chrome.storage.local
5. Next agent creation uses new config
```
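The save-then-load-on-next-creation behaviour can be sketched against a small storage interface. The names here are assumptions for illustration: `KeyValueStore` models only the promise-based `get`/`set` shape of `chrome.storage.local`, `MemoryStore` is an in-memory stand-in so the example runs outside the browser, and the `LLMConfig` fields and the `llmConfig` key are hypothetical.

```typescript
// Hypothetical config fields; the real LLMConfig type may differ.
interface LLMConfig {
  baseURL: string
  apiKey: string
  model: string
}

// Minimal async key-value interface modeled on the get/set calls of chrome.storage.local.
interface KeyValueStore {
  get(key: string): Promise<Record<string, unknown>>
  set(items: Record<string, unknown>): Promise<void>
}

// In-memory stand-in used here instead of the real extension storage.
class MemoryStore implements KeyValueStore {
  private data: Record<string, unknown> = {}

  async get(key: string): Promise<Record<string, unknown>> {
    const result: Record<string, unknown> = {}
    if (key in this.data) result[key] = this.data[key]
    return result
  }

  async set(items: Record<string, unknown>): Promise<void> {
    Object.assign(this.data, items)
  }
}

const CONFIG_KEY = 'llmConfig' // hypothetical storage key

// Step 4: persist the config received with 'agent:configure'.
async function saveConfig(store: KeyValueStore, config: LLMConfig): Promise<void> {
  await store.set({ [CONFIG_KEY]: config })
}

// Step 5: read the config back when the next agent is created.
async function loadConfig(store: KeyValueStore): Promise<LLMConfig | undefined> {
  const items = await store.get(CONFIG_KEY)
  return items[CONFIG_KEY] as LLMConfig | undefined
}
```

Because the config is read at agent creation time rather than cached, a running agent keeps the settings it started with and only the next one picks up the change.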
## File Structure
```
packages/extension/src/
├── agent/
│   └── RemotePageController.ts    # Proxy for PageController
├── entrypoints/
│   ├── background.ts              # Service worker
│   ├── content.ts                 # Content script
│   └── sidepanel/
│       ├── index.html
│       ├── main.tsx
│       └── App.tsx                # Main UI component
├── messaging/
│   ├── protocol.ts                # Message type definitions
│   ├── rpc.ts                     # RPC client for PageController
│   ├── events.ts                  # Event broadcasting utilities
│   └── index.ts                   # Module exports
├── components/ui/                 # shadcn components
├── lib/utils.ts                   # Utility functions
└── assets/index.css               # Tailwind styles
```
## Extension Considerations
### Current Limitations (v1)
1. **Single page control only** - Agent controls the active tab where SidePanel was opened
2. **No cross-tab navigation** - Cannot follow links that open in new tabs
3. **Session-based** - Agent state is not persisted across extension restarts
### Future Extension Points
#### Multi-tab Control
To support controlling multiple tabs:
1. Add `tabId` parameter to RPC messages
2. Track tab-to-controller mapping in Background
3. Allow SidePanel to switch between controlled tabs
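One possible shape for the tab-to-controller mapping is sketched below. It is a design illustration under stated assumptions, not existing code: `TabRpcMessage` and `TabControllerRegistry` are hypothetical names, and the controller type is left generic so the sketch does not depend on `RemotePageController`.

```typescript
// An RPC message extended with the target tab, as suggested in step 1.
interface TabRpcMessage {
  tabId: number
  name: string
  args: unknown[]
}

// Tracks one controller per tab plus the tab currently selected in the SidePanel.
class TabControllerRegistry<C> {
  private controllers = new Map<number, C>()
  private activeTabId: number | undefined = undefined

  register(tabId: number, controller: C): void {
    this.controllers.set(tabId, controller)
    // The first registered tab becomes the active one by default.
    if (this.activeTabId === undefined) this.activeTabId = tabId
  }

  unregister(tabId: number): void {
    this.controllers.delete(tabId)
    if (this.activeTabId === tabId) {
      // Fall back to any remaining tab, or to no active tab at all.
      const remaining = Array.from(this.controllers.keys())
      this.activeTabId = remaining.length > 0 ? remaining[0] : undefined
    }
  }

  // Step 3: let the SidePanel switch between controlled tabs.
  switchTo(tabId: number): boolean {
    if (!this.controllers.has(tabId)) return false
    this.activeTabId = tabId
    return true
  }

  getActiveTabId(): number | undefined {
    return this.activeTabId
  }

  // Step 2: pick the controller that should handle a tab-addressed message.
  resolve(message: TabRpcMessage): C {
    const controller = this.controllers.get(message.tabId)
    if (controller === undefined) {
      throw new Error(`No controller registered for tab ${message.tabId}`)
    }
    return controller
  }
}
```

Failing loudly in `resolve` for an unknown `tabId` keeps a closed or never-registered tab from silently swallowing agent actions.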
#### Persistent Sessions
To persist agent sessions:
1. Store session state in `chrome.storage.local`
2. Restore agent on extension startup
3. Handle service worker restarts gracefully
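A minimal sketch of the save and restore steps follows, assuming a session can be reduced to a JSON-serializable snapshot. `SessionSnapshot`, `serializeSession`, and `restoreSession` are hypothetical names, and treating an interrupted `running` session as an error is one possible policy for service worker restarts, not a documented behaviour.

```typescript
type SessionStatus = 'idle' | 'running' | 'completed' | 'error'

// Hypothetical snapshot shape: only data that survives JSON serialization.
interface SessionSnapshot {
  task: string
  status: SessionStatus
  history: string[]
}

function serializeSession(snapshot: SessionSnapshot): string {
  return JSON.stringify(snapshot)
}

// Runtime check, since stored data may be missing, outdated or corrupted.
function isSnapshot(value: unknown): value is SessionSnapshot {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  return (
    typeof v.task === 'string' &&
    typeof v.status === 'string' &&
    ['idle', 'running', 'completed', 'error'].indexOf(v.status) !== -1 &&
    Array.isArray(v.history) &&
    v.history.every((entry) => typeof entry === 'string')
  )
}

function restoreSession(raw: string | undefined): SessionSnapshot | undefined {
  if (raw === undefined) return undefined
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  } catch {
    return undefined // unreadable data: start with a fresh session
  }
  if (!isSnapshot(parsed)) return undefined
  if (parsed.status === 'running') {
    // The step loop died with the service worker and cannot resume mid-step.
    return {
      ...parsed,
      status: 'error',
      history: [...parsed.history, 'interrupted by restart'],
    }
  }
  return parsed
}
```

The serialized string would be the value written to `chrome.storage.local` in step 1 and read back on startup in step 2.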
#### Cross-tab Navigation
To follow links in new tabs:
1. Listen to `chrome.tabs.onCreated` events
2. Inject content script into new tabs
3. Transfer control to new tab when navigation occurs
#### Screenshot/Vision Support
To add visual context for the agent:
1. Use `chrome.tabs.captureVisibleTab` for screenshots
2. Send images to vision-capable LLM models
3. Add screenshot tool to agent toolkit
## Security Considerations
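1. **API Key Storage** - Keys are stored in `chrome.storage.local`, which web pages cannot read. The data is not encrypted, and by default the area is also exposed to the extension's own content scripts.
2. **Content Script Isolation** - The content script runs in an isolated world: page scripts cannot read its JavaScript variables or functions, although both share the same DOM.
3. **Message Validation** - Messages travel over the extension's internal messaging channel, which only the extension's own contexts (Background, ContentScript, SidePanel) can use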
4. **Permission Scope** - Request minimal permissions needed for functionality
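For the message validation point, a handler can additionally check who sent a message before acting on it. The sketch below is an assumption-laden illustration: `SenderLike` copies only the `id`, `url`, and `tab` fields of the sender object Chrome attaches to runtime messages, `isOwnExtension` and `isExtensionPage` are hypothetical helpers, and the rule that `agent:*` commands should come from an extension page outside any tab (such as the SidePanel) rather than from a content script is one possible policy.

```typescript
// Subset of the sender information Chrome attaches to runtime messages.
interface SenderLike {
  id?: string
  url?: string
  tab?: { id?: number }
}

// True when the message was sent by this extension.
function isOwnExtension(sender: SenderLike, extensionId: string): boolean {
  return sender.id === extensionId
}

// True when the sender is one of this extension's own pages and not a content script.
// Assumption: content scripts always carry a `tab`, while the SidePanel does not.
function isExtensionPage(sender: SenderLike, extensionId: string): boolean {
  if (!isOwnExtension(sender, extensionId)) return false
  if (sender.tab !== undefined) return false
  const prefix = `chrome-extension://${extensionId}/`
  return sender.url !== undefined && sender.url.indexOf(prefix) === 0
}
```

In the Background, the extension's own ID would come from `chrome.runtime.id`; it is passed in as a parameter here so the helpers stay testable without the browser.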
## Development
```bash
# Install dependencies
npm install
# Start development server
npm run dev
# Build for production
npm run build
# Package extension
npm run zip
```