293 lines
11 KiB
Markdown
293 lines
11 KiB
Markdown
# PageAgentExt Architecture
|
|
|
|
This document describes the architecture of the Chrome extension version of PageAgent, including environment definitions, communication protocols, and extension considerations.
|
|
|
|
## Environment Definitions
|
|
|
|
The extension operates across three isolated JavaScript contexts:
|
|
|
|
### 1. Background (Service Worker)
|
|
|
|
**File:** `src/entrypoints/background.ts`
|
|
|
|
**Responsibilities:**
|
|
|
|
- Hosts the headless `PageAgentCore` instance
|
|
- Manages agent lifecycle (create, execute, stop, dispose)
|
|
- Stores LLM configuration in `chrome.storage.local`
|
|
- Receives commands from SidePanel via messaging
|
|
- Broadcasts events to SidePanel for UI updates
|
|
- Uses `RemotePageController` to proxy DOM operations to ContentScript
|
|
|
|
**Key Components:**
|
|
|
|
- `PageAgentCore` - The AI agent (from `@page-agent/core`)
|
|
- `RemotePageController` - Proxy that forwards calls to ContentScript
|
|
- Command handlers for `agent:execute`, `agent:stop`, `agent:configure`
|
|
|
|
### 2. Content Script
|
|
|
|
**File:** `src/entrypoints/content.ts`
|
|
|
|
**Responsibilities:**
|
|
|
|
- Runs in the context of web pages
|
|
- Hosts the real `PageController` instance
|
|
- Performs actual DOM operations (click, input, scroll, etc.)
|
|
- Responds to RPC messages from Background
|
|
- Manages visual mask overlay during automation
|
|
|
|
**Key Components:**
|
|
|
|
- `PageController` - DOM controller (from `@page-agent/page-controller`)
|
|
- RPC handlers for all PageController methods
|
|
|
|
### 3. Side Panel (React UI)
|
|
|
|
**Files:** `src/entrypoints/sidepanel/`
|
|
|
|
**Responsibilities:**
|
|
|
|
- Provides user interface for controlling the agent
|
|
- Displays task input and execution history
|
|
- Shows real-time agent activity (thinking, executing, etc.)
|
|
- Manages LLM configuration settings
|
|
- Sends commands to Background and receives event updates
|
|
|
|
**Key Components:**
|
|
|
|
- `App.tsx` - Main React component with chat-style UI
|
|
- `ConfigPanel` - Settings form for LLM configuration
|
|
- Event subscription for real-time updates
|
|
|
|
## Communication Architecture
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Side Panel │
|
|
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────────────┐ │
|
|
│ │ Task Input │ │ Event Stream │ │ History Display │ │
|
|
│ └──────┬───────┘ └──────▲───────┘ └───────────────────────┘ │
|
|
└─────────┼─────────────────┼─────────────────────────────────────┘
|
|
│ Commands │ Events
|
|
▼ │
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Background │
|
|
│ ┌──────────────────────────────────────────────────────────┐ │
|
|
│ │ PageAgentCore │ │
|
|
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────────────┐ │ │
|
|
│ │ │ LLM │ │ Tools │ │ RemotePageCtrl │ │ │
|
|
│ │ └─────────────┘ └─────────────┘ └────────┬─────────┘ │ │
|
|
│ └───────────────────────────────────────────────┼───────────┘ │
|
|
└───────────────────────────────────────────────────┼──────────────┘
|
|
│ RPC
|
|
▼
|
|
┌─────────────────────────────────────────────────────────────────┐
|
|
│ Content Script │
|
|
│ ┌──────────────────────────────────────────────────────────┐ │
|
|
│ │ PageController │ │
|
|
│ │ ┌─────────────┐ ┌─────────────┐ ┌──────────────────┐ │ │
|
|
│ │ │ DOM Tree │ │ Actions │ │ Mask │ │ │
|
|
│ │ └─────────────┘ └─────────────┘ └──────────────────┘ │ │
|
|
│ └───────────────────────────────────────────────────────────┘ │
|
|
└─────────────────────────────────────────────────────────────────┘
|
|
│
|
|
▼
|
|
┌───────────────┐
|
|
│ Web Page │
|
|
│ DOM │
|
|
└───────────────┘
|
|
```
|
|
|
|
## Message Protocol
|
|
|
|
All cross-context communication uses `@webext-core/messaging` for type safety.
|
|
|
|
### Protocol Definition
|
|
|
|
**File:** `src/messaging/protocol.ts`
|
|
|
|
### 1. RPC Protocol (Background → ContentScript)
|
|
|
|
Used by `RemotePageController` to call `PageController` methods.
|
|
|
|
```typescript
|
|
interface PageControllerRPCProtocol {
|
|
// State queries
|
|
'rpc:getCurrentUrl': () => string
|
|
'rpc:getLastUpdateTime': () => number
|
|
'rpc:getBrowserState': () => BrowserState
|
|
|
|
// DOM operations
|
|
'rpc:updateTree': () => string
|
|
'rpc:cleanUpHighlights': () => void
|
|
|
|
// Element actions
|
|
'rpc:clickElement': (index: number) => ActionResult
|
|
'rpc:inputText': (data: { index: number; text: string }) => ActionResult
|
|
'rpc:selectOption': (data: { index: number; optionText: string }) => ActionResult
|
|
'rpc:scroll': (options: ScrollOptions) => ActionResult
|
|
'rpc:scrollHorizontally': (options: ScrollHorizontallyOptions) => ActionResult
|
|
'rpc:executeJavascript': (script: string) => ActionResult
|
|
|
|
// Mask operations
|
|
'rpc:showMask': () => void
|
|
'rpc:hideMask': () => void
|
|
|
|
// Lifecycle
|
|
'rpc:dispose': () => void
|
|
}
|
|
```
|
|
|
|
### 2. Command Protocol (SidePanel → Background)
|
|
|
|
Used by SidePanel UI to control the agent.
|
|
|
|
```typescript
|
|
interface AgentCommandProtocol {
|
|
'agent:execute': (task: string) => void
|
|
'agent:stop': () => void
|
|
'agent:getState': () => AgentState
|
|
'agent:configure': (config: LLMConfig) => void
|
|
}
|
|
```
|
|
|
|
### 3. Event Protocol (Background → SidePanel)
|
|
|
|
Used by Background to push updates to SidePanel.
|
|
|
|
```typescript
|
|
interface AgentEventProtocol {
|
|
'event:status': (status: AgentStatus) => void
|
|
'event:history': (history: HistoricalEvent[]) => void
|
|
'event:activity': (activity: AgentActivity) => void
|
|
'event:stateSnapshot': (state: AgentState) => void
|
|
}
|
|
```
|
|
|
|
## Communication Flow
|
|
|
|
### Task Execution Flow
|
|
|
|
```
|
|
1. User enters task in SidePanel
|
|
└─> SidePanel sends 'agent:execute' command
|
|
|
|
2. Background receives command
|
|
├─> Creates PageAgentCore with RemotePageController
|
|
└─> Starts task execution
|
|
|
|
3. Agent executes step loop:
|
|
├─> LLM generates next action
|
|
├─> Agent calls RemotePageController method
|
|
│ └─> RPC message sent to ContentScript
|
|
├─> ContentScript executes on real PageController
|
|
│ └─> RPC response returned
|
|
├─> Agent updates history
|
|
└─> Background broadcasts events to SidePanel
|
|
|
|
4. SidePanel receives events
|
|
└─> Updates UI (status, history, activity)
|
|
|
|
5. Task completes or user stops
|
|
└─> Agent disposes, status changes to idle/completed/error
|
|
```
|
|
|
|
### Configuration Flow
|
|
|
|
```
|
|
1. User opens Settings in SidePanel
|
|
2. User enters API credentials
|
|
3. SidePanel sends 'agent:configure' command
|
|
4. Background saves config to chrome.storage.local
|
|
5. Next agent creation uses new config
|
|
```
|
|
|
|
## File Structure
|
|
|
|
```
|
|
packages/extension/src/
|
|
├── agent/
|
|
│ └── RemotePageController.ts # Proxy for PageController
|
|
├── entrypoints/
|
|
│ ├── background.ts # Service worker
|
|
│ ├── content.ts # Content script
|
|
│ └── sidepanel/
|
|
│ ├── index.html
|
|
│ ├── main.tsx
|
|
│ └── App.tsx # Main UI component
|
|
├── messaging/
|
|
│ ├── protocol.ts # Message type definitions
|
|
│ ├── rpc.ts # RPC client for PageController
|
|
│ ├── events.ts # Event broadcasting utilities
|
|
│ └── index.ts # Module exports
|
|
├── components/ui/ # shadcn components
|
|
├── lib/utils.ts # Utility functions
|
|
└── assets/index.css # Tailwind styles
|
|
```
|
|
|
|
## Extension Considerations
|
|
|
|
### Current Limitations (v1)
|
|
|
|
1. **Single page control only** - Agent controls the active tab where SidePanel was opened
|
|
2. **No cross-tab navigation** - Cannot follow links that open in new tabs
|
|
3. **Session-based** - Agent state is not persisted across extension restarts
|
|
|
|
### Future Extension Points
|
|
|
|
#### Multi-tab Control
|
|
|
|
To support controlling multiple tabs:
|
|
|
|
1. Add `tabId` parameter to RPC messages
|
|
2. Track tab-to-controller mapping in Background
|
|
3. Allow SidePanel to switch between controlled tabs
|
|
|
|
#### Persistent Sessions
|
|
|
|
To persist agent sessions:
|
|
|
|
1. Store session state in `chrome.storage.local`
|
|
2. Restore agent on extension startup
|
|
3. Handle service worker restarts gracefully
|
|
|
|
#### Cross-tab Navigation
|
|
|
|
To follow links in new tabs:
|
|
|
|
1. Listen to `chrome.tabs.onCreated` events
|
|
2. Inject content script into new tabs
|
|
3. Transfer control to new tab when navigation occurs
|
|
|
|
#### Screenshot/Vision Support
|
|
|
|
To add visual context for the agent:
|
|
|
|
1. Use `chrome.tabs.captureVisibleTab` for screenshots
|
|
2. Send images to vision-capable LLM models
|
|
3. Add screenshot tool to agent toolkit
|
|
|
|
## Security Considerations
|
|
|
|
1. **API Key Storage** - Keys stored in `chrome.storage.local` (extension-only access)
|
|
2. **Content Script Isolation** - Runs in isolated world, not accessible to page scripts
|
|
3. **Message Validation** - Only trusted extension contexts can send/receive messages
|
|
4. **Permission Scope** - Request minimal permissions needed for functionality
|
|
|
|
## Development
|
|
|
|
```bash
|
|
# Install dependencies
|
|
npm install
|
|
|
|
# Start development server
|
|
npm run dev
|
|
|
|
# Build for production
|
|
npm run build
|
|
|
|
# Package extension
|
|
npm run zip
|
|
```
|