page-agent/packages/extension/structure.md
2026-01-24 19:37:46 +08:00


# PageAgentExt Architecture
This document describes the MV3-compliant architecture of the Chrome extension version of PageAgent.
## Design Principles
The architecture follows Chrome MV3 Service Worker constraints:
1. **Service Worker is stateless** - No long-running loops, no in-memory state
2. **Agent runs in frontend context** - SidePanel hosts all agent logic
3. **SW is a message relay** - Only forwards messages between contexts
4. **Event-driven** - All operations are triggered by user actions or message events
## Environment Definitions
The extension operates across three isolated JavaScript contexts:
### 1. Side Panel (Frontend - Agent Host)
**Files:** `src/entrypoints/sidepanel/`
**Responsibilities:**
- Hosts `PageAgentCore` instance and main execution loop
- Manages `TabsManager` for multi-tab control
- Uses `RemotePageController` to proxy DOM operations via SW
- Stores agent state (task, history, status)
- Provides React UI for user interaction
- Handles `shouldShowMask` queries from content scripts
**Key Components:**
- `AgentController` - Encapsulates agent lifecycle, isolated from UI
- `useAgent` hook - React integration for AgentController
- `App.tsx` - Main UI component
- `ConfigPanel` - LLM settings
**Lifecycle:** When sidepanel closes, agent disposes naturally. No state persists in SW.
### 2. Background (Service Worker - Stateless Relay)
**File:** `src/entrypoints/background.ts`
**Responsibilities:**
- Relays RPC messages from SidePanel to ContentScript
- Forwards tab and window events to SidePanel (`chrome.tabs.onRemoved`, `onUpdated`, `onActivated`, and `chrome.windows.onFocusChanged`)
- Opens sidepanel on action click
- **NO** agent logic, **NO** state
**Message Flows:**
```
SidePanel → SW → ContentScript (RPC calls)
ContentScript → SW → SidePanel (mask state queries)
SW → SidePanel (tab events)
```
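The first two flows can be sketched as a pure routing function, which is one way to keep the relay free of state. The `Envelope` and `Route` shapes and the `tabId` field are assumptions made for illustration; they are not taken from `background.ts`.

```typescript
// Hypothetical envelope: only the fields the relay needs to inspect.
interface Envelope {
  type: string;
  tabId?: number; // assumed field: target tab for SidePanel → ContentScript calls
  [key: string]: unknown;
}

type Route =
  | { to: "contentScript"; tabId: number; message: Envelope }
  | { to: "sidePanel"; message: Envelope }
  | { to: "drop"; reason: string };

// Decide where a message goes using only the message itself, so the
// Service Worker can be terminated and restarted between calls.
function route(message: Envelope, senderTabId?: number): Route {
  switch (message.type) {
    case "rpc:call":
      // SidePanel → SW → ContentScript
      if (typeof message.tabId !== "number") {
        return { to: "drop", reason: "rpc:call without tabId" };
      }
      return {
        to: "contentScript",
        tabId: message.tabId,
        message: { ...message, type: "cs:rpc" },
      };
    case "cs:query":
      // ContentScript → SW → SidePanel; attach the sender's tab so the
      // panel can answer per tab.
      return { to: "sidePanel", message: { ...message, tabId: senderTabId } };
    default:
      return { to: "drop", reason: `unknown type: ${message.type}` };
  }
}
```

In a real background script the result would presumably be handed to `chrome.tabs.sendMessage(tabId, message)` or `chrome.runtime.sendMessage(message)`; the routing decision itself needs no extension APIs, which makes it easy to unit-test.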
### 3. Content Script
**File:** `src/entrypoints/content.ts`
**Responsibilities:**
- Runs in web page context
- Hosts real `PageController` instance (lazy-initialized)
- Handles RPC messages for DOM operations
- Queries SidePanel for mask state on page load
- Manages visual mask overlay
**Lifecycle:** PageController is created on first RPC call and disposed between tasks.
## Architecture Diagram
```
┌──────────────────────────────────────────────────────────────────┐
│                      Side Panel (Frontend)                       │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                       AgentController                        │ │
│ │ ┌───────────────┐  ┌─────────────┐  ┌──────────────────────┐ │ │
│ │ │ PageAgentCore │  │ TabsManager │  │ RemotePageController │ │ │
│ │ └───────────────┘  └─────────────┘  └───────────┬──────────┘ │ │
│ └─────────────────────────────────────────────────┼────────────┘ │
│                                                   │              │
│ ┌──────────────┐  ┌──────────────────┐            │              │
│ │   React UI   │  │  Query Handler   │◄───────────┼───────────┐  │
│ │  (App.tsx)   │  │ (shouldShowMask) │            │           │  │
│ └──────────────┘  └──────────────────┘            │           │  │
└───────────────────────────────────────────────────┼───────────┼──┘
                                                    │           │
                                           RPC Call │ Query     │
                                                    ▼           │
┌──────────────────────────────────────────────────────────────────┐
│                   Background (Service Worker)                    │
│                                                                  │
│                        ┌────────────────┐                        │
│                        │ Message Relay  │                        │
│                        │  (stateless)   │                        │
│                        └───────┬────────┘                        │
│                                │                                 │
│ Tab Events ────────────────────┼────────────────────► SidePanel  │
│ (removed, updated,             │                                 │
│  activated, focusChanged)      │                                 │
└────────────────────────────────┼─────────────────────────────────┘
                                 │ RPC Forward
┌──────────────────────────────────────────────────────────────────┐
│                          Content Script                          │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                        PageController                        │ │
│ │   ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐   │ │
│ │   │  DOM Tree   │   │   Actions   │   │       Mask       │   │ │
│ │   └─────────────┘   └─────────────┘   └──────────────────┘   │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
                         ┌───────────────┐
                         │   Web Page    │
                         │      DOM      │
                         └───────────────┘
```
## Message Protocol
All messages use a simple type-based protocol defined in `src/messaging/protocol.ts`.
### Message Types
| Type | Direction | Purpose |
|------|-----------|---------|
| `rpc:call` | SidePanel → SW | Request to call PageController method |
| `rpc:response` | SW → SidePanel | Response from PageController |
| `cs:rpc` | SW → ContentScript | Forwarded RPC call |
| `cs:query` | ContentScript → SW | Query to SidePanel (e.g., shouldShowMask) |
| `query:response` | SW → ContentScript | Response to query |
| `tab:event` | SW → SidePanel | Tab events (removed/updated/activated/focusChanged) |
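One way to express this table in TypeScript is a discriminated union keyed on `type`. The sketch below is illustrative only: the `type` strings come from the table, while every other field name (`id`, `tabId`, `method`, `args`, `result`, `error`, `event`) is an assumption and may differ from what `src/messaging/protocol.ts` actually defines.

```typescript
// Only the `type` literals are from the table above; all other fields are assumed.
type RpcCall = { type: "rpc:call"; id: string; tabId: number; method: string; args: unknown[] };
type RpcResponse = { type: "rpc:response"; id: string; result?: unknown; error?: string };
type CsRpc = { type: "cs:rpc"; id: string; method: string; args: unknown[] };
type CsQuery = { type: "cs:query"; id: string; query: "shouldShowMask" };
type QueryResponse = { type: "query:response"; id: string; result: unknown };
type TabEvent = {
  type: "tab:event";
  event: "removed" | "updated" | "activated" | "focusChanged";
  tabId?: number;
};

type ExtensionMessage = RpcCall | RpcResponse | CsRpc | CsQuery | QueryResponse | TabEvent;

const MESSAGE_TYPES: ReadonlySet<string> = new Set([
  "rpc:call", "rpc:response", "cs:rpc", "cs:query", "query:response", "tab:event",
]);

// Runtime guard for values arriving from another context.
// It checks only the `type` tag, not the remaining fields.
function isExtensionMessage(value: unknown): value is ExtensionMessage {
  if (typeof value !== "object" || value === null) return false;
  const type = (value as { type?: unknown }).type;
  return typeof type === "string" && MESSAGE_TYPES.has(type);
}

// The switch narrows on `type`; the `never` assignment makes the compiler
// flag any message type that is added to the union but not handled here.
function describeDirection(message: ExtensionMessage): string {
  switch (message.type) {
    case "rpc:call": return "SidePanel → SW";
    case "rpc:response": return "SW → SidePanel";
    case "cs:rpc": return "SW → ContentScript";
    case "cs:query": return "ContentScript → SW";
    case "query:response": return "SW → ContentScript";
    case "tab:event": return "SW → SidePanel";
    default: {
      const unreachable: never = message;
      throw new Error(`unhandled message: ${JSON.stringify(unreachable)}`);
    }
  }
}
```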
### RPC Methods
All PageController methods are available via RPC:
- State: `getCurrentUrl`, `getLastUpdateTime`, `getBrowserState`
- DOM: `updateTree`, `cleanUpHighlights`
- Actions: `clickElement`, `inputText`, `selectOption`, `scroll`, `scrollHorizontally`, `executeJavascript`
- Mask: `showMask`, `hideMask`
- Lifecycle: `dispose`
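A proxy over these methods could be generated from the method list rather than written by hand. The following is a minimal sketch, not the actual `RemotePageController`: the method names are taken from the list above, but `createRemoteController`, the `Send` transport, and the message shape are hypothetical.

```typescript
// Method names from the list above.
const RPC_METHODS = [
  "getCurrentUrl", "getLastUpdateTime", "getBrowserState",
  "updateTree", "cleanUpHighlights",
  "clickElement", "inputText", "selectOption", "scroll", "scrollHorizontally", "executeJavascript",
  "showMask", "hideMask",
  "dispose",
] as const;

type RpcMethod = (typeof RPC_METHODS)[number];

// Assumed message shape and transport; the real protocol may differ.
type RpcMessage = { type: "rpc:call"; tabId: number; method: RpcMethod; args: unknown[] };
type Send = (message: RpcMessage) => Promise<unknown>;
type RemoteController = Record<RpcMethod, (...args: unknown[]) => Promise<unknown>>;

// Build an object with one async method per RPC name. Each call is turned
// into an `rpc:call` message addressed to `tabId` and handed to `send`.
function createRemoteController(tabId: number, send: Send): RemoteController {
  const remote = {} as RemoteController;
  for (const method of RPC_METHODS) {
    remote[method] = (...args: unknown[]) =>
      send({ type: "rpc:call", tabId, method, args });
  }
  return remote;
}
```

In the extension, `send` would presumably wrap `chrome.runtime.sendMessage`; injecting it keeps the proxy testable outside a browser.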
## Communication Flow
### Task Execution
```
1. User enters task in SidePanel
   └─> AgentController.execute(task)
2. AgentController creates agent instances
   ├─> new PageAgentCore()
   ├─> new TabsManager()
   └─> new RemotePageController()
3. Agent executes step loop:
   ├─> LLM generates next action
   ├─> RemotePageController.method() called
   │   └─> RPC message → SW → ContentScript
   ├─> ContentScript executes on real PageController
   │   └─> Response → SW → SidePanel
   ├─> Agent updates history
   └─> React UI re-renders via events
4. Task completes or user stops
   └─> Agent disposes, status changes
```
### Page Reload During Task
```
1. Page reloads/navigates
2. Content script initializes
3. Content script queries: shouldShowMask?
   └─> cs:query → SW → SidePanel
4. SidePanel checks: agentRunning && windowHasFocus && (browserActiveTab === agentCurrentTab)
   └─> query:response → SW → ContentScript
5. Content script shows/hides mask accordingly
```
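Steps 3 to 5 can be sketched from the content script's side as follows. The `Mask` interface, the `SendQuery` transport, and the function name are hypothetical; the callback style is used here only so the logic can be exercised without a browser.

```typescript
// Hypothetical mask handle owned by the content script.
interface Mask {
  show(): void;
  hide(): void;
}

// Hypothetical transport: delivers a cs:query and invokes `reply` with the answer.
type SendQuery = (
  message: { type: "cs:query"; query: "shouldShowMask" },
  reply: (answer: unknown) => void,
) => void;

// Ask the SidePanel whether the mask should be visible and apply the answer.
function syncMaskOnLoad(sendQuery: SendQuery, mask: Mask): void {
  sendQuery({ type: "cs:query", query: "shouldShowMask" }, (answer) => {
    // Anything other than an explicit `true` hides the mask: if the panel
    // is closed, no agent is running and nothing should be blocked.
    if (answer === true) {
      mask.show();
    } else {
      mask.hide();
    }
  });
}
```

Defaulting to "hide" means a missing or malformed reply can never leave a page permanently masked.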
## File Structure
```
packages/extension/src/
├── agent/
│   ├── RemotePageController.ts  # Proxy for PageController RPC
│   ├── TabsManager.ts           # Multi-tab management
│   └── tabTools.ts              # Agent tools for tab control
├── entrypoints/
│   ├── background.ts            # Stateless SW relay
│   ├── content.ts               # Content script with PageController
│   └── sidepanel/
│       ├── AgentController.ts   # Agent lifecycle management
│       ├── useAgent.ts          # React hook for agent
│       ├── App.tsx              # Main UI component
│       ├── components/
│       │   ├── ConfigPanel.tsx
│       │   ├── cards/
│       │   └── index.tsx
│       ├── index.html
│       └── main.tsx
├── messaging/
│   ├── protocol.ts              # Message type definitions
│   ├── rpc.ts                   # RPC client for SidePanel
│   └── index.ts
├── components/ui/               # shadcn components
├── lib/utils.ts
└── utils/constants.ts
```
## Design Decisions
### Why Agent in SidePanel?
MV3 Service Workers have strict lifecycle constraints:
- Terminate after ~30s of inactivity
- Cannot maintain long-running loops
- State is lost on termination
By hosting the agent in SidePanel (a visible frontend page), we get:
- Persistent execution while panel is open
- Natural disposal when panel closes
- No SW wake-up complexity
### Agent Isolation from UI
`AgentController` is a separate class from the React UI for:
- **Testability** - Can test agent logic without React
- **Portability** - Future: move agent to popup, options page, or external page
- **Clean separation** - UI concerns don't pollute agent logic
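The separation can be illustrated with a controller that exposes only a snapshot getter and a subscription function, so it has no dependency on React. This is a simplified sketch under assumed names (`AgentControllerSketch`, the status values, `start`/`stop`); the real `AgentController` manages far more than a status field.

```typescript
// Hypothetical status values.
type AgentStatus = "idle" | "running";

class AgentControllerSketch {
  private status: AgentStatus = "idle";
  private readonly listeners = new Set<() => void>();

  // Arrow properties keep `this` bound when passed around as callbacks.
  getStatus = (): AgentStatus => this.status;

  subscribe = (listener: () => void): (() => void) => {
    this.listeners.add(listener);
    // The returned function unsubscribes the listener.
    return () => {
      this.listeners.delete(listener);
    };
  };

  start(): void {
    this.setStatus("running");
  }

  stop(): void {
    this.setStatus("idle");
  }

  private setStatus(next: AgentStatus): void {
    if (next === this.status) return; // notify only on real changes
    this.status = next;
    this.listeners.forEach((listener) => listener());
  }
}
```

A `subscribe`/`getStatus` pair of this shape is what React's `useSyncExternalStore(subscribe, getSnapshot)` expects, so a hook such as `useAgent` could be a thin wrapper over it, while tests can drive the controller directly.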
### Simplified Messaging
The previous architecture needed complex retry and wake-up logic for the SW. In the new architecture:
- SW is stateless, so Chrome can terminate and restart it at any time without losing anything
- No ping/wake-up needed
- Simple request-response pattern
- Retry logic only for content script initialization
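The remaining retry case arises because a message sent to a tab fails when the content script has not finished loading (Chrome reports this as a missing receiving end). A generic retry helper might look like the sketch below; `withRetry`, its defaults, and the linear backoff are assumptions, not the extension's actual policy.

```typescript
// Hypothetical helper: calls `attempt` until it resolves or the budget is spent.
// `maxAttempts` is expected to be at least 1. `sleep` is injectable for tests.
async function withRetry<T>(
  attempt: () => Promise<T>,
  maxAttempts = 5,
  delayMs = 200,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (error) {
      lastError = error;
      // Linear backoff: wait longer after each failure, but not after the last.
      if (i < maxAttempts) await sleep(delayMs * i);
    }
  }
  throw lastError;
}
```

A caller would wrap only the first RPC to a freshly loaded tab, for example `withRetry(() => send(message))` with a hypothetical `send`, and leave all later calls as plain request-response.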
## Multi-Tab Control
### Tab Types
- **Initial Tab** - Where user started the task
- **Managed Tabs** - Tabs opened by agent via `open_new_tab`
### Tab Grouping
Agent-opened tabs are grouped in a Chrome tab group named `Task(<taskId>)`.
### Tab Switching
Only initial tab and managed tabs can be switched to. This prevents the agent from accessing unrelated tabs.
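The tab rules above amount to an allow-list check. The sketch below shows the idea with a hypothetical `ManagedTabsSketch` class; it is not the real `TabsManager`, and it leaves out the Chrome calls entirely.

```typescript
class ManagedTabsSketch {
  private readonly initialTabId: number;
  private readonly taskId: string;
  private readonly managed = new Set<number>();

  constructor(initialTabId: number, taskId: string) {
    this.initialTabId = initialTabId;
    this.taskId = taskId;
  }

  // Title of the tab group that holds agent-opened tabs.
  groupTitle(): string {
    return `Task(${this.taskId})`;
  }

  // Record a tab the agent opened (for example via `open_new_tab`).
  trackOpened(tabId: number): void {
    this.managed.add(tabId);
  }

  // Drop a tab that was closed.
  forget(tabId: number): void {
    this.managed.delete(tabId);
  }

  // Only the initial tab and agent-opened tabs may be switched to.
  canSwitchTo(tabId: number): boolean {
    return tabId === this.initialTabId || this.managed.has(tabId);
  }
}
```

In the extension, grouping itself would go through `chrome.tabs.group` and `chrome.tabGroups.update`, and `forget` would presumably be driven by the forwarded `removed` tab event.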
## Mask Management
The visual mask overlay blocks user interaction during automation. Mask visibility is centrally controlled by `AgentController` based on three conditions:
```
shouldMaskBeVisible = agentRunning && windowHasFocus && (browserActiveTab === agentCurrentTab)
```
### Key Concepts
- **browserActiveTab** - The tab currently visible to the user (tracked via `chrome.tabs.onActivated`)
- **agentCurrentTab** - The tab agent is operating on (`TabsManager.currentTabId`)
- **windowHasFocus** - Whether browser window has focus (tracked via `chrome.windows.onFocusChanged`)
### State Transitions
| Event | Action |
|-------|--------|
| Agent starts | Show mask if current tab is in foreground |
| Agent stops | Hide mask |
| User switches to agent's tab | Show mask |
| User switches away from agent's tab | Hide mask |
| Window loses focus | Hide mask |
| Window regains focus | Show mask if on agent's tab |
| Agent switches to different tab | Sync mask based on new state |
| Page reloads | Content script queries `shouldShowMask` |
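The formula and the transition table can be modelled as a small state reducer plus a visibility predicate. This is a sketch under assumed names: `MaskState`, `MaskEvent`, and `applyMaskEvent` are hypothetical, and only the three conditions and their variable names come from the text above. Page reloads are not modelled as an event because they are handled by the `shouldShowMask` query.

```typescript
interface MaskState {
  agentRunning: boolean;
  windowHasFocus: boolean;
  browserActiveTab: number | null;
  agentCurrentTab: number | null;
}

// Hypothetical event shapes, one per row of the transition table.
type MaskEvent =
  | { kind: "agentStarted"; tabId: number }
  | { kind: "agentStopped" }
  | { kind: "tabActivated"; tabId: number }
  | { kind: "windowFocusChanged"; focused: boolean }
  | { kind: "agentSwitchedTab"; tabId: number };

// The mask is visible only when all three conditions hold.
function shouldMaskBeVisible(state: MaskState): boolean {
  return (
    state.agentRunning &&
    state.windowHasFocus &&
    state.browserActiveTab !== null &&
    state.browserActiveTab === state.agentCurrentTab
  );
}

// Each event updates exactly one input; visibility is always re-derived
// from the state rather than toggled directly.
function applyMaskEvent(state: MaskState, event: MaskEvent): MaskState {
  switch (event.kind) {
    case "agentStarted":
      return { ...state, agentRunning: true, agentCurrentTab: event.tabId };
    case "agentStopped":
      return { ...state, agentRunning: false };
    case "tabActivated":
      return { ...state, browserActiveTab: event.tabId };
    case "windowFocusChanged":
      return { ...state, windowHasFocus: event.focused };
    case "agentSwitchedTab":
      return { ...state, agentCurrentTab: event.tabId };
  }
}
```

Deriving visibility from state keeps every row of the table consistent by construction. When wiring this to Chrome, note that `chrome.windows.onFocusChanged` reports `chrome.windows.WINDOW_ID_NONE` when all browser windows lose focus, which would map to `focused: false`.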
### Implementation
- `AgentController.syncMaskState()` - Syncs mask visibility based on current state
- `AgentController.shouldShowMaskForTab(tabId)` - Used by content script queries
- Background forwards `activated` and `windowFocusChanged` events to SidePanel
- `RemotePageController` does NOT auto-show mask on tab switch (controlled by AgentController)
## Configuration
LLM config (apiKey, baseURL, model) is stored in `chrome.storage.local`. This persists across sessions and is managed via the ConfigPanel.
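Because stored values can be missing or left over from an older version, it helps to validate what comes back from storage before use. The sketch below assumes a hypothetical storage key (`llmConfig`) and helper name; only the three field names come from the text above.

```typescript
interface LLMConfig {
  apiKey: string;
  baseURL: string;
  model: string;
}

// Hypothetical storage key; the extension may use a different one.
const CONFIG_KEY = "llmConfig";

// Validate the object returned by a storage read.
// Returns null when the config is absent or malformed.
function parseStoredConfig(items: Record<string, unknown>): LLMConfig | null {
  const raw = items[CONFIG_KEY];
  if (typeof raw !== "object" || raw === null) return null;
  const { apiKey, baseURL, model } = raw as Record<string, unknown>;
  if (
    typeof apiKey !== "string" ||
    typeof baseURL !== "string" ||
    typeof model !== "string"
  ) {
    return null;
  }
  return { apiKey, baseURL, model };
}
```

In the extension this would be fed by `chrome.storage.local.get(CONFIG_KEY)`, which resolves to an object keyed by the requested keys, and written back with `chrome.storage.local.set({ [CONFIG_KEY]: config })`.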
## Security
1. **API Key Storage** - Keys are kept in `chrome.storage.local`, which is scoped to the extension and cannot be read by web pages (the storage area itself is not encrypted)
2. **Content Script Isolation** - Runs in isolated world
3. **Tab Restriction** - Agent can only control tabs it opened or started from
4. **No Arbitrary Tab Access** - Cannot switch to unmanaged tabs
## Development
```bash
# Install dependencies
npm install
# Start development server
npm run dev
# Build for production
npm run build
# Package extension
npm run zip
```