Add Computer Use overview page #425
Merged
+229
−2
8 commits
d7b74a4 Add Computer Use overview page (AnnaXWang)
96e32c0 Apply suggestion from @AnnaXWang (AnnaXWang)
3780ea9 Apply suggestion from @AnnaXWang (AnnaXWang)
4f00c55 Replace em-dashes with colons, parentheses, and semicolons (AnnaXWang)
3393a5e Add cua-agent build-your-own-agent guide (AnnaXWang)
fe1b639 Apply suggestions from code review (AnnaXWang)
3cad613 Add per-provider cua-agent snippets (AnnaXWang)
35bf54b Remove em-dashes from OpenAI and Gemini intros (AnnaXWang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| --- | ||
| title: "Overview" | ||
| description: "Run computer use agents on Kernel cloud browsers" | ||
| --- | ||
|
|
||
| Computer use models are vision-language models (VLMs) that operate a browser the way a person does: they look at a screenshot, decide what to do next, and emit a concrete action (move the mouse, click, type, scroll, or drag). Kernel runs these agents on cloud browsers, so you don't install or maintain anything locally, and gives the model the low-level [Computer Controls API](/browsers/computer-controls) it needs to see the screen and act on it. | ||
|
|
||
| ## How computer use works on Kernel | ||
|
|
||
| Every computer use integration runs the same action-observation loop: | ||
|
|
||
| 1. **Capture** a screenshot of the current browser state with the [Computer Controls API](/browsers/computer-controls#take-screenshots). | ||
| 2. **Predict** the next action by sending that screenshot to your model. | ||
| 3. **Execute** the returned action (click, type, scroll, drag, or key press) through Computer Controls. | ||
| 4. **Repeat** until the task is complete. | ||
|
|
||
| Computer Controls emulates native keyboard and mouse input at the OS level (with human-like [Bézier curves](/browsers/computer-controls#move-the-mouse) by default) instead of driving the page over the Chrome DevTools Protocol (CDP). This keeps the loop close to real user input and reduces the automation signals that [bot detection](/browsers/bot-detection/overview) systems look for. | ||
|
|
||
| The loop works with any VLM that predicts actions from pixels. The models below are the ones we maintain ready-to-deploy templates and guides for. | ||
|
|
||
| ## Supported models | ||
|
|
||
| <CardGroup cols={2}> | ||
| <Card title="Anthropic" icon="robot" href="/integrations/computer-use/anthropic"> | ||
| Claude's computer use tool | ||
| </Card> | ||
| <Card title="Gemini" icon="google" href="/integrations/computer-use/gemini"> | ||
| Google's Gemini 2.5 Computer Use model | ||
| </Card> | ||
| <Card title="OpenAGI" icon="wand-magic-sparkles" href="/integrations/computer-use/openagi"> | ||
| OpenAGI's Lux model | ||
| </Card> | ||
| <Card title="OpenAI" icon="circle-nodes" href="/integrations/computer-use/openai"> | ||
| OpenAI's computer-using agent (CUA) | ||
| </Card> | ||
| <Card title="Tzafon" icon="bolt" href="/integrations/computer-use/tzafon"> | ||
| Tzafon's Northstar CUA Fast model | ||
| </Card> | ||
| <Card title="Yutori" icon="location-arrow" href="/integrations/computer-use/yutori"> | ||
| Yutori's Navigator n1.5 pixels-to-actions model | ||
| </Card> | ||
| </CardGroup> | ||
|
|
||
| Using a model that isn't listed here? Any VLM that predicts actions from pixels works; wire its predicted actions straight to the [Computer Controls API](/browsers/computer-controls) and run the same loop. | ||
|
|
||
| ## Get started | ||
|
|
||
| Each model page includes a one-command template so you can deploy a working agent in minutes. For example, to scaffold the Anthropic integration: | ||
|
|
||
| ```bash | ||
| kernel create --name my-computer-use-app --template computer-use | ||
| ``` | ||
|
|
||
| Pick a model above to get its template, then follow the [deploy](/apps/deploy) and [invoke](/apps/invoke) guides to run your agent on Kernel. | ||
|
|
||
| ## Build your own agent | ||
|
|
||
| If you want full control over the loop, [`@onkernel/cua-agent`](https://github.com/kernel/cua/tree/main/packages/agent) is a TypeScript library that runs the loop against a Kernel browser for you. You point it at a model, give it a task, and it handles the screenshots, actions, and follow-up turns. | ||
|
|
||
| ```bash | ||
| npm install @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk | ||
| ``` | ||
|
|
||
| ```ts | ||
| import Kernel from "@onkernel/sdk"; | ||
| import { CuaAgent } from "@onkernel/cua-agent"; | ||
|
|
||
| const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); | ||
| const browser = await client.browsers.create({ stealth: true }); | ||
|
|
||
| const agent = new CuaAgent({ | ||
| browser, | ||
| client, | ||
| initialState: { | ||
| model: "anthropic:claude-opus-4-7", // swap to target another provider | ||
| systemPrompt: "You are a careful browser automation agent.", | ||
| }, | ||
| }); | ||
|
|
||
| await agent.prompt("Open news.ycombinator.com and summarize the top story."); | ||
| ``` | ||
|
|
||
| Switch providers by changing the `model` ref: | ||
|
|
||
| | Provider | Model ref | | ||
| | --- | --- | | ||
| | Anthropic | `anthropic:claude-opus-4-7` | | ||
| | OpenAI | `openai:gpt-5.5` | | ||
| | Gemini | `google:gemini-3-flash-preview` | | ||
| | Tzafon | `tzafon:tzafon.northstar-cua-fast` | | ||
| | Yutori | `yutori:n1.5-latest` | | ||
|
|
||
| Set the matching provider key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`, `TZAFON_API_KEY`, or `YUTORI_API_KEY`) alongside `KERNEL_API_KEY`. | ||
|
|
||
| ## Benefits of using Kernel for computer use | ||
|
|
||
| - **No local browser management**: Run computer use automations without installing or maintaining browsers locally | ||
| - **Scalability**: Launch multiple browser sessions in parallel for concurrent AI agents | ||
| - **Stealth mode**: Built-in anti-detection features for reliable web interactions | ||
| - **Session state**: Maintain browser state across runs via [Profiles](/auth/profiles) | ||
| - **Live view**: Debug your agents with real-time browser viewing | ||
| - **Cloud infrastructure**: Run computationally intensive AI agents without local resource constraints | ||
|
|
||
| ## Next steps | ||
|
|
||
| - Read the [Computer Controls API](/browsers/computer-controls) reference for the full set of mouse, keyboard, and screenshot actions | ||
| - Check out [live view](/browsers/live-view) for debugging your automations | ||
| - Learn about [stealth mode](/browsers/bot-detection/stealth) for avoiding detection | ||
| - Learn how to properly [terminate browser sessions](/browsers/termination) | ||
| - Learn how to [deploy](/apps/deploy) your computer use app to Kernel | ||
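The capture, predict, execute, repeat loop that the page describes can be sketched in a few lines. This is a minimal illustration, not the Kernel SDK or `@onkernel/cua-agent` implementation: the `Computer` and `Model` interfaces, the `Action` shape, the `done` signal, and the `maxSteps` cap are all hypothetical names introduced here.

```ts
// Hypothetical types for illustration only; they do not mirror the Kernel SDK.
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "scroll"; dx: number; dy: number }
  | { kind: "done" }; // assumed way for the model to signal completion

interface Computer {
  screenshot(): Promise<Uint8Array>;
  execute(action: Action): Promise<void>;
}

interface Model {
  predict(task: string, screenshot: Uint8Array): Promise<Action>;
}

// Runs the action-observation loop and returns the actions that were executed.
async function runLoop(
  computer: Computer,
  model: Model,
  task: string,
  maxSteps = 20, // assumed safety cap so a confused model cannot loop forever
): Promise<Action[]> {
  const executed: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await computer.screenshot();       // 1. capture
    const action = await model.predict(task, shot); // 2. predict
    if (action.kind === "done") break;              // 4. stop once the task is complete
    await computer.execute(action);                 // 3. execute
    executed.push(action);
  }
  return executed;
}
```

Keeping the browser and the model behind small interfaces means the same loop can be exercised with fakes in a unit test before it is pointed at a real browser session.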
Profiles linked instead of Managed Auth
Low Severity
The new Session state benefit points readers to Profiles for maintaining browser state across runs. Integration benefits that cover session persistence or authenticated browsing are expected to highlight Managed Auth (e.g. `/auth/overview`) as the primary path, not Profiles alone.
Triggered by learned rule: Integration pages should highlight Managed Auth for authenticated browsing
Reviewed by Cursor Bugbot for commit fe1b639.
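On the provider table in the diff: each model ref has the form `provider:model`, and the page asks for a matching provider key alongside `KERNEL_API_KEY`. A small preflight check along these lines could catch a missing key before a browser is created. The helper names are hypothetical and not part of `@onkernel/cua-agent`, and the prefix-to-key pairing assumes the keys are listed in the same order as the table rows.

```ts
// Prefixes and env var names come from the diff; the pairing is an assumption.
const PROVIDER_ENV_KEYS: Record<string, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  google: "GOOGLE_API_KEY",
  tzafon: "TZAFON_API_KEY",
  yutori: "YUTORI_API_KEY",
};

// Returns the env vars a given "provider:model" ref needs (hypothetical helper).
function requiredEnvKeys(modelRef: string): string[] {
  const sep = modelRef.indexOf(":"); // split on the first colon only
  if (sep <= 0) throw new Error(`Expected "provider:model", got "${modelRef}"`);
  const provider = modelRef.slice(0, sep);
  const key = PROVIDER_ENV_KEYS[provider];
  if (!key) throw new Error(`Unknown provider "${provider}"`);
  return ["KERNEL_API_KEY", key];
}

// Lists the required env vars that are unset or empty in `env`.
function missingEnvKeys(
  modelRef: string,
  env: Record<string, string | undefined>,
): string[] {
  return requiredEnvKeys(modelRef).filter((name) => !env[name]);
}
```

In a Node script this would typically be called as `missingEnvKeys(modelRef, process.env)` before constructing the agent.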