Teaching AI to See: Building a Computer Use MCP Server
In my last post, I set up a dedicated Linux Mint workstation for Claude Code. It had all the dev tools, browser automation via Playwright, and cluster access. But it was still blind — it could control a browser tab, but couldn’t see the actual desktop.
Today I fixed that. CC can now screenshot the desktop, analyze what it sees, click on things, type text, and interact with any GUI application. Then it proved it by sending a Slack message.
The Problem
Claude Code in the terminal is powerful but limited. It can read files, run commands, and automate browser tabs through Playwright. But ask it to “bring up Slack” and it’s stuck — Slack is a desktop application, not a web page in a controlled browser session.
What I wanted was simple: give CC eyes (screenshots) and hands (mouse + keyboard). The same Computer Use capability that Anthropic has built into their API, but wired up as an MCP server so CC can use it naturally from the terminal.
The Architecture
The solution is a custom Python MCP server that bridges three things:
```
CC (Claude Code in terminal)
    ↓ MCP protocol
Computer Use MCP Server (Python)
    ↓ subprocess calls
Desktop Tools (scrot + xdotool + wmctrl)
    ↓ X11
Real Desktop (Slack, Telegram, Chrome, etc.)
```
The server exposes three tools:
| Tool | What It Does |
|---|---|
| `screenshot` | Captures the desktop via scrot, scales from 4K to 1280x720, returns the image |
| `computer_action` | Executes a single mouse/keyboard action via xdotool |
| `run_task` | Full autonomous agent loop — give it a goal, it drives the desktop until done |
The Code
The full server is about 360 lines of Python. Here are the interesting parts.
Screenshot with Scaling
The desktop runs at 3840x2160 (4K), but sending full-resolution screenshots to the API would burn tokens fast. Each screenshot gets scaled to 1280x720:
```python
import base64
import io
import os
import subprocess

from PIL import Image

DISPLAY = ":0"  # X11 display the desktop session runs on

SCREEN_WIDTH = 3840
SCREEN_HEIGHT = 2160
SCALE_WIDTH = 1280
SCALE_HEIGHT = 720

def take_screenshot() -> str:
    """Capture screen, scale down, return base64 PNG."""
    tmp = "/tmp/cu_screenshot.png"
    subprocess.run(
        ["scrot", "-o", tmp],
        env={**os.environ, "DISPLAY": DISPLAY},
        check=True, timeout=5,
    )
    img = Image.open(tmp)
    img = img.resize((SCALE_WIDTH, SCALE_HEIGHT), Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="PNG", optimize=True)
    return base64.standard_b64encode(buf.getvalue()).decode()
```
At 1280x720, each screenshot is about 350KB — roughly 1,500 tokens. A 20-step task costs about 30K tokens in images alone. Manageable.
Coordinate Scaling
When CC says “click at (335, 493)” in 1280x720 space, we need to scale that back to real 4K coordinates:
```python
def scale_coordinates(x: int, y: int) -> tuple[int, int]:
    real_x = int(x * SCREEN_WIDTH / SCALE_WIDTH)
    real_y = int(y * SCREEN_HEIGHT / SCALE_HEIGHT)
    return real_x, real_y
```
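The scale factor is exactly 3 in both dimensions (3840/1280 and 2160/720), so that example click at (335, 493) lands at (1005, 1479) on the real screen. A quick self-contained sanity check:

```python
SCREEN_WIDTH, SCREEN_HEIGHT = 3840, 2160
SCALE_WIDTH, SCALE_HEIGHT = 1280, 720

def scale_coordinates(x: int, y: int) -> tuple[int, int]:
    # Map a point in the 1280x720 screenshot back to 4K desktop pixels.
    return int(x * SCREEN_WIDTH / SCALE_WIDTH), int(y * SCREEN_HEIGHT / SCALE_HEIGHT)

assert scale_coordinates(335, 493) == (1005, 1479)
```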
Action Execution
Every action goes through xdotool. Clicks, typing, scrolling, dragging — all mapped to xdotool commands:
```python
def execute_action(action: dict) -> str:
    env = {**os.environ, "DISPLAY": DISPLAY}
    action_type = action.get("action")
    if action_type == "left_click":
        x, y = scale_coordinates(*action["coordinate"])
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], env=env)
        subprocess.run(["xdotool", "click", "1"], env=env)
        return f"clicked at ({x}, {y})"
    elif action_type == "type":
        text = action.get("text", "")
        subprocess.run(["xdotool", "type", "--clearmodifiers", "--", text], env=env)
        return f"typed: {text[:50]}..."
    elif action_type == "key":
        key = action.get("key", "")
        subprocess.run(["xdotool", "key", "--clearmodifiers", key], env=env)
        return f"pressed key: {key}"
    # ... scroll, drag, wait, etc.
```
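The elided branches follow the same mapping. As one example, a scroll action can be translated into X11 wheel events (buttons 4 and 5). This sketch builds the xdotool commands without running them — my reading of the `scroll` action's `coordinate`/`scroll_direction`/`scroll_amount` fields, not the server's exact code:

```python
SCREEN_WIDTH, SCREEN_HEIGHT = 3840, 2160
SCALE_WIDTH, SCALE_HEIGHT = 1280, 720

def scroll_commands(action: dict) -> list[list[str]]:
    """Build (but don't execute) xdotool commands for a scroll action."""
    x = int(action["coordinate"][0] * SCREEN_WIDTH / SCALE_WIDTH)
    y = int(action["coordinate"][1] * SCREEN_HEIGHT / SCALE_HEIGHT)
    # X11 convention: wheel-up is button 4, wheel-down is button 5.
    button = "4" if action.get("scroll_direction") == "up" else "5"
    cmds = [["xdotool", "mousemove", str(x), str(y)]]
    cmds += [["xdotool", "click", button]] * int(action.get("scroll_amount", 3))
    return cmds
```

Each command list would then be passed to `subprocess.run` with the `DISPLAY` environment set, exactly like the branches above.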
The Agent Loop
The run_task tool is where it gets interesting. It’s a full autonomous loop powered by Bedrock:
- Take a screenshot
- Send it to Claude’s vision model with the task description
- Claude responds with actions (click here, type this, scroll down)
- Execute each action, take a new screenshot
- Send the result back
- Repeat until Claude says it’s done
```python
import time

from anthropic import AnthropicBedrock

def run_agent_loop(task: str, max_steps: int) -> str:
    client = AnthropicBedrock(aws_region="us-east-1")
    img_b64 = take_screenshot()
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Task: {task}"},
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": img_b64,
            }},
        ],
    }]
    tools = [{
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": SCALE_WIDTH,
        "display_height_px": SCALE_HEIGHT,
        "display_number": 0,
    }]
    for step in range(max_steps):
        response = client.beta.messages.create(
            model="us.anthropic.claude-sonnet-4-6-v1:0",
            max_tokens=2048,
            tools=tools,
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )
        if response.stop_reason == "end_turn":
            break
        for block in response.content:
            if block.type == "tool_use":
                execute_action(block.input)
                time.sleep(0.5)
                img_b64 = take_screenshot()
        # ... append result and continue loop
```
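The elided bookkeeping is the standard tool-use handshake: append Claude's assistant turn to the history, then answer each `tool_use` block with a `tool_result` carrying the fresh screenshot. A sketch of how that user turn might be built (field names follow the Messages API; this is not the server's exact code):

```python
def build_tool_results(tool_use_ids: list[str], img_b64: str) -> dict:
    """One user turn answering each tool_use block with the new screenshot."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tu_id,
                "content": [{
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_b64,
                    },
                }],
            }
            for tu_id in tool_use_ids
        ],
    }
```

In the loop, `messages.append({"role": "assistant", "content": response.content})` followed by appending this user turn closes the cycle before the next `create` call.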
The key detail: this runs through Bedrock, not the direct Anthropic API. Same Computer Use capability, same vision model, just routed through AWS. No Anthropic Pro subscription needed.
MCP Registration
Registering it with CC is one command:
```bash
claude mcp add computer-use -s user -- \
  /home/dryden/computer-use/venv/bin/python3 \
  /home/dryden/computer-use/server.py
```
After a restart, CC has three new tools. No configuration, no API keys to manage — it picks up AWS credentials from the environment.
The First Test
I asked CC to take a screenshot. It worked immediately:
“I can see the desktop! Here’s what’s on screen: Google Chrome with multiple tabs, Telegram Desktop, and the terminal running Claude Code.”
It could identify every window, read text on screen, and describe the layout. That alone is useful — CC can now visually verify deployments, check dashboards, and see what applications are doing.
The Slack Message
Then I asked CC to bring up Slack. First it tried clicking taskbar icons — miss, miss, miss. Tiny icons at 4K resolution scaled to 720p are hard to hit. Then it got smart and used wmctrl:
```bash
wmctrl -l               # list all windows
wmctrl -ia 0x05600004   # activate Slack by window ID
```
Slack appeared. CC could see the workspace, channels, DMs, everything. I asked it to send a message to my coworker Robert:
“Hey Robert! This is Claude Code (CC) — Damon’s AI assistant. I’m messaging you from my own dedicated Linux Mint workstation that we just set up today. I can now see and interact with the desktop, including Slack. Pretty cool, right? 🤖”
It clicked the message input field, typed the message, and hit Enter. Message sent. First autonomous Slack message from an AI running on a homelab workstation.
Lessons Learned
Coordinate scaling matters. At 4K resolution, a 1-pixel miss at 720p means a 3-pixel miss on the real screen. Taskbar icons are especially finicky. Using wmctrl to raise windows by ID is far more reliable than clicking.
Screenshots are cheap enough. At ~1,500 tokens per screenshot, the cost is reasonable for interactive tasks. A 10-step task with screenshots at every step is about 15K image tokens plus the text — maybe $0.03 on Bedrock.
MCP is the right abstraction. By wrapping this as an MCP server, CC doesn’t need any special integration. It just calls screenshot or computer_action like any other tool. The vision intelligence comes from the model itself.
The real desktop beats virtual displays. I considered running this in Docker with Xvfb (virtual display), but using the real desktop means CC can interact with apps that are already running — Slack signed in, Telegram connected, Chrome with saved sessions. No setup duplication.
What’s Next
The run_task tool — the fully autonomous agent loop — is built but untested in production. That’s the endgame: tell CC “check Grafana for anomalies” or “file this document in Paperless” and it handles the entire multi-step workflow visually.
For now, the screenshot + action combo is already transforming the workflow. CC went from a powerful but blind terminal tool to something that can see and interact with anything on screen. It’s not quite “a human sitting at the computer,” but it’s getting there.
The full server code is in the homelab if you want to build your own. All you need is a Linux desktop, Python, scrot, xdotool, and a Bedrock-enabled AWS account.