Mobile Computer-Using Agent

droid-cua

Minimal AI agent that controls Android devices using OpenAI’s computer-use-preview model.


🚀 How It Works

  1. Connects to a running Android emulator.

  2. Captures full-screen device screenshots.

  3. Scales down the screenshots for OpenAI model compatibility.

  4. Sends screenshots and user instructions to OpenAI’s computer-use-preview model.

  5. Receives structured actions (click, scroll, type, keypress, wait, drag).

  6. Rescales model outputs back to real device coordinates.

  7. Executes the actions on the device.

  8. Repeats until you type exit.
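
Steps 3 and 6 amount to one scale factor applied in both directions. The sketch below shows one way to do this; the function names and the 1024-pixel cap are assumptions for illustration, not values taken from droid-cua.

```javascript
// Assumed cap on the longer screenshot side; droid-cua may use a different limit.
const MAX_SIDE = 1024;

// Factor that shrinks the longer side of the screen to maxSide (never upscales).
function computeScale(deviceWidth, deviceHeight, maxSide = MAX_SIDE) {
  const longest = Math.max(deviceWidth, deviceHeight);
  return longest <= maxSide ? 1 : maxSide / longest;
}

// Size of the downscaled screenshot that would be sent to the model.
function scaledSize(deviceWidth, deviceHeight, scale) {
  return {
    width: Math.round(deviceWidth * scale),
    height: Math.round(deviceHeight * scale),
  };
}

// Map a point from the downscaled screenshot back to device pixels,
// clamped so the result always lies on the screen.
function toDeviceCoords(point, scale, deviceWidth, deviceHeight) {
  const x = Math.round(point.x / scale);
  const y = Math.round(point.y / scale);
  return {
    x: Math.min(Math.max(x, 0), deviceWidth - 1),
    y: Math.min(Math.max(y, 0), deviceHeight - 1),
  };
}
```

For a 1080×2400 screen, `computeScale(1080, 2400)` gives a factor of 1024/2400, so a model click at (230, 512) would land near the centre of the device screen.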

Demo Video


🛠 Setup

First, clone this project - link

  1. Install dependencies:

  2. Create a .env file with your OpenAI API key:

  3. Make sure Android Debug Bridge (ADB) is available in your system PATH:

  4. Start your Android emulator manually (optional):

  5. Run the agent:

    If no --avd is provided, the agent will try to connect to the first running device.
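
Picking "the first running device" could be done by reading the text that `adb devices` prints. The sketch below is one possible approach; the function name is hypothetical and droid-cua may select devices differently.

```javascript
// Return the serial of the first device whose state is "device" (ready),
// given the raw text printed by `adb devices`. Returns null if none is ready.
function firstRunningDevice(adbDevicesOutput) {
  for (const line of adbDevicesOutput.split(/\r?\n/)) {
    const [serial, state] = line.trim().split(/\s+/);
    // Header and daemon-startup lines never have "device" as their second token;
    // "offline" and "unauthorized" entries are skipped as not ready.
    if (serial && state === "device") return serial;
  }
  return null;
}
```

The output string would typically come from running `adb devices` with Node's `child_process.execFile`.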


🧠 Features

  • Captures screenshots directly from the device (adb exec-out screencap -p).

  • Dynamically scales screenshots for OpenAI compatibility.

  • Maps model-generated actions (click, scroll, drag, type, keypress, wait) back to real device coordinates.

  • Connects automatically to a running emulator or launches it if needed.

  • Pretends the device screen is embedded inside a browser page for environment compatibility.

  • Coming next: Assertions and test reports.
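
As an illustration of the action mapping, the sketch below turns one model action (already rescaled to device coordinates) into argument lists for `adb shell input`. The action field names reflect my reading of the computer-use tool output, and the key map, swipe durations, and function name are assumptions rather than droid-cua's actual code.

```javascript
// Assumed mapping from model key names to Android key codes (illustrative only).
const KEYCODES = { ENTER: "KEYCODE_ENTER", BACKSPACE: "KEYCODE_DEL", ESC: "KEYCODE_BACK" };

// Translate one action into zero or more `adb` argument arrays.
function actionToAdbArgs(action) {
  const swipe = (x1, y1, x2, y2, ms) =>
    ["shell", "input", "swipe", String(x1), String(y1), String(x2), String(y2), String(ms)];

  switch (action.type) {
    case "click":
      return [["shell", "input", "tap", String(action.x), String(action.y)]];
    case "scroll":
      // Scrolling content down means swiping the finger up, so subtract the deltas.
      return [swipe(action.x, action.y,
                    action.x - (action.scroll_x || 0),
                    action.y - (action.scroll_y || 0), 300)];
    case "drag": {
      // Only the first and last points of the path are used in this sketch.
      const start = action.path[0];
      const end = action.path[action.path.length - 1];
      return [swipe(start.x, start.y, end.x, end.y, 500)];
    }
    case "type":
      // `input text` treats %s as a space; other shell metacharacters are not escaped here.
      return [["shell", "input", "text", action.text.replace(/ /g, "%s")]];
    case "keypress":
      return action.keys.map((key) => {
        const code = KEYCODES[key.toUpperCase()];
        if (!code) throw new Error(`No key code mapped for: ${key}`);
        return ["shell", "input", "keyevent", code];
      });
    case "wait":
      return []; // Nothing to send; the caller would simply pause.
    default:
      throw new Error(`Unsupported action type: ${action.type}`);
  }
}
```

Each returned array could be passed to `child_process.execFile("adb", args)`. Returning a list keeps multi-key presses and no-op waits uniform for the caller.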


📄 Command Line Flags

| Flag | Description |
| --- | --- |
| --avd=AVD_NAME | Select the emulator device by AVD name. |
| --instructions=FILENAME | Load user instructions from a text file. |
| --record | Save every screenshot into a folder for later review or video creation. |
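
A minimal sketch of how these three flags might be parsed from `process.argv.slice(2)`. The function name and the choice to reject unknown flags are assumptions; the project's own parsing may differ.

```javascript
// Parse the documented flags into a plain object.
function parseFlags(argv) {
  const flags = { avd: null, instructions: null, record: false };
  for (const arg of argv) {
    if (arg === "--record") {
      flags.record = true;
    } else if (arg.startsWith("--avd=")) {
      flags.avd = arg.slice("--avd=".length);
    } else if (arg.startsWith("--instructions=")) {
      flags.instructions = arg.slice("--instructions=".length);
    } else {
      // Assumption: fail loudly on typos instead of silently ignoring them.
      throw new Error(`Unknown flag: ${arg}`);
    }
  }
  return flags;
}
```

For example, `parseFlags(["--avd=Pixel_7", "--record"])` (with a made-up AVD name) yields an object with `avd` set to `"Pixel_7"`, `instructions` left as `null`, and `record` set to `true`.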


📋 Example Usage

Start your emulator:

Run the agent:

Run with an instructions file:

Example instructions file (example.txt):


📦 Requirements

  • Node.js 18 or higher

  • A running Android emulator (AVD)

  • Android Debug Bridge (ADB) installed and available in system PATH

  • OpenAI Tier 3 access for the computer-use-preview model

[!NOTE] Your OpenAI account must be Tier 3 to access the computer-use-preview model. Learn more: OpenAI Computer Use Preview


📁 Project Structure

| File | Responsibility |
| --- | --- |
| index.js | Manages user input, OpenAI conversation, and main loop. |
| device.js | ADB device connection, screenshot capture, screen size management. |
| actions.js | Executes model actions on the device (tap, swipe, drag, type, keypress). |
| openai.js | Sends requests to OpenAI and manages API responses. |


🎞️ Convert Screenshots to Video (Optional)

If you run the agent with the --record flag, it saves all screenshots to a folder like:

You can convert the frames into a video using ffmpeg:
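
If you prefer to drive ffmpeg from Node, the sketch below builds one plausible argument list. It assumes the frames are PNG files whose names sort in capture order; the directory name, output name, frame rate, and function name are placeholders, not values used by droid-cua.

```javascript
// Build ffmpeg arguments that stitch PNG frames in a folder into an H.264 video.
function buildFfmpegArgs(framesDir, outputFile, fps = 2) {
  return [
    "-framerate", String(fps),            // input frames per second
    "-pattern_type", "glob",              // let ffmpeg expand the wildcard itself
    "-i", `${framesDir}/*.png`,
    "-vf", "pad=ceil(iw/2)*2:ceil(ih/2)*2", // libx264 needs even width and height
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",                // widely playable pixel format
    outputFile,
  ];
}
```

The result could be passed to `child_process.execFile("ffmpeg", args)`. Note that glob patterns are not supported by ffmpeg builds on Windows, where a numbered pattern such as `%04d.png` would be needed instead.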
