<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>CostOptimization on Yarang's Tech Lair</title><link>https://blog.fcoinfup.com/tags/costoptimization/</link><description>Recent content in CostOptimization on Yarang's Tech Lair</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 06 May 2026 09:00:48 +0900</lastBuildDate><atom:link href="https://blog.fcoinfup.com/tags/costoptimization/index.xml" rel="self" type="application/rss+xml"/><item><title>Computer Use API vs Structured Output: Cost-Effective LLM Implementation Strategies</title><link>https://blog.fcoinfup.com/post/computer-use-api-vs-structured-output-cost-effective-llm-implementation-strategies/</link><pubDate>Wed, 06 May 2026 09:00:48 +0900</pubDate><guid>https://blog.fcoinfup.com/post/computer-use-api-vs-structured-output-cost-effective-llm-implementation-strategies/</guid><description>&lt;h1 id="computer-use-api-vs-structured-output-cost-effective-llm-implementation-strategies"&gt;Computer Use API vs Structured Output: Cost-Effective LLM Implementation Strategies
&lt;/h1&gt;&lt;p&gt;Recently, I came across an interesting article on Hacker News. It was titled &lt;strong&gt;[Computer Use is 45x more expensive than structured APIs]&lt;/strong&gt;. Anthropic&amp;rsquo;s latest feature, &amp;lsquo;Computer Use&amp;rsquo;, allows AI to see the computer screen and manipulate the mouse and keyboard to perform tasks on behalf of the user. It&amp;rsquo;s quite fascinating, much like an AI inputting combos in the game Tekken for a player.&lt;/p&gt;
&lt;p&gt;However, the article&amp;rsquo;s analysis found that running tasks this way costs a staggering &lt;strong&gt;45 times more&lt;/strong&gt; than using traditional &lt;strong&gt;Structured Output (like JSON mode)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we&amp;rsquo;ll analyze why such a gap exists and how we can wisely address this cost issue in our development of &lt;strong&gt;Multi-Agent Systems (e.g., ZeroClaw)&lt;/strong&gt;, complete with practical code examples.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="1-analyzing-the-cause-of-the-cost-gap"&gt;1. Analyzing the Cause of the Cost Gap
&lt;/h2&gt;&lt;h3 id="computer-use-gui-based-approach"&gt;Computer Use (GUI-Based Approach)
&lt;/h3&gt;&lt;p&gt;&amp;lsquo;Computer Use&amp;rsquo; works much like &lt;strong&gt;VNC- or RDP-style remote control&lt;/strong&gt;. On each turn, the AI must perform the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Screen Capture:&lt;/strong&gt; Capture a high-resolution screenshot and send it to the model. (This is where token costs surge.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual Processing:&lt;/strong&gt; Run a large-scale Vision model to understand the image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coordinate Calculation:&lt;/strong&gt; Calculate the button&amp;rsquo;s position in pixels.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action Execution:&lt;/strong&gt; Send mouse clicks/keyboard inputs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Instead of exchanging short text messages, this loop spends &amp;lsquo;visual tokens&amp;rsquo; on a fresh screenshot every turn, so the total grows quickly over a multi-step task.&lt;/p&gt;
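&lt;p&gt;To get a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes the rule of thumb that one image costs about width * height / 750 tokens, and that every earlier screenshot stays in the conversation history and is sent again on each turn. The function names and the 20-turn figure are illustrative assumptions, not measurements.&lt;/p&gt;

```python
def image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one image (assumed rule of thumb: pixels / 750)."""
    return round(width * height / 750)


def cumulative_image_tokens(turns: int, width: int = 1024, height: int = 768) -> int:
    """Total image tokens sent as input when every past screenshot is re-sent.

    On turn k the request carries k screenshots, so the total is
    per_image * (1 + 2 + ... + turns).
    """
    per_image = image_tokens(width, height)
    return per_image * turns * (turns + 1) // 2


if __name__ == "__main__":
    print(image_tokens(1024, 768))      # one 1024x768 screenshot
    print(cumulative_image_tokens(20))  # a hypothetical 20-turn task
```

&lt;p&gt;Under these assumptions the total grows with the square of the number of turns, which is why pruning old screenshots from the history matters as much as avoiding them in the first place.&lt;/p&gt;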
&lt;h3 id="structured-output-api-based-approach"&gt;Structured Output (API-Based Approach)
&lt;/h3&gt;&lt;p&gt;On the other hand, the traditional approach we configure through blog API servers or MCP (Model Context Protocol) is far more efficient.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Text Input:&lt;/strong&gt; System status or user intent is conveyed as text.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logical Reasoning:&lt;/strong&gt; The LLM parses the text and makes decisions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Direct Invocation:&lt;/strong&gt; Functions are directly executed via &lt;code&gt;tool_use&lt;/code&gt; blocks. (No image processing required)&lt;/li&gt;
&lt;/ol&gt;
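&lt;p&gt;The direct-invocation step can be sketched as a small dispatcher. The block shape (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;) mirrors a &lt;code&gt;tool_use&lt;/code&gt; content block; the registry and the &lt;code&gt;create_blog_post&lt;/code&gt; handler are hypothetical stand-ins, and no network call is made.&lt;/p&gt;

```python
from typing import Any, Callable, Optional

# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY: dict[str, Callable[..., dict[str, Any]]] = {}


def register(name: str):
    """Decorator that adds a function to the registry under a tool name."""
    def wrap(func):
        TOOL_REGISTRY[name] = func
        return func
    return wrap


@register("create_blog_post")
def create_blog_post(title: str, content: str,
                     tags: Optional[list[str]] = None) -> dict[str, Any]:
    # Stand-in for a real HTTP call to a blog API.
    return {"status": "success", "title": title, "tags": tags or []}


def dispatch(block: dict[str, Any]) -> dict[str, Any]:
    """Run the function named in a tool_use-style block and return its result."""
    if block.get("type") != "tool_use":
        raise ValueError("not a tool_use block")
    handler = TOOL_REGISTRY.get(block["name"])
    if handler is None:
        return {"status": "error", "reason": f"unknown tool: {block['name']}"}
    return handler(**block["input"])


if __name__ == "__main__":
    print(dispatch({
        "type": "tool_use",
        "id": "toolu_demo",  # placeholder id
        "name": "create_blog_post",
        "input": {"title": "Hello", "content": "Body"},
    }))
```

&lt;p&gt;Everything the model needs to send is a tool name and a small JSON object, so no image ever enters the request.&lt;/p&gt;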
&lt;hr&gt;
&lt;h2 id="2-practical-solution-hybrid-architecture"&gt;2. Practical Solution: Hybrid Architecture
&lt;/h2&gt;&lt;p&gt;It&amp;rsquo;s wasteful to handle every task using Computer Use. We need to apply the &lt;strong&gt;&amp;lsquo;Principle of Tool Separation&amp;rsquo;&lt;/strong&gt; learned from projects like &lt;strong&gt;ZeroClaw&lt;/strong&gt; or &lt;strong&gt;MCP Blog Automation&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="strategy-tool-usage-priority"&gt;Strategy: Tool Usage Priority
&lt;/h3&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Priority 1: Native API (Structured Output)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Tasks with clear logic, such as database lookups, API calls, and file creation, should always be handled by function calls.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Priority 2: Browser Automation (Playwright/Selenium)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;For complex DOM manipulation where no backend API exists. (Parsing an HTML tree is cheaper than image processing)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Last Resort: Computer Use (Vision)&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Reserve this for cases where only the rendered screen is available and DOM access is impossible, such as legacy desktop software or video editing programs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
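&lt;p&gt;The priority order above can be written down as a small routing function. This is a minimal sketch: the &lt;code&gt;Tier&lt;/code&gt; enum and the &lt;code&gt;TaskProfile&lt;/code&gt; fields are illustrative names, not part of any library.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    NATIVE_API = 1          # structured output / function call
    BROWSER_AUTOMATION = 2  # DOM access via Playwright or Selenium
    COMPUTER_USE = 3        # screenshots plus mouse and keyboard


@dataclass(frozen=True)
class TaskProfile:
    has_backend_api: bool  # a function or HTTP endpoint covers the task
    has_dom_access: bool   # the target is a web page we can script


def choose_tier(task: TaskProfile) -> Tier:
    """Return the cheapest tier that can handle the task."""
    if task.has_backend_api:
        return Tier.NATIVE_API
    if task.has_dom_access:
        return Tier.BROWSER_AUTOMATION
    return Tier.COMPUTER_USE


if __name__ == "__main__":
    # A legacy desktop app: no API and no DOM, so vision is the only option.
    print(choose_tier(TaskProfile(has_backend_api=False, has_dom_access=False)))
```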
&lt;hr&gt;
&lt;h2 id="3-code-example-implementing-a-cost-optimized-agent"&gt;3. Code Example: Implementing a Cost-Optimized Agent
&lt;/h2&gt;&lt;p&gt;Let&amp;rsquo;s create a Python example that allows an LLM to selectively use API calls (Structured) and browser control (Browser). Since Computer Use is still tied to specific cloud environments, the code simulates the LLM and tool calls instead of invoking a live service, with the browser tool standing in for an HTML-based approach such as &lt;strong&gt;Playwright&lt;/strong&gt; alongside plain &lt;strong&gt;API calls&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="scenario-automating-blog-post-publication"&gt;Scenario: Automating Blog Post Publication
&lt;/h3&gt;&lt;p&gt;Let&amp;rsquo;s assume we ask an LLM agent to &amp;ldquo;Summarize the latest tech news and publish it to my blog.&amp;rdquo;&lt;/p&gt;
&lt;h4 id="structured-approach-structured-output--api"&gt;Structured Approach (Structured Output + API)
&lt;/h4&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; json
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; typing &lt;span style="color:#f92672"&gt;import&lt;/span&gt; Literal
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 1. Tool Definitions (API Approach)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;tools &lt;span style="color:#f92672"&gt;=&lt;/span&gt; [
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;function&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;function&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;name&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;create_blog_post&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;description&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Publishes a new post to the blog. (Cheapest and fastest)&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;parameters&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;object&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;properties&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;title&amp;#34;&lt;/span&gt;: {&lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;},
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;: {&lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;},
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;tags&amp;#34;&lt;/span&gt;: {&lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;array&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;items&amp;#34;&lt;/span&gt;: {&lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;}}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                },
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;required&amp;#34;&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#34;title&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    },
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;function&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;function&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;name&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;search_web_browser&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;description&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Controls the web browser to search for information. (Use when no API is available)&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;parameters&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;object&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;properties&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                    &lt;span style="color:#e6db74"&gt;&amp;#34;query&amp;#34;&lt;/span&gt;: {&lt;span style="color:#e6db74"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                },
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;                &lt;span style="color:#e6db74"&gt;&amp;#34;required&amp;#34;&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#34;query&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# 2. Agent Execution Logic (Simulation)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;run_agent&lt;/span&gt;(user_query: str):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Step 1: LLM requests tool usage (in reality, this is an OpenAI/Anthropic API call)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Simulating LLM response: Selecting the create_blog_post tool&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    llm_response &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;tool&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;create_blog_post&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#e6db74"&gt;&amp;#34;arguments&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;title&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Gemma 4 Acceleration Techniques&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;Google&amp;#39;s latest model, Gemma, through multi-token prediction...&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;            &lt;span style="color:#e6db74"&gt;&amp;#34;tags&amp;#34;&lt;/span&gt;: [&lt;span style="color:#e6db74"&gt;&amp;#34;AI&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;Google&amp;#34;&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Step 2: Local function execution (no vision needed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; llm_response[&lt;span style="color:#e6db74"&gt;&amp;#39;tool&amp;#39;&lt;/span&gt;] &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#39;create_blog_post&amp;#39;&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        print(&lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;[API Execution] Publishing blog post: &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;llm_response[&lt;span style="color:#e6db74"&gt;&amp;#39;arguments&amp;#39;&lt;/span&gt;][&lt;span style="color:#e6db74"&gt;&amp;#39;title&amp;#39;&lt;/span&gt;]&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#75715e"&gt;# In reality, this would be a requests.post(&amp;#39;https://blog-api.com/posts&amp;#39;, ...) call&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;        &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;success&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;cost&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;0.0001 USD&amp;#34;&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;print(run_agent(&lt;span style="color:#e6db74"&gt;&amp;#34;Write a blog post for me.&amp;#34;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This method is very inexpensive as it only exchanges text.&lt;/p&gt;
&lt;h4 id="unstructured-approach-computer-use-simulation---increased-cost"&gt;Unstructured Approach (Computer Use Simulation - Increased Cost)
&lt;/h4&gt;&lt;p&gt;Imagine if we bypassed the blog API and used Computer Use to open a web browser and write the post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Simulated Computer Use flow (cost explosion area); the two helpers below are hypothetical stubs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;capture_screen&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Stub: a real agent would grab an actual screenshot here&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;fake-screenshot&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;llm_vision_inference&lt;/span&gt;(screenshot, prompt):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Stub: a real agent would send the image to a vision model here&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;x&amp;#34;&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;500&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;y&amp;#34;&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;300&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;action&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;click&amp;#34;&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;run_computer_use_agent&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# 1. Screen Capture (1024x768 image -&amp;gt; approx. 1,100 tokens consumed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    screenshot &lt;span style="color:#f92672"&gt;=&lt;/span&gt; capture_screen()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    print(&lt;span style="color:#e6db74"&gt;&amp;#34;[Vision] Analyzing screen... (1,100 tokens consumed)&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# 2. LLM Inference: &amp;#34;Find the login button&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    action &lt;span style="color:#f92672"&gt;=&lt;/span&gt; llm_vision_inference(screenshot, prompt&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;Find the login button&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Result: {&amp;#34;x&amp;#34;: 500, &amp;#34;y&amp;#34;: 300, &amp;#34;action&amp;#34;: &amp;#34;click&amp;#34;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    print(&lt;span style="color:#e6db74"&gt;f&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;[Action] Moving mouse and clicking: &lt;/span&gt;&lt;span style="color:#e6db74"&gt;{&lt;/span&gt;action&lt;span style="color:#e6db74"&gt;}&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# 3. Capture screen again and analyze input fields&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    screenshot &lt;span style="color:#f92672"&gt;=&lt;/span&gt; capture_screen()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    print(&lt;span style="color:#e6db74"&gt;&amp;#34;[Vision] Analyzing input fields... (1,100 tokens consumed)&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# ... (Repetitive capture and inference)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#75715e"&gt;# Illustrative figures: 0.05 USD here vs 0.0001 USD for the API approach, roughly 500x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;    &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; {&lt;span style="color:#e6db74"&gt;&amp;#34;status&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;success&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;cost&amp;#34;&lt;/span&gt;: &lt;span style="color:#e6db74"&gt;&amp;#34;0.05 USD&amp;#34;&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;hr&gt;
&lt;h2 id="4-zeroclaw-and-mcp-architecture-application-guide"&gt;4. ZeroClaw and MCP Architecture Application Guide
&lt;/h2&gt;&lt;p&gt;Applying this principle to our ongoing projects like &lt;strong&gt;ZeroClaw (high-performance Rust agent)&lt;/strong&gt; or &lt;strong&gt;Discord MCP&lt;/strong&gt; leads to the following design.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adherence to MCP (Model Context Protocol) Standard:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expose all possible resources (file system, databases, cloud resources) to the &lt;strong&gt;MCP Server&lt;/strong&gt;, allowing LLMs to control them via &lt;strong&gt;Structured JSON&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Example: When sending a Discord message, guide the LLM to call &lt;code&gt;discord_mcp.send_message()&lt;/code&gt; instead of opening a browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prompt Engineering:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;State the tool-first policy explicitly in the system prompt, for example:&lt;/li&gt;
&lt;li&gt;

 &lt;blockquote&gt;
 &lt;p&gt;&amp;ldquo;You should call tools instead of looking at the screen. To fulfill user requests, first check the &lt;code&gt;available_tools&lt;/code&gt; list and prefer a function call whenever one fits.&amp;rdquo;&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fallback Mechanism:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a two-stage structure that wakes up the &amp;lsquo;Computer Use&amp;rsquo; or &amp;lsquo;Browser Automation&amp;rsquo; agent only when the &lt;code&gt;MCP Server&lt;/code&gt; or API is down, or when visual confirmation is absolutely necessary.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
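&lt;p&gt;One way to sketch the fallback mechanism is a chain of handlers tried from cheapest to most expensive. The handler names, the &lt;code&gt;ToolUnavailable&lt;/code&gt; exception, and the simulated outage are all hypothetical; a real system would plug its MCP, browser, and Computer Use clients in here. The chain has three entries for illustration, but the same loop works for a two-stage setup.&lt;/p&gt;

```python
from typing import Callable


class ToolUnavailable(Exception):
    """Raised by a handler that cannot serve the request right now."""


def via_mcp(request: str) -> str:
    # Simulated outage so the example exercises the fallback path.
    raise ToolUnavailable("MCP server is down")


def via_browser(request: str) -> str:
    return f"browser automation handled: {request}"


def via_computer_use(request: str) -> str:
    return f"computer use handled: {request}"


# Ordered from cheapest to most expensive.
FALLBACK_CHAIN: list[tuple[str, Callable[[str], str]]] = [
    ("mcp", via_mcp),
    ("browser", via_browser),
    ("computer_use", via_computer_use),
]


def run_with_fallback(request: str, chain=FALLBACK_CHAIN) -> tuple[str, str]:
    """Try each handler in order; return (stage_name, result) of the first success."""
    errors = []
    for name, handler in chain:
        try:
            return name, handler(request)
        except ToolUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all stages failed: " + "; ".join(errors))


if __name__ == "__main__":
    print(run_with_fallback("send a Discord message"))
```

&lt;p&gt;Because the expensive stages only run after a cheaper one has raised, the vision path is paid for only when it is actually needed.&lt;/p&gt;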
&lt;h2 id="5-conclusion"&gt;5. Conclusion
&lt;/h2&gt;&lt;p&gt;When developing AI agents, &amp;lsquo;Computer Use&amp;rsquo; is like a &amp;lsquo;Swiss Army knife&amp;rsquo;. It can do everything, but if you pull out the large knife (capture the screen) to tighten a single screw, the cost becomes immense.&lt;/p&gt;
&lt;p&gt;We must use the &lt;strong&gt;right tool for the right job&lt;/strong&gt;. For most tasks, we should opt for &lt;strong&gt;Structured Output (API)&lt;/strong&gt;, and only resort to &lt;strong&gt;Vision/GUI&lt;/strong&gt; functions when absolutely unavoidable. This strategy allows us to turn the &lt;strong&gt;45x cost difference&lt;/strong&gt; into our advantage.&lt;/p&gt;
&lt;p&gt;We will prioritize this cost-effectiveness as a guiding principle in the communication protocol design for the upcoming &lt;strong&gt;ZeroClaw&lt;/strong&gt; project.&lt;/p&gt;
</description></item></channel></rss>