Performance on Yarang's Tech Lair

Memory Safety and Efficient Resource Management of the ZeroClaw Agent Runtime

Sat, 09 May 2026 09:01:27 +0900

As we’ve been building a high-performance multi-agent runtime through the ZeroClaw project, we’ve been contemplating how to leverage Rust’s distinctive features—‘memory safety’ and ‘zero-cost abstractions’—in practice. Beyond simply being safe, the core challenge was how to efficiently manage system resources and maintain stable performance without Garbage Collection (GC) in a scenario where numerous agents simultaneously exchange messages.

This post aims to share the efficient resource management strategies based on Rust and practical code examples that were applied during the ZeroClaw architecture design process.

Problem Definition: Resource Bottlenecks in Multi-Agent Environments

In multi-agent systems, each agent possesses its own independent state and communicates through asynchronous messages. This process gives rise to the following resource issues:

Frequent Allocation/Deallocation (Allocation Thrashing): When hundreds of agents process thousands of messages per second, frequent allocation and deallocation of heap memory become a primary cause of performance degradation.
Data Race: We must prevent race conditions that can occur when multiple agents access shared resources, while also avoiding bottlenecks caused by excessive lock usage.
Lifecycle Management: A mechanism is needed to safely reclaim resources, ensuring that memory leaks do not occur throughout the system even if an agent terminates abnormally.

Solution Strategy: Rust’s Ownership and Tokio’s Scheduling

To address these issues, ZeroClaw has combined Rust’s Ownership system with the asynchronous abstractions of the tokio runtime.

1. State Sharing using `Arc` and `RwLock`

For sharing immutable data between agents, we keep costs low with Arc (Atomic Reference Counting), which shares a single allocation instead of copying it. For state that must be updated, we use RwLock: any number of readers can hold the lock concurrently, while a writer gets exclusive access so that updates never interleave with reads.

2. Message Passing via Channels

Instead of directly managing shared memory state, we adopted a message-passing approach (Actor model) using tokio::sync::mpsc channels. This fundamentally prevents data races by allowing each agent to exclusively manage its own state.
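The same idea can be tried outside Rust. The following is a minimal sketch in Python's asyncio, purely for illustration; ZeroClaw itself is built on Rust and tokio, and the names `agent`, `demo`, and the tuple-based command format are assumptions made for this example. Each agent owns its counter as a local variable, and the only way to influence it is to put a message on its bounded queue.

```python
import asyncio

async def agent(name, inbox, log):
    """An actor: its state (processed) is local and never shared."""
    processed = 0
    while True:
        cmd, payload = await inbox.get()
        if cmd == "shutdown":
            log.append((name, "shutdown", processed))
            return processed
        if cmd == "task":
            processed += 1
            log.append((name, payload, processed))

async def demo():
    log = []
    # Bounded queue: senders wait when it is full, similar in spirit
    # to a bounded mpsc channel.
    inbox = asyncio.Queue(maxsize=100)
    worker = asyncio.create_task(agent("Agent-0", inbox, log))
    for i in range(3):
        await inbox.put(("task", f"Task-{i}"))
    await inbox.put(("shutdown", None))
    return await worker, log

result, log = asyncio.run(demo())
print(f"Agent-0 processed {result} tasks")
```

Because no other coroutine holds a reference to `processed`, there is nothing to lock. Python does not check this at compile time the way Rust's ownership rules do, so the sketch shows the structure of the pattern rather than its guarantees.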

Practical Code Examples

Below is an example implementation of a simple agent message handler used in ZeroClaw’s communication layer.

Agent Message Definition and Handler Structure

use tokio::sync::{mpsc, RwLock};
use std::sync::Arc;
use std::time::Duration;

// Define the command types agents will process
#[derive(Debug)]
enum AgentCommand {
    ProcessTask(String),
    UpdateStatus(String),
    Shutdown,
}

// Agent's state structure
struct AgentState {
    id: String,
    status: String,
    processed_tasks: u64,
}

// Agent executor structure
struct AgentExecutor {
    state: Arc<RwLock<AgentState>>,
    receiver: mpsc::Receiver<AgentCommand>,
}

impl AgentExecutor {
    // Constructor for creating a new agent
    fn new(id: String, receiver: mpsc::Receiver<AgentCommand>) -> Self {
        Self {
            state: Arc::new(RwLock::new(AgentState {
                id,
                status: "Initialized".to_string(),
                processed_tasks: 0,
            })),
            receiver,
        }
    }

    // Start the message reception and processing loop
    async fn run(mut self) {
        println!("Agent {} started.", self.state.read().await.id);

        while let Some(cmd) = self.receiver.recv().await {
            match cmd {
                AgentCommand::ProcessTask(task_id) => {
                    // Simulate asynchronous work (e.g., LLM inference request)
                    let task_id_clone = task_id.clone();
                    let state_clone = Arc::clone(&self.state);

                    // Process as a background task to avoid blocking the message loop
                    tokio::spawn(async move {
                        tokio::time::sleep(Duration::from_millis(100)).await;
                        let mut state = state_clone.write().await;
                        state.processed_tasks += 1;
                        state.status = format!("Processing {}", task_id_clone);
                        println!(
                            "Task {} processed by Agent {}. Total: {}",
                            task_id_clone, state.id, state.processed_tasks
                        );
                    });
                }
                AgentCommand::UpdateStatus(new_status) => {
                    let mut state = self.state.write().await;
                    state.status = new_status;
                }
                AgentCommand::Shutdown => {
                    println!("Agent {} shutting down...", self.state.read().await.id);
                    break;
                }
            }
        }
    }
}

Main Runtime Configuration and Resource Management

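Now, let’s write the main runtime code that creates and manages the agents above. Here, we implement graceful shutdown in two steps to prevent resource leaks: a dedicated shutdown channel stops the task distributor first, and then an explicit Shutdown command is sent to every agent.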

#[tokio::main]
async fn main() {
    // Store a list of senders for managing multiple agents
    // Managed as a Vec to handle agent termination
    let mut agent_senders = Vec::new();

    // Spawn 3 agents
    for i in 0..3 {
        let (tx, rx) = mpsc::channel(100); // Buffer size 100
        agent_senders.push(tx);

        let executor = AgentExecutor::new(format!("Agent-{}", i), rx);
        tokio::spawn(executor.run());
    }

    // System-wide shutdown signal (handling Ctrl+C, etc.)
    let (shutdown_tx, mut shutdown_rx) = mpsc::channel::<()>(1);

    // The `async move` block below takes ownership of everything it captures,
    // so hand it cloned senders; main keeps the originals to send Shutdown later.
    let distributor_senders = agent_senders.clone();

    // Task distribution logic (simulation)
    let task_distributor = tokio::spawn(async move {
        let mut task_counter = 0;
        loop {
            // Check for shutdown signal
            if shutdown_rx.try_recv().is_ok() {
                println!("Task distributor stopping...");
                break;
            }

            // Send tasks to agents in a round-robin fashion
            if !distributor_senders.is_empty() {
                let target_index = task_counter % distributor_senders.len();
                let task_id = format!("Task-{}", task_counter);

                if distributor_senders[target_index]
                    .send(AgentCommand::ProcessTask(task_id))
                    .await
                    .is_err()
                {
                    println!("Failed to send task. Agent might be dead.");
                }

                task_counter += 1;
                tokio::time::sleep(Duration::from_millis(50)).await;
            }
        }
    });

    // Simulate system shutdown after 5 seconds
    tokio::time::sleep(Duration::from_secs(5)).await;

    // 1. Terminate task distribution
    let _ = shutdown_tx.send(()).await;
    task_distributor.await.unwrap();

    // 2. Send shutdown command to all agents
    for tx in agent_senders {
        let _ = tx.send(AgentCommand::Shutdown).await;
    }

    // Wait for resource cleanup
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("System shutdown complete.");
}

Key Point Analysis

Arc<RwLock<State>> Pattern: The AgentExecutor stores its state wrapped in Arc<RwLock>. Asynchronous tasks created with tokio::spawn receive a clone of this Arc. Cloning is very lightweight because it only increments the reference count; the data itself is not copied.
Ownership Transfer in MPSC Channels: The tx (Sender) end is owned by the main loop, and the rx (Receiver) end is owned by the AgentExecutor. This clear separation of ownership ensures at compile time who sends and who receives messages.
Harmony of Asynchronous I/O and Locks: When using state.write().await, the current task is suspended (yielded) until it acquires the lock for writing. This differs from blocking an OS thread and allows other tasks to utilize the CPU, thereby increasing multi-core utilization.

Conclusion

Rust’s memory management mechanisms are not just about safety; they become a powerful tool for designing high-performance server architectures. In the ZeroClaw project, this allowed us to minimize inter-agent communication overhead and achieve predictable latency. In particular, the channel-based architecture combined with the tokio runtime provides a foundation for maintaining stability even in complex systems where thousands of agents interact.

In the next post, we will expand on inter-agent communication to discuss an architecture for implementing file-based persistence.

The Evolution of Redis Arrays: An Architectural Analysis for Large-Scale Data Processing

Tue, 05 May 2026 09:00:52 +0900

Hello everyone! I recently came across an interesting article on Hacker News, written by Oran Agra, one of Redis’s core developers, titled “Redis array: short story of a long development process.” This wasn’t just a story about adding a new feature; it was a testament to the dedication of developers who tackled 25-year-old legacy code, ensuring performance, maintaining stability, and formatting a massive codebase overnight.

Today, based on this article, we’ll dive deep into how the Array data structure has evolved within Redis and what lessons we can learn for designing large-scale systems.

1. The Problem: The Shackle of 25-Year-Old Legacy Code

Redis’s LIST data structure internally uses QuickList. QuickList is a doubly linked list whose nodes are compact, contiguous blocks (ziplists, replaced by listpacks in Redis 7.0), so it combines the cheap insertion of a linked list with the memory density of a packed array. However, when dealing with massive lists containing tens of millions of elements, memory fragmentation and cache misses were causing significant performance degradation.

Specifically, when processing array-type data, the existing structure had the following bottlenecks:

Memory Overhead: Additional memory usage due to pointer connections.
Sequential Access Cost: Latency caused by inefficient use of cache lines.

To address this, the development team decided to overhaul the internal structure at the C language level. The biggest challenge here was the “legacy code that had to be changed.”

2. The Solution: Formatting a 25M-line Codebase

The most impressive part of the article was “Formatting a 25M-line codebase overnight.” Formatting and refactoring 25 million lines of code took more than technical skill; it demanded strategy akin to chess.

2.1. Preparations for Refactoring

The biggest fear in large-scale refactoring is “regression.” Modifying the array structure could affect hundreds of Redis commands (like LPUSH, RPUSH, LINDEX, etc.).

To mitigate this, the team adopted the following approach:

Expand Test Coverage: Ensure existing commands pass unit tests.
Strengthen CI/CD Pipeline: Implement benchmarking scripts to immediately detect performance degradation upon code changes.
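A benchmark gate of the kind described above can be very small. The following is a minimal sketch, not the Redis team's actual tooling: the helper names `time_ms` and `is_regression` and the 10% tolerance are assumptions chosen for illustration.

```python
import time

def time_ms(fn, repeats=5):
    """Return the best-of-N wall-clock time of fn() in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

def is_regression(baseline_ms, current_ms, tolerance=0.10):
    """True when the current run is more than `tolerance` slower than the baseline."""
    return current_ms > baseline_ms * (1 + tolerance)

if __name__ == "__main__":
    # Hypothetical workload standing in for a real command benchmark.
    current = time_ms(lambda: sorted(range(100_000, 0, -1)))
    print(f"current: {current:.3f} ms")
```

In a CI job, the baseline would come from a stored result of the main branch, and the job would fail when `is_regression` returns True. Taking the best of several runs reduces noise from a busy build machine.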

2.2. The New Structure of Redis Arrays

The improved Array structure moved beyond simply allocating memory and was modified to maximize data locality. The core principle was “maximizing the use of contiguous memory blocks while allowing for segmentation and management when necessary.”

This yielded the following benefits:

Improved CPU Cache Hit Rate: Significantly increased L1/L2 cache hit rates due to contiguous memory access.
Memory Savings: Reduced actual data storage space by minimizing unnecessary pointer connections.
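The pointer overhead is easy to see in any language. Below is a rough sketch in CPython that compares the bytes used by doubly linked nodes with a packed array of 64-bit integers. It is an analogy only: the `Node` class and the helper names are assumptions for this example, and the numbers describe CPython objects, not Redis internals.

```python
import sys
from array import array

class Node:
    """One element of a doubly linked list: a value plus two pointers."""
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def linked_bytes(n):
    """Approximate bytes for n linked nodes (node objects only)."""
    return sum(sys.getsizeof(Node(i)) for i in range(n))

def contiguous_bytes(n):
    """Bytes for n 64-bit integers stored in one contiguous buffer."""
    return sys.getsizeof(array("q", range(n)))

n = 10_000
linked = linked_bytes(n)
packed = contiguous_bytes(n)
print(f"linked nodes: {linked} bytes, packed array: {packed} bytes")
```

The packed array stores about 8 bytes per element, while every linked node pays for its object header and two pointers in addition to the value. The same trade-off, at a different scale, is what makes contiguous blocks attractive for cache locality.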

3. Practical Guide: Efficient Array Usage in Redis

Now that we’ve covered the theoretical background, let’s look at how to apply it in practice with code.

3.1. Problems with Existing List Usage

First, let’s consider the traditional way of adding tens of millions of items to a list. This operates based on QuickList, and as the number of items increases, the number of jumps also increases.

# Traditional Method (QuickList based)
# Add 10,000,000 items (potential for memory and speed degradation)
for i in {1..10000000}; do
 redis-cli LPUSH my_huge_list "item:$i"
done

3.2. Optimization using Streams and Hashes

While the internal improvements to Redis Arrays are transparent to users, when designing, we need to consider “data size” and “access patterns.” If simple sequential storage is all that’s needed, using the latest version of Redis alone will provide benefits.

However, if you need to search or modify data within the array, it’s advisable to use HASH instead of LIST.

import redis
import time

r = redis.Redis(host='localhost', port=6379, db=0)

# Scenario: Storing Log Data (Large Scale)
# 1. Using List (for sequential storage)
def push_to_list(count):
    start = time.time()
    for i in range(count):
        r.lpush("logs:timeline", f"log_entry_{i}")
    print(f"List pushed {count} items in {time.time() - start:.4f}s")

# 2. Using Hash (for search and modification)
def push_to_hash(count):
    start = time.time()
    pipe = r.pipeline()
    for i in range(count):
        pipe.hset("logs:details", f"entry_{i}", f"log_content_{i}")
    pipe.execute()
    print(f"Hash pushed {count} items in {time.time() - start:.4f}s")

if __name__ == "__main__":
    # Test inserting 100,000 data points
    push_to_list(100000)
    push_to_hash(100000)

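Execution Result Analysis: In recent Redis versions (7.x and above), the internal Array structure is optimized, making LPUSH very fast. Keep in mind that the two timings above are not directly comparable, because push_to_hash batches its commands in a pipeline while push_to_list pays one network round trip per LPUSH. However, if you frequently need to retrieve data at a specific index, LINDEX has a complexity of O(N), making the O(1) approach using HGET much more advantageous.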
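To make the O(N) versus O(1) difference concrete, here is a simplified model in plain Python. It is not how Redis implements LINDEX or HGET; the `LinkedList` class and its hop counter are assumptions for illustration. It only shows why index lookup in a linked structure gets slower as the index grows while a hash lookup does not.

```python
class _Node:
    __slots__ = ("value", "next")

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedList:
    """Singly linked list that reports how many nodes a lookup walks past."""

    def __init__(self, values):
        self.head = None
        for value in reversed(list(values)):
            self.head = _Node(value, self.head)

    def get(self, index):
        """Return (value, hops); hops grows linearly with index."""
        node, hops = self.head, 0
        while hops < index:
            node = node.next
            hops += 1
        return node.value, hops

entries = [f"log_entry_{i}" for i in range(10_000)]
as_list = LinkedList(entries)
as_hash = {f"entry_{i}": v for i, v in enumerate(entries)}

value, hops = as_list.get(9_000)
print(f"list lookup walked {hops} nodes to find {value}")
print(f"hash lookup found {as_hash['entry_9000']} with one probe")
```

The list lookup visits every node before the target, so its cost depends on where the element sits. The dictionary lookup costs about the same for any key, which is the property the hash-based design relies on.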

4. Conclusion: The Harmony of Development Culture and Technology

The development process of Redis Arrays offers us important lessons:

Performance Isn’t Free: Improving 25-year-old code requires commensurate refactoring and testing costs.
Investment in Tools: This work was possible due to automated tools and a CI/CD environment capable of formatting 25 million lines of code.

When we design systems, we need to go beyond simply asking “Is it fast?” and consider “How can we achieve maintainable performance?” As the Redis team demonstrated, sometimes we must not shy away from large-scale improvements that shake the foundations of the architecture.

Thank you!

Beyond Hardware Limits: Unraveling Disk Physical Structure with Microbenchmarking

Mon, 04 May 2026 20:49:16 +0900

Recently, an interesting 2019 article was brought back into the spotlight via Hacker News: “Discovering hard disk physical geometry through microbenchmarking.” In an era where high-performance SSDs are commonplace, why is it important to understand the physical structure of rotational media (HDDs)?

In fact, the core of this article goes beyond the simple structure of a hard disk, focusing on “a methodology for inferring hardware’s internal operations through Observable Performance.” This principle is applicable not only to analyzing the performance characteristics of modern NVMe SSDs with ZNS (Zoned Namespace) storage but also to low-power network devices like the recently discussed BYOMesh based on LoRa.

In this post, we will practice the microbenchmarking technique of uncovering the hardware’s “Physical Geometry” by writing some simple code ourselves.

Why Microbenchmarking?

Software developers can work without knowing complex hardware details thanks to the abstraction layers between the OS and hardware. However, this changes when developing systems that require high performance, such as e-commerce platforms handling high transaction volumes or analytical systems processing large amounts of data.

It is difficult to learn the actual sector layout, cache memory size, or rotational latency from OS-level interfaces alone, such as the lsblk command or the fstat system call. This is where microbenchmarking, which means performing read/write operations and measuring how long they take, becomes the most powerful tool.

Fundamental Principles of Benchmarking

The data access speed of a hard disk drive (HDD) is determined by the following three factors:

Seek Time: The time it takes for the head to move to the relevant track (physical movement).
Rotational Latency: The time until the sector containing the data rotates under the head.
Transfer Time: The time to actually read the data.

We will focus on ‘Seek Time’. The further the head has to move, the longer it takes. By measuring the time difference between reading adjacent sectors and sectors far apart, we can infer the disk’s physical layout (track and cylinder structure).
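The three factors above add up to a simple estimate of the time for one random read. The sketch below works through the arithmetic; the drive parameters (7200 RPM, 9 ms average seek, 150 MB/s transfer rate) are assumed example values, not measurements of any particular disk.

```python
def avg_rotational_latency_ms(rpm):
    """On average the target sector is half a revolution away."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

def transfer_time_ms(num_bytes, mb_per_s):
    """Time to stream num_bytes at a sustained rate (1 MB = 1,000,000 bytes)."""
    return num_bytes / (mb_per_s * 1_000_000) * 1000

def access_time_ms(seek_ms, rpm, num_bytes, mb_per_s):
    """Seek time + rotational latency + transfer time."""
    return (seek_ms
            + avg_rotational_latency_ms(rpm)
            + transfer_time_ms(num_bytes, mb_per_s))

# Assumed example drive: 7200 RPM, 9 ms average seek, 150 MB/s.
rotation = avg_rotational_latency_ms(7200)
total = access_time_ms(seek_ms=9.0, rpm=7200, num_bytes=4096, mb_per_s=150)
print(f"rotational latency: {rotation:.2f} ms")   # about 4.17 ms
print(f"one 4 KB random read: {total:.2f} ms")    # about 13.19 ms
```

For a 4 KB read, the transfer itself takes only a few hundredths of a millisecond. Almost all of the time is mechanical, which is why seek distance shows up so clearly in timing measurements.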

Hands-on: Exploring Disk Structure with Python

Now, let’s use Python to measure the performance difference between random and sequential access. This code is a simple example: random reads force the head to travel across the whole disk, including between the ‘Outer Zone’ and ‘Inner Zone’, while sequential reads keep it almost still.

Caution: This script accesses actual disk devices (e.g., /dev/sdX). Be sure to use a test disk with no data on it or run it in a VM environment. Accessing the wrong device can lead to data corruption.

import os
import random
import sys
import time

# Disk path to test (needs to be changed to a VM or separate test disk)
# Example: '/dev/sdb' for Linux, '/dev/rdisk2' for macOS
DISK_PATH = '/dev/sdb'
# Read block size (4KB)
BLOCK_SIZE = 4096
# Number of measurements
ITERATIONS = 1000

def device_size(fd):
    """Returns the device size in bytes.

    os.path.getsize() reports 0 for block devices on Linux, so seek to the end instead.
    """
    return os.lseek(fd, 0, os.SEEK_END)

def benchmark_random_access(fd, total_bytes):
    """Measures performance when accessing random locations"""
    total_blocks = total_bytes // BLOCK_SIZE

    start_time = time.perf_counter()
    for _ in range(ITERATIONS):
        # Pick a random block-aligned offset inside the device
        aligned_offset = random.randrange(total_blocks) * BLOCK_SIZE

        os.lseek(fd, aligned_offset, os.SEEK_SET)
        os.read(fd, BLOCK_SIZE)

    end_time = time.perf_counter()
    return (end_time - start_time) * 1000  # Convert to ms

def benchmark_sequential_access(fd):
    """Measures performance when accessing sequential locations"""
    start_time = time.perf_counter()
    for _ in range(ITERATIONS):
        os.read(fd, BLOCK_SIZE)

    end_time = time.perf_counter()
    return (end_time - start_time) * 1000  # Convert to ms

if __name__ == "__main__":
    if not os.path.exists(DISK_PATH):
        print(f"Error: {DISK_PATH} not found. Please update DISK_PATH.")
        sys.exit(1)

    print(f"Benchmarking {DISK_PATH}...")

    fd = None
    try:
        # Without O_DIRECT (Linux), reads may be served from the page cache, which hides seek cost.
        # O_DIRECT requires aligned buffers, so this example uses plain buffered reads for
        # compatibility; treat the numbers as approximate.
        fd = os.open(DISK_PATH, os.O_RDONLY)

        total_bytes = device_size(fd)
        if total_bytes < BLOCK_SIZE:
            print("Error: could not determine the device size.")
            sys.exit(1)

        print("1. Measuring Random Access (Simulating Head Seek)...")
        # Random access is slow due to continuous head movement
        random_time = benchmark_random_access(fd, total_bytes)
        print(f" Random Access Time: {random_time:.2f} ms")

        print("2. Measuring Sequential Access (Minimal Head Movement)...")
        # Reset file pointer to the beginning
        os.lseek(fd, 0, os.SEEK_SET)
        sequential_time = benchmark_sequential_access(fd)
        print(f" Sequential Access Time: {sequential_time:.2f} ms")

        print("\n--- Analysis ---")
        print(f"Performance Gap (Seek Cost): {random_time - sequential_time:.2f} ms")
        print("The gap represents the time spent moving the disk head physically.")
    except PermissionError:
        print("Error: Permission denied. Try running with 'sudo'.")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Always release the file descriptor, even if a read failed
        if fd is not None:
            os.close(fd)

Interpreting and Utilizing Results

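Running the code above on a rotational disk, you will observe that random access is significantly slower than sequential access. This ‘Gap’ is dominated by the time spent on physical seeking and rotation, although OS caching and read-ahead also make the sequential numbers look better when O_DIRECT is not used.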

If you were to perform this measurement separately at the beginning of the disk (outer tracks) and at the end (inner tracks), you might discover that the outer tracks have a faster transfer rate than the inner tracks due to the disk’s Zone Bit Recording (ZBR) structure. In the past, this was utilized to tune data placement to the front of the disk.
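A zone comparison like this can be sketched by timing a large sequential read at the start and at the end of the device. The code below is a minimal sketch under the same caveats as the main script: the function names and the 64 MB span are assumptions, and buffered reads mean the page cache can distort repeated runs. It works on any readable path, so it can be tried on an ordinary file before pointing it at a test disk.

```python
import os
import time

def read_throughput_mb_s(path, offset, length, block_size=1024 * 1024):
    """Sequentially read `length` bytes starting at `offset`; return MiB/s."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        remaining, total = length, 0
        start = time.perf_counter()
        while remaining > 0:
            chunk = os.read(fd, min(block_size, remaining))
            if not chunk:  # reached end of file/device
                break
            total += len(chunk)
            remaining -= len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")

def compare_zones(path, span=64 * 1024 * 1024):
    """Return (throughput at the start, throughput at the end) of `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)
    span = min(span, size // 2)  # keep the two regions from overlapping
    first = read_throughput_mb_s(path, 0, span)
    last = read_throughput_mb_s(path, size - span, span)
    return first, last
```

On an HDD, `compare_zones('/dev/sdX')` would be expected to report a higher figure for the first region than for the last if the outer tracks map to low block addresses. On an SSD or a regular file, the two values should be roughly equal.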

Modern Relevance: Lessons from the SSD and Cloud Era

Although spinning disk technology is becoming a thing of the past, the principle of “understanding a system’s internals through performance measurement” remains unchanged.

SSD Internal Parallelism: SSDs internally operate multiple channels and planes in parallel. If performance dramatically increases when we induce sequential reads using multithreading, this can be a signal to infer the internal controller’s parallel processing capabilities.
Cloud Storage I/O: By capturing phenomena like the ‘Burst’ followed by a ‘Baseline’ drop in disk I/O performance on AWS or Azure through microbenchmarking, you can design cost-effective architectures.
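Spotting the burst-to-baseline drop can be automated once throughput is sampled at regular intervals. The sketch below is one simple way to do it: the function name, the window size, and the 50% drop ratio are assumptions, and the sample series is synthetic data made up to exercise the function, not a measurement from any cloud provider.

```python
def find_throttle_point(samples_mb_s, window=3, drop_ratio=0.5):
    """Return the index where throughput settles below drop_ratio * burst level.

    The burst level is the mean of the first `window` samples. A later window
    counts as throttled when its mean falls below burst * drop_ratio.
    Returns None when no such drop is found.
    """
    if len(samples_mb_s) < window * 2:
        return None
    burst = sum(samples_mb_s[:window]) / window
    for i in range(window, len(samples_mb_s) - window + 1):
        mean = sum(samples_mb_s[i:i + window]) / window
        if mean < burst * drop_ratio:
            return i
    return None

# Synthetic series: five samples near 250 MB/s, then a drop to about 100 MB/s.
samples = [250, 248, 251, 249, 250, 120, 100, 101, 99, 100]
print(find_throttle_point(samples))  # 5
```

Multiplying the returned index by the sampling interval gives a rough estimate of how long the burst lasted, which is the figure that matters when sizing a volume for a sustained workload.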

Conclusion

The ‘Discovering hard disk physical geometry’ article, which regained attention on Hacker News, goes beyond mere curiosity to remind us of the most fundamental stance in diagnosing system performance bottlenecks.

Instead of vaguely concluding “the disk is slow,” prove with data “where and why it is slow” by running simple scripts yourself. This is the first step towards true performance tuning.

We encourage you to run the benchmarking code written in today’s post in your development environment. Discovering unexpected hardware characteristics and directly observing their impact on system performance will be a very interesting experience.
