Cloud-native stream processing
Get started
Arroyo ships as a single, compact binary. Run it locally on macOS or Linux for development, then deploy to production with Docker or Kubernetes.
Arroyo is a new kind of stream processing engine, built from the ground up to make real-time easier than batch.
Analytical SQL that just works
Arroyo was designed from the start so that anyone with SQL experience can build reliable, efficient, and correct streaming pipelines.
Data scientists and engineers can build end-to-end real-time applications, models, and dashboards—without a separate team of streaming experts.
CREATE VIEW tags AS (
  SELECT btrim(unnest(tags), '"') as tag FROM (
    SELECT extract_json(value, '$.tags[*].name') AS tags
    FROM mastodon)
);

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY window
    ORDER BY count DESC) as row_num
  FROM (SELECT count(*) as count,
    tag,
    hop(interval '5 seconds',
        interval '15 minutes') as window
  FROM tags
  group by tag, window)) WHERE row_num <= 5;
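For readers less familiar with JSON path expressions, the tags view can be pictured in plain Python. This is only an illustrative sketch, not how Arroyo evaluates the query: the function name extract_tags and the sample payload are hypothetical, and the payload shape (a top-level "tags" array of objects with a "name" field) is inferred from the path '$.tags[*].name'. The btrim call in the SQL presumably strips the quotes left on JSON-encoded strings; json.loads already decodes them here, so no trimming is needed.

```python
import json


def extract_tags(value: str) -> list[str]:
    """Return the name of every entry in the event's top-level "tags" array."""
    event = json.loads(value)
    # Events without a "tags" field contribute no rows.
    return [tag["name"] for tag in event.get("tags", [])]


# Hypothetical payload, trimmed to the one field the view reads.
sample = '{"id": "1", "tags": [{"name": "rust"}, {"name": "streaming"}]}'

# One output row per tag, which is the role unnest plays in the SQL.
rows = extract_tags(sample)
```

Each element of rows corresponds to one row of the tags view for that event.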
Designed for the modern cloud
Your streaming pipelines shouldn't page someone just because Kubernetes decided to reschedule your pods. Arroyo is built to run in modern, elastic cloud environments, from simple container runtimes like Fargate to large, distributed deployments on Kubernetes.
In short: Arroyo is a stateful stream processing engine that operates like a stateless one.
Scales easily to any workload
Arroyo is for everyone who needs to process data in real time. Small use cases can run with just a few MB of RAM and a fractional vCPU.
For larger streams, Arroyo can rescale vertically and horizontally to process tens of millions of events per second while maintaining exactly-once semantics.
Features
Process data using sliding, tumbling, and session windows with watermark processing to determine when all data for a window has arrived.
Arroyo SQL covers a full set of streaming joins, including left, outer, inner, and full, which can be windowed or operate over updating data.
Arroyo ships with over 300 SQL window, aggregate, and scalar functions, covering math, arrays, regex, JSON, and more.
Exactly-once processing means no duplicated or dropped events, even with out-of-order data and machine failures.
Arroyo can natively read and write JSON, Avro, Parquet, and raw text and binary. Custom formats can be implemented with UDFs.
Extend the built-in SQL by writing Rust user-defined scalar, aggregate, and async functions, with Python coming soon.
Manage connections, develop and test SQL queries, and monitor pipelines from the powerful Arroyo Web UI.
Pipelines can be created, operated, and managed with the REST API, offering declarative orchestration at scale.
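The windowing feature above relies on watermarks to decide when a window is complete. The sketch below shows the general idea in plain Python for tumbling windows. It is not Arroyo's implementation: the function name tumbling_counts and its parameters are hypothetical, and it assumes one common watermark strategy, where the watermark trails the largest timestamp seen so far by a fixed lateness allowance.

```python
from collections import defaultdict


def tumbling_counts(events, width, max_lateness):
    """Count events per tumbling window, emitting each window once the
    watermark passes its end.

    events: iterable of integer event timestamps (seconds), possibly out of order.
    width: window width in seconds.
    max_lateness: how far the watermark trails the largest timestamp seen.
    Returns a list of (window_start, count) pairs in emission order.
    """
    open_windows = defaultdict(int)
    emitted = []
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % width
        if start + width <= watermark:
            continue  # the window has already been emitted, so the late event is dropped
        open_windows[start] += 1
        watermark = max(watermark, ts - max_lateness)
        # Emit every open window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + width <= watermark):
            emitted.append((s, open_windows.pop(s)))
    return emitted


# Out-of-order timestamps: 3 and 9 arrive late but within the allowance,
# while 2 arrives after its window has been emitted.
result = tumbling_counts([1, 4, 3, 11, 9, 16, 2, 25], width=10, max_lateness=5)
```

In this run the event at 9 is still counted because its window is open when it arrives, whereas the event at 2 arrives after the watermark has passed the end of its window and is discarded. The last window stays open because no later event advances the watermark past its end.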
Well connected
Arroyo ships with tons of connectors, making it easy to integrate into your data stack.
Real-time with Arroyo
With Arroyo, you can build streaming pipelines by writing the same analytical SQL queries you are already running in your data warehouse.
CREATE TABLE mastodon (
value TEXT
) WITH (
connector = 'sse',
format = 'raw_string',
endpoint = 'http://mastodon.arroyo.dev/api/v1/streaming/public',
events = 'update'
);
CREATE VIEW tags AS (
SELECT btrim(unnest(tags), '"') as tag FROM (
SELECT extract_json(value, '$.tags[*].name') AS tags
FROM mastodon)
);
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY window
ORDER BY count DESC) as row_num
FROM (SELECT count(*) as count,
tag,
hop(interval '5 seconds', interval '15 minutes') as window
FROM tags
group by tag, window)) WHERE row_num <= 5;
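To make the semantics of the final query concrete, here is a small batch-style sketch in Python of what the hopping window and the ROW_NUMBER filter compute together. It is an illustration under stated assumptions, not Arroyo's execution model: the helper names hop_windows and top_tags are hypothetical, timestamps are integer seconds, and the window width is assumed to be a multiple of the slide (as it is for 5 seconds and 15 minutes).

```python
from collections import Counter, defaultdict


def hop_windows(ts, slide, width):
    """Start times of every hopping window [start, start + width) containing ts."""
    last_start = ts - ts % slide
    first_start = last_start - width + slide
    return range(first_start, last_start + 1, slide)


def top_tags(events, slide, width, n):
    """events: iterable of (timestamp_seconds, tag) pairs.

    Returns {window_start: [(tag, count), ...]} with at most n tags per window,
    ordered by descending count.
    """
    counts = defaultdict(Counter)
    for ts, tag in events:
        # Each event is counted once in every window that overlaps it.
        for start in hop_windows(ts, slide, width):
            counts[start][tag] += 1
    return {start: c.most_common(n) for start, c in sorted(counts.items())}


# Small slide and width so the overlap is easy to follow by hand.
result = top_tags([(7, "a"), (8, "b"), (12, "a")], slide=5, width=15, n=1)
```

With the query's actual parameters (a 5-second slide and a 15-minute width), each event falls into 180 overlapping windows, which is why the results refresh every 5 seconds while still covering the last 15 minutes.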
Recent posts from the blog
Talk: Latency, Throughput, Fault Tolerance
November 1, 2024
Arroyo creator Micah Wylde recently spoke at P99Conf, discussing how Arroyo achieves low latency and high throughput while maintaining fault tolerance and fast recovery times.
Announcing Arroyo 0.12.0
September 24, 2024
Arroyo 0.12 is now available! This release introduces Python UDFs, Protobuf ingestion, JSON syntax, custom state TTLs, and many other features, improvements, and fixes.
Serverless Arroyo pipelines on Fly.io
August 28, 2024
Arroyo is the easiest way to build real-time data pipelines, and Fly.io is the easiest way to run them. This tutorial shows how to use the new pipeline cluster feature in Arroyo 0.11 to build a streaming pipeline and a web app that consumes it, all running on Fly's serverless infrastructure.