We're thrilled to announce the 0.3.0 release of Arroyo, our second minor release as an open-source project. Arroyo is a state-of-the-art stream processing engine designed to allow anyone to build complex, stateful real-time data pipelines with SQL.
Overview
The Arroyo 0.3 release focused on improving the flexibility of the system and the completeness of its SQL support, with the MVP for UDF support, DDL statements, and custom event time and watermarks. There have also been many substantial improvements to the Web UI, including error reporting, backpressure monitoring, and under-the-hood infrastructure improvements.
We've also greatly expanded our docs since the last release. Check them out at https://doc.arroyo.dev.
New contributors
We are excited to welcome three new contributors to the project with this release:
- @haoxins made their first contribution in #100
- @edmondop made their first contribution in #122
- @chenquan made their first contribution in #147
Thanks to all new and existing contributors!
What's next
Looking forward to the 0.4 release, we have a lot of exciting changes planned. We're adding the ability to create updating tables with native support for Debezium, allowing users to connect Arroyo to relational databases like MySQL and Postgres. Other planned features include external joins, session windows, and Delta Lake integration.
Excited to be part of the future of stream processing? Come chat with the team on our Discord, check out a starter issue and submit a PR, and let us know what you'd like to see next in Arroyo!
Features
UDFs
With this release we are shipping initial support for writing user-defined functions (UDFs) in Rust, allowing users to extend SQL with custom business logic. See the UDF docs for full details.
For example, we can register a Rust function:
// Returns the great-circle distance between two coordinates, in kilometers
fn gcd(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    // Mean radius of the Earth, in kilometers
    let radius = 6371.0;

    // Haversine formula
    let dlat = (lat2 - lat1).to_radians();
    let dlon = (lon2 - lon1).to_radians();
    let a = (dlat / 2.0).sin().powi(2) +
        lat1.to_radians().cos() *
        lat2.to_radians().cos() *
        (dlon / 2.0).sin().powi(2);
    let c = 2.0 * a.sqrt().atan2((1.0 - a).sqrt());

    radius * c
}
and call it from SQL:
SELECT gcd(src_lat, src_long, dst_lat, dst_long)
FROM orders;
SQL DDL statements
It's now possible to define sources and sinks directly in SQL via CREATE TABLE statements:
CREATE TABLE orders (
    customer_id INT,
    order_id INT,
    date_string TEXT
) WITH (
    connection = 'my_kafka',
    topic = 'order_topic',
    serialization_mode = 'json'
);
These tables can then be read from with SELECT and written to with INSERT, connecting queries to the underlying systems. For example, we can duplicate the orders topic by inserting its records into a new table backed by a different topic:
CREATE TABLE orders_copy (
    customer_id INT,
    order_id INT,
    date_string TEXT
) WITH (
    connection = 'my_kafka',
    topic = 'order_topic_copy',
    serialization_mode = 'json'
);
INSERT INTO orders_copy SELECT * FROM orders;
In addition to connection tables, this release also adds support for views and virtual tables, which are helpful for splitting up complex queries into smaller components.
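For example, a view can factor the parsing of the date string out of downstream queries. This is just a sketch; the view name, derived column, and filter are illustrative:

CREATE VIEW enriched_orders AS
    SELECT
        customer_id,
        order_id,
        CAST(date_string AS TIMESTAMP) AS order_time
    FROM orders;

-- Downstream queries can now select from the view directly
SELECT customer_id, order_id
FROM enriched_orders
WHERE order_id > 1000;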
- Feature/inline create table by @jacksonrnewhouse in #101
- Rework sources and sinks to allow for creating tables/views in SQL queries by @jacksonrnewhouse in #107
Custom event time and watermarks
Arroyo now supports custom event time fields and watermarks, allowing users to derive both from the data in their streams.
When creating a connection table in SQL, it is now possible to define a virtual field generated from the data in the stream and then assign that to be the event time. We can then generate a watermark from that event time field as well.
A complete example looks like this:
CREATE TABLE orders (
    customer_id INT,
    order_id INT,
    date_string TEXT,
    event_time TIMESTAMP GENERATED ALWAYS AS (CAST(date_string AS TIMESTAMP)),
    watermark TIMESTAMP GENERATED ALWAYS AS (event_time - INTERVAL '15' SECOND)
) WITH (
    connection = 'my_kafka',
    topic = 'order_topic',
    serialization_mode = 'json',
    event_time_field = 'event_time',
    watermark_field = 'watermark'
);
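Once the event time field is assigned, windowed aggregations are computed over it rather than over processing time. As a minimal sketch (assuming Arroyo's tumble() window syntax), counting orders per customer in one-minute event-time windows looks like:

-- Windows close based on the watermark defined on the table above
SELECT customer_id, count(*) AS order_count
FROM orders
GROUP BY customer_id, tumble(INTERVAL '1' MINUTE);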
For more on the underlying concepts of event times and watermarks, see the concept docs.
- Support virtual fields and overriding timestamp via event_time_field by @jacksonrnewhouse in #127
- Add ability to configure watermark by specifying a specific override column. by @jacksonrnewhouse in #142
Additional SQL features
Beyond UDFs and DDL statements, we have continued to expand the completeness of our SQL support with the addition of CASE statements and regex functions (a short example follows the list below):
- Allow filters on join computations by @jacksonrnewhouse in #131
- Implement CASE statements by @jacksonrnewhouse in #146
- Adding support for regex_replace and regex_match by @edmondop in #122
- Rework top N window functions by @jacksonrnewhouse in #136
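As a quick illustration of the new CASE and regex support, combining both against the orders table from earlier (a sketch: the regex function name follows the PR title and may differ in the released SQL dialect, and the threshold value is arbitrary):

SELECT
    CASE
        WHEN order_id > 1000 THEN 'large'
        ELSE 'small'
    END AS order_size,
    -- Normalize slashes in the raw date string to dashes
    regex_replace(date_string, '/', '-') AS normalized_date
FROM orders;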
Server-Sent Events source
We've added a new source which allows reading from Server-Sent Events APIs (also called EventSource). SSE is a simple protocol for streaming data from HTTP servers and is a common choice for web applications. See the SSE source documentation for more details, and take a look at the new Mastodon trends tutorial that makes use of it.
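For illustration, an SSE source table follows the same connection-table pattern as the Kafka examples above. The connection name and columns here are hypothetical, and the exact options are described in the SSE source documentation:

CREATE TABLE trends (
    name TEXT,
    url TEXT
) WITH (
    connection = 'my_sse',
    serialization_mode = 'json'
);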
- Add event source source operator by @mwylde in #106
- Add HTTP connections and add support for event source tables in SQL by @mwylde in #119
Web UI
This release has seen a ton of improvements to the web UI.
- Show SQL names instead of primitive types in catalog by @jbeisen in #84
- Add backpressure metric by @jbeisen in #109
- Add backpressure graph and color pipeline nodes by @jbeisen in #110
- Add page not found page by @jbeisen in #130
- Use SWR for fetching data for job details page by @jbeisen in #129
- Show operator checkpoint sizes by @jbeisen in #139
- Write eventsource and kafka source errors to db by @jbeisen in #140
- Add Errors tab to job details page by @jbeisen in #149
Improvements
- Improvements to Kafka consumer/producer reliability and correctness by @mwylde in #132
- Implement full_pipeline_codegen proc macro to test pipeline codegen by @jacksonrnewhouse in #135
- Bump datafusion to 25.0.0 by @jacksonrnewhouse in #145
- Add docker.yaml to build and push docker images by @jacksonrnewhouse in #150
- Add basic end-to-end integration test by @mwylde in #108
- Add event tracking by @mwylde in #144
- Helm: Create service account for Postgres deployment by @haoxins in #100
- Enforce prettier and eslint in the github pipeline by @jbeisen in #120
- Check formatting on PRs by @jacksonrnewhouse in #124
- Simplify the code by @chenquan in #147
Fixes
- Fix UI for creating non-raw kafka sources by @jacksonrnewhouse in #95
- Fix query compilation error when sorting floats by @mwylde in #105
- Fix top n optimization for sliding windows by @mwylde in #114
- Fix some edge-cases around make_array() by @jacksonrnewhouse in #123
- Fix overlapping graph nodes for joins by @jbeisen in #134
- Fix data fetching when getClient involves a react hook by @mwylde in #137
- Fix regression in nomad cluster cleaning by @mwylde in #113
- Remove AggregatingStrategy, as that is handled differently by @jacksonrnewhouse in #151
The full changelog is available here.