
Confluent & Arroyo: Partnering to Bring Real-time SQL to Kafka

We are excited to announce that Arroyo is now a Connect with Confluent Partner, making it easier than ever for Confluent customers to integrate with the Arroyo platform. Arroyo extends Kafka with powerful stateful stream processing support, enabling businesses to analyze their data in real-time using SQL.

Micah Wylde

CEO of Arroyo

I'm very excited to announce that Arroyo has joined the Connect with Confluent partner program, making it easier than ever for Confluent customers to integrate with the Arroyo platform.

Confluent Cloud is the data streaming platform that provides cloud-native Apache Kafka via its revolutionary Kora engine. Arroyo extends Kafka with powerful stateful stream processing support, enabling businesses to analyze their data in real-time using SQL.

You can get started with Confluent Cloud by signing up for a free trial, then integrate Arroyo in minutes by following the integration guide.

New to Arroyo?

Arroyo is an open-source stream processing engine, enabling users to transform, filter, aggregate, and join their data streams in real-time, just by writing SQL. Designed for the modern cloud environment, it runs natively on Kubernetes and seamlessly scales while providing exactly-once processing.

Arroyo works with many variants of Kafka, but with this partnership it provides a best-in-class experience for real-time SQL processing with Confluent Kafka.

Why Arroyo and Confluent

Arroyo enhances Confluent Cloud by adding powerful SQL capabilities to Confluent Kafka streams. It enables users to build reliable, correct, and efficient end-to-end streaming pipelines with complex, stateful behaviors like windows, group-bys, and joins. Arroyo integrates seamlessly with Confluent Cloud, with a dedicated connector and full support for the Confluent Schema Registry with JSON and Avro.
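As a sketch of what that integration looks like, a Confluent-backed source can be declared directly in SQL. The table name, fields, and connection profile name below are hypothetical, and the exact connector option keys are documented in the Arroyo integration guide:

-- Hypothetical source table reading Avro-encoded orders from a Confluent
-- Cloud topic. Schema registry and credential details typically live on a
-- connection profile created through the Arroyo UI or API.
CREATE TABLE orders (
    store_id INT,
    amount FLOAT
) WITH (
    connector = 'confluent',
    connection_profile = 'my_confluent_cloud',  -- assumed profile name
    topic = 'orders',
    format = 'avro',  -- JSON is also supported
    type = 'source'
);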

Kafka is also a great fit for stateful stream processing with Arroyo, which can leverage Kafka's offset model and transactional capabilities to provide exactly-once semantics. This means that when using Kafka (or another transactional connector, like S3) as a source and sink, Arroyo ensures that events are never skipped or processed twice; in other words, each event is processed exactly once.
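For a sketch of what an end-to-end exactly-once pipeline might look like, the sink below writes filtered orders back to Kafka; the topic, broker address, and commit-mode option key are assumptions, so check the Kafka connector docs for the exact names:

-- Hypothetical sink topic; with exactly-once commit mode, Arroyo uses
-- Kafka transactions so each event is written exactly once.
CREATE TABLE filtered_orders (
    store_id INT,
    amount FLOAT
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'broker:9092',  -- assumed broker address
    topic = 'filtered_orders',
    format = 'json',
    type = 'sink',
    'sink.commit_mode' = 'exactly_once'
);

INSERT INTO filtered_orders
SELECT store_id, amount
FROM orders
WHERE amount > 10;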

Like Confluent Cloud, Arroyo is designed to scale to any data volume, from tens of events to tens of millions per second. Arroyo is fully open-source and can be self-hosted on your cloud with Kubernetes, or run on Arroyo's serverless cloud.

Building with Arroyo and Kafka

Streaming ETL

With data warehouse costs rising rapidly for many companies, streaming ETL with Arroyo offers a way out. You can filter, aggregate, group, and join data on ingestion, saving substantially on query costs while making data available for query within minutes. And with Kafka and the Arroyo FileSystem sink, you get exactly-once processing, meaning no dropped or duplicated events in your S3-backed data warehouse.
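As an illustrative sketch (the bucket path and schema are hypothetical, and the FileSystem connector's exact options are in the Arroyo docs), a pipeline that pre-aggregates hourly revenue per store before it lands in the warehouse might look like:

-- Hypothetical sink: hourly revenue per store, written as Parquet to S3.
CREATE TABLE store_revenue WITH (
    connector = 'filesystem',
    path = 's3://my-bucket/warehouse/store_revenue',  -- assumed bucket
    format = 'parquet',
    type = 'sink'
);

INSERT INTO store_revenue
SELECT store_id, sum(amount) AS revenue
FROM orders
GROUP BY store_id, tumble(interval '1 hour');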

Real-time Personalization

Consumers today expect their experiences to be personalized and always up-to-date. But that expectation is violated when promotions and other personalized features rely on daily data warehouse jobs. With Arroyo, these batch jobs can be migrated seamlessly to real-time, updating the user experience in seconds instead of hours or days.
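For instance, assuming a hypothetical page_views stream with user_id and product fields, a rolling view of each user's recent activity can be kept fresh every minute:

-- Rolling 30-minute view of per-user product activity, emitted every
-- minute, which a personalization service can consume downstream.
SELECT user_id, product, count(*) AS views
FROM page_views
GROUP BY
    user_id,
    product,
    hop(interval '1 minute', interval '30 minutes');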

Operations monitoring

For manufacturing, logistics, and energy companies, real-time observability is a must. Build monitoring and alerting systems for your factories and fulfillment centers, with millisecond latency. Know immediately when something is going wrong to prevent costly outages and mistakes.
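As a sketch, assuming a hypothetical sensor_readings stream, an alerting query might flag machines whose average temperature over the last minute crosses a threshold, re-evaluated every 5 seconds:

-- Flag machines running hot: 1-minute average temperature, sliding
-- every 5 seconds, filtered to readings above the alert threshold.
SELECT machine_id, avg(temperature) AS avg_temp
FROM sensor_readings
GROUP BY
    machine_id,
    hop(interval '5 seconds', interval '1 minute')
HAVING avg(temperature) > 90;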

Getting started

You can try out Arroyo with Confluent Cloud today in just a few minutes:

  1. If you're not yet a Confluent Cloud user, sign up for a free trial
  2. Start a local Arroyo cluster in Docker with:
docker run -p 5115:5115 \
  ghcr.io/arroyosystems/arroyo:latest
  3. Create a Confluent source or sink by following the integration guide
  4. Write SQL to create a pipeline! For example, this query counts the number of orders by store over a 5 minute sliding window:
SELECT store_id, count(*) as count
FROM orders
WHERE amount > 10
GROUP BY
    store_id,
    hop(interval '5 seconds', interval '5 minutes');
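In hop(), the first interval is the slide (how often results are emitted) and the second is the window width, so this query emits an updated 5-minute count for each store every 5 seconds.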