Working with real-time data can be daunting. We're working to solve that by building a new stream processing engine that's easy enough for anyone to use. So how easy is it to solve real-world streaming problems with Arroyo today? I decided to find out.
You can find all of the code for the tutorial on GitHub: ArroyoSystems/analytics-tutorial.
Walkthrough
Ngrok
To run this locally, you will need to create a tunnel from the internet to your computer. For the tutorial, we used ngrok, but there are many similar tools.
Instrumenting our site
We start by instrumenting our site. For the tutorial, we used our homepage (https://arroyo.dev) to demonstrate. We need some way to get HTTP events from our visitors' browsers to our server.
While there are many ways to accomplish this, for simplicity we used a small JavaScript snippet, wired into our Next.js application, that fires whenever the Next.js router changes.
That code can be found in analytics.ts.
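For a feel of what such a snippet does, here is a minimal sketch of building a pageview event and POSTing it as JSON. It is not the tutorial's analytics.ts: the event fields (path, referrer, timestamp) and the names buildPageviewEvent, sendPageview, and PostFn are assumptions made for illustration.

```typescript
// Hypothetical event shape; the tutorial's analytics.ts may send different fields.
interface PageviewEvent {
  path: string;
  referrer: string;
  timestamp: string; // ISO 8601
}

// Build the event for a given path. Kept pure so it is easy to test.
function buildPageviewEvent(
  path: string,
  referrer: string,
  now: Date = new Date()
): PageviewEvent {
  return { path, referrer, timestamp: now.toISOString() };
}

// The small subset of fetch this sketch relies on, so a stub can be injected.
type PostFn = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
    keepalive?: boolean;
  }
) => Promise<unknown>;

// POST the event as JSON to the collection endpoint (for example, the ngrok URL).
async function sendPageview(
  endpoint: string,
  event: PageviewEvent,
  post: PostFn
): Promise<void> {
  await post(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
    // Ask the browser to let the request finish even if the page unloads.
    keepalive: true,
  });
}
```

In a browser, the global fetch can be passed as the post argument, and the call would be triggered from whatever hook observes route changes in your application.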
You can integrate it into your application by adding the ArroyoPageview component somewhere in your source code (we added ours in src/app/analytics.tsx):
<ArroyoPageview endpoint="<ngrok endpoint>" />
Vector
Vector is a great Swiss Army knife for connecting various data systems and shuttling observability data throughout your data infrastructure. We use its HTTP server source to collect the analytics events and its Kafka sink to expose them to Arroyo, using this vector.toml config file.
Kafka
Kafka is a distributed log that works great with Arroyo due to its ability to provide exactly-once processing. Here we use it as both source and sink to get data to Arroyo and to Postgres, via Debezium.
Arroyo
Arroyo is a powerful yet simple stream processing engine that lets you execute complex SQL queries against streams of data in real-time. Here, Arroyo reads in the raw analytics events from Kafka, performs various windowed aggregations, and writes the results back to Kafka.
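To make "windowed aggregation" concrete, here is a minimal in-memory sketch of one kind of window: a tumbling-window count of views per page. It is only an illustration of the idea. In the tutorial, the aggregation is written in SQL and run by Arroyo over an unbounded stream, and the actual query may use different window types; the PageView and WindowCount shapes and the function name are assumptions.

```typescript
// Hypothetical input event: a page path and an event time in epoch milliseconds.
interface PageView {
  path: string;
  timestampMs: number;
}

// One output row per (window, page) pair.
interface WindowCount {
  windowStartMs: number;
  path: string;
  views: number;
}

// Assign each event to the fixed-size window containing its timestamp,
// then count events per (window, page).
function countViewsPerWindow(
  events: PageView[],
  windowMs: number
): WindowCount[] {
  const counts = new Map<string, WindowCount>();
  for (const e of events) {
    // Round the timestamp down to the start of its window.
    const windowStartMs = Math.floor(e.timestampMs / windowMs) * windowMs;
    const key = `${windowStartMs}|${e.path}`;
    const existing = counts.get(key);
    if (existing) {
      existing.views += 1;
    } else {
      counts.set(key, { windowStartMs, path: e.path, views: 1 });
    }
  }
  // Sort by window start, then by page, for stable output.
  return Array.from(counts.values()).sort(
    (a, b) =>
      a.windowStartMs - b.windowStartMs || a.path.localeCompare(b.path)
  );
}
```

Unlike this batch version, a stream processor has to decide when a window is complete and emit results incrementally, which is the part Arroyo handles for you.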
The final query we use can be found in the tutorial repository: https://github.com/ArroyoSystems/analytics-tutorial
Debezium
Debezium supports connecting relational databases like Postgres and MySQL to Kafka, providing both sources to read from DBs and sinks to write to them. Arroyo integrates with Debezium, and here we use it to sink our query results to Postgres.
PostgreSQL
PostgreSQL is a powerful RDBMS that we use to store our results for querying.
Grafana
Grafana makes it easy to build dashboards, and includes a Postgres plugin that allows us to query results directly from the database. We use this to visualize the results.
Running the tutorial
The tutorial components are packaged up via Docker Compose. With Docker installed, you should just need to run:
$ git clone https://github.com/ArroyoSystems/analytics-tutorial.git
$ cd analytics-tutorial
$ docker compose up
Once everything has finished, open http://localhost:5115/pipelines/new to create the pipeline.
Paste in the query from the tutorial repository and click "Start Pipeline."
Graphing the results
Open up Grafana at http://localhost:3000. Then create a Postgres data source with the options:
- Host: postgres
- Database: postgres
- User: postgres
- TLS/SSL Mode: disable
Then we can graph the metrics using this query:
SELECT sum(value), time, tag
FROM metrics
WHERE metric = 'views_15_minute' AND $__timeFilter(time)
GROUP BY time, tag
ORDER BY time;
(Make sure to change the format to Time series.)
We can show the top pages in a table with the query:
SELECT page, count, rank
FROM top_pages
WHERE time = (
    SELECT max(time) FROM top_pages
)
ORDER BY rank;
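The subquery keeps only the rows from the most recent timestamp, so the table always shows the latest ranking rather than every ranking ever written. As an in-memory analogy, the same selection can be sketched as follows; the TopPageRow shape mirrors the columns used in the query, with time represented as epoch milliseconds (an assumption), and latestTopPages is a hypothetical helper name.

```typescript
// Mirrors the columns referenced in the query; timeMs as epoch milliseconds is an assumption.
interface TopPageRow {
  page: string;
  count: number;
  rank: number;
  timeMs: number;
}

// Keep only rows from the most recent timestamp, ordered by rank.
function latestTopPages(rows: TopPageRow[]): TopPageRow[] {
  if (rows.length === 0) return [];
  // Equivalent of: SELECT max(time) FROM top_pages
  const latest = rows.reduce((max, r) => Math.max(max, r.timeMs), -Infinity);
  return rows
    .filter((r) => r.timeMs === latest)
    .sort((a, b) => a.rank - b.rank);
}
```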
Get in touch
If you need help or have any questions, comments, or suggestions, you can find us on Discord.