Building an Apache Arrow Flight SQL Server in Java

Apache Arrow defines a language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Apache Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Apache Arrow has become the de facto standard for working with data at speed; most major data platforms support the format. On top of the memory format, the Apache Arrow project defines two protocols: Apache Arrow Flight and Apache Arrow Flight SQL. Apache Arrow Flight focuses on transferring data from one or more servers to a client. Apache Arrow Flight SQL extends this protocol with higher-level primitives typically found in a relational database (hence the name Flight SQL).

In this post, we will explore how to build a simple relational database using Apache Arrow Flight SQL and Apache DataFusion (a relational query engine). In particular, we’ll be using the Java programming language, with the Apache Arrow Java implementation, and the DataFusion Java bindings, which are not yet officially part of Apache Arrow or Apache DataFusion but live in the DataFusion-Contrib repository.

What's happening at home?

Enough with the purely random time series. Let’s dive into generating time series that are based on a model simulation (yes, still randomness, but with some recognizable structure). For example, let’s generate a set of time series based on fictitious sensors in a fictitious apartment building.

The apartment we are simulating has five different sensors:

  1. Temperature: in degrees Celsius
  2. Humidity: humidity as a percentage
  3. CO2: parts per million (ppm) of CO2 (rises with the presence of humans)
  4. Light: light intensity in lux
  5. Motion: whether motion is detected
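
As a rough illustration of what such a simulation could look like, here is a minimal Python sketch using NumPy and pandas. The occupancy schedule, baselines, and noise levels are illustrative assumptions, not the values used in the post.

```python
# Minimal, illustrative sketch (assumed parameters, not the post's exact model):
# simulate one day of per-minute readings for the five sensors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
index = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
hour = np.arange(len(index)) / 60.0

# Assumption: the apartment is occupied in the morning and evening.
occupied = ((hour >= 7) & (hour < 9)) | ((hour >= 18) & (hour < 23))

df = pd.DataFrame(index=index)
# Temperature (°C): daily sinusoid around 20 °C plus measurement noise.
df["temperature"] = 20 + 2 * np.sin(2 * np.pi * (hour - 14) / 24) + rng.normal(0, 0.1, len(index))
# Humidity (%): slow daily swing around 45 % with noise.
df["humidity"] = 45 + 5 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 0.5, len(index))
# CO2 (ppm): baseline of 400 ppm, drifting up while people are present.
df["co2"] = 400 + np.cumsum(np.where(occupied, 1.5, -1.0)).clip(0, 800) + rng.normal(0, 5, len(index))
# Light (lux): daylight curve that is zero at night.
df["light"] = 500 * np.clip(np.sin(2 * np.pi * (hour - 6) / 24), 0, None) + rng.normal(0, 5, len(index))
# Motion (bool): occasional detections while occupied.
df["motion"] = occupied & (rng.random(len(index)) < 0.2)

print(df.head())
```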

Inverse Fourier for Repeating Pattern

The Fourier transform decomposes a signal into a sum of weighted sine waves. The rationale behind the method is that slow-moving sine waves capture the general trend of the time series, whereas the fast-moving sine waves capture the details.

Why decompose a time series into sine waves? Noise is seen in the details; thus, removing the fastest-moving sine waves corresponds to removing noise. Denoising is an example of an operation that is easy to express on sine waves, but very difficult on the original time series.

In this blog post, we will use the inverse of the Fourier transform: starting from a set of sine waves, we recompose the time series. The contribution of each sine wave in the set is chosen at random.
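
A minimal sketch of that idea in Python with NumPy might look as follows; the number of sine components, their frequencies, and the amplitude range are arbitrary assumptions here.

```python
# Illustrative sketch: recompose a repeating signal from sine waves
# whose contributions (amplitudes) are chosen at random.
import numpy as np

rng = np.random.default_rng(7)
n = 1024                                    # number of samples to generate
t = np.arange(n)

n_waves = 8                                 # assumed number of sine components
freqs = rng.integers(1, 20, size=n_waves)   # whole cycles over the series
amps = rng.uniform(0.1, 1.0, size=n_waves)  # random contribution per wave
phases = rng.uniform(0, 2 * np.pi, size=n_waves)

# Weighted sum of sine waves; the result repeats because every frequency
# completes a whole number of cycles over the n samples.
series = sum(a * np.sin(2 * np.pi * f * t / n + p)
             for f, a, p in zip(freqs, amps, phases))
```

The same effect can be obtained by filling a frequency spectrum with random coefficients and calling np.fft.irfft, which is the inverse Fourier transform proper.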

Walking Randomly

In the previous blog post, I wrote about generating random time series data. It was a first taste of time series and of generating data with Python. In this post, I want to add historical context to the time series data. A (Gaussian) random walk takes a random step at each time step. The step is drawn from a normal distribution and added to the value of the previous point. As such, historical context builds up.
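
The core of a Gaussian random walk fits in a couple of lines of Python; here is a minimal sketch, where the step size and series length are arbitrary choices.

```python
# Minimal sketch: each point is the previous value plus a step drawn
# from a normal distribution, so the history accumulates over time.
import numpy as np

rng = np.random.default_rng(0)
steps = rng.normal(loc=0.0, scale=1.0, size=1_000)  # Gaussian steps
walk = np.cumsum(steps)                             # running sum = random walk
```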

Generating Test Data

I’m writing this blog to learn about time series, programming in Python and Rust, and database architecture. In this post, I want to get started with generating time series data.

What is a time series? Simply put, a time series is a sequence of data points indexed in time order. You encounter time series data every day — think of sensor measurements, sales data, stock prices, and weather forecasts.

The goal is to get a feel for what time series data looks like, what types of time series there are, what some of their properties are, and how to generate realistic-looking fake data. Having test data with known properties will be extremely useful for testing and benchmarking time series databases.
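
As a starting point, a purely random (white-noise) series with a timestamp index can be generated in a few lines of Python; the frequency and length below are arbitrary choices for illustration.

```python
# Illustrative sketch: a purely random time series with one value per second.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
index = pd.date_range("2024-01-01", periods=1_000, freq="s")
series = pd.Series(rng.normal(0.0, 1.0, size=len(index)), index=index, name="value")
print(series.head())
```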

Welcome to my blog

Hi!

I’m Joris, a programmer and database enthusiast. I’m currently a Staff Software Engineer at a company that builds time series analytics software.

Why am I starting this blog? Because of the evolution of database technologies and the new possibilities they bring. In this space, I’ll explore topics related to database architecture, time series, query languages, composable data systems, and programming in general.

The recent emergence of technologies like Apache Arrow & DataFusion, DuckDB, and the data lakehouse architecture has made building custom databases more accessible than ever. These innovations are reshaping the data landscape, and I’m excited to delve into them and share my findings with you.

Everything on this blog represents my personal views, evolving over time. I hope my writing inspires you to explore these topics further and perhaps even challenge my perspectives. Feel free to reach out with your thoughts, questions, or suggestions—I’d love to hear from you!
