Database: A Guide to Evolvable Data Models and Encoding (Thrift, Protobuf, Avro)

Schema changes are a critical aspect of application evolution. As applications grow and adapt to new requirements, their data models must also evolve. The ability to handle schema changes efficiently is crucial for maintaining application performance and user experience.

Schema

A schema defines the structure of the data an application stores or exchanges: the fields, their types, and how they relate to one another. It acts as a contract between the code that writes data and the code that reads it, and it is the thing that evolves as the data model changes.

Encoding

Encoding is the process of converting data into a specific format so that it can be stored or transmitted efficiently. Think of it as a way of packaging data into a format that can be easily saved to disk, loaded into memory, or sent over a network.
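As a minimal illustration in Python, here is a user record (it will serve as the running example throughout this post) being encoded to bytes and decoded back:

```python
import json

# In-memory object: a user profile record
user = {
    "name": "John Doe",
    "email": "john.doe@example.com",
    "interests": ["reading", "traveling", "coding"],
}

encoded = json.dumps(user).encode("utf-8")      # object -> bytes (encoding)
decoded = json.loads(encoded.decode("utf-8"))   # bytes -> object (decoding)
assert decoded == user
```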

Why Encoding Matters?

When schema changes occur, corresponding changes to application code are usually required to handle the new data structures. However, these code changes can't always be deployed instantaneously, especially in large applications. Rolling upgrades for server-side applications and delayed updates for client-side applications mean that old and new code, along with old and new data formats, may coexist for some time.

To maintain smooth operation during this period, it's essential to ensure both backward and forward compatibility. Backward compatibility allows newer code to read data written by older code, while forward compatibility enables older code to ignore or gracefully handle new additions made by newer code.

Types of encoding

Using a language's built-in serialization (such as Java's Serializable, Python's pickle, or Ruby's Marshal) to encode in-memory objects into byte sequences is generally discouraged for several reasons:

  1. Language lock-in: the encoding is tied to a particular programming language, making it difficult to read the data from another language.
  2. Security: decoding requires the ability to instantiate arbitrary classes, which attackers can exploit to execute arbitrary code.
  3. Versioning: these libraries often treat backward and forward compatibility as an afterthought.
  4. Efficiency: encoded size and encoding/decoding performance are often poor (Java's built-in serialization is notorious for both).

Due to these issues, it's generally recommended to avoid language-specific serialization for anything other than short-term or transient purposes. Instead, standardized encoding formats like JSON, XML, Protocol Buffers, Thrift, or Avro are preferable.

JSON (JavaScript Object Notation)

JSON is easy to read and write, making it ideal for configuration files, data interchange, and debugging. It is supported by most programming languages, enabling near-universal data interchange, and its schema-less nature allows flexible, dynamic data structures. Built-in libraries in most languages make parsing and generating JSON straightforward.

Consider the example record from earlier:

```json
{
  "name": "John Doe",
  "email": "john.doe@example.com",
  "interests": ["reading", "traveling", "coding"]
}
```

Encoded with MessagePack, a binary encoding for JSON-like data, it becomes:

```
/* in hexadecimal */
83 A4 6E 61 6D 65 A8 4A 6F 68 6E 20 44 6F 65 A5 65 6D 61 69 6C B4
6A 6F 68 6E 2E 64 6F 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D A9 69
6E 74 65 72 65 73 74 73 93 A7 72 65 61 64 69 6E 67 A9 74 72 61 76
65 6C 69 6E 67 A6 63 6F 64 69 6E 67

/* Let's decode the binary output piece by piece. */
83 - map with 3 key-value pairs
A4 - string of length 4
6E 61 6D 65 - "name" (key)
A8 - string of length 8
4A 6F 68 6E 20 44 6F 65 - "John Doe" (value)
A5 - string of length 5
65 6D 61 69 6C - "email" (key)
B4 - string of length 20
6A 6F 68 6E 2E 64 6F 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D - "john.doe@example.com" (value)
A9 - string of length 9
69 6E 74 65 72 65 73 74 73 - "interests" (key)
93 - array with 3 elements
A7 - string of length 7
72 65 61 64 69 6E 67 - "reading" (value 1)
A9 - string of length 9
74 72 61 76 65 6C 69 6E 67 - "traveling" (value 2)
A6 - string of length 6
63 6F 64 69 6E 67 - "coding" (value 3)

# No end-of-struct marker is needed: the map header already says how many
# entries follow. Total bytes = 78
```
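To reproduce this dump, here is a minimal sketch using the third-party msgpack package (assuming it is installed, e.g. via pip install msgpack):

```python
import msgpack  # pip install msgpack

user = {
    "name": "John Doe",
    "email": "john.doe@example.com",
    "interests": ["reading", "traveling", "coding"],
}

encoded = msgpack.packb(user)   # dict -> MessagePack bytes
print(len(encoded))             # 78
print(encoded.hex(" "))         # 83 a4 6e 61 6d 65 a8 4a 6f 68 6e ...
assert msgpack.unpackb(encoded) == user
```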

Limitations:

  1. Numbers: JSON doesn't distinguish integers from floating-point numbers, and integers larger than 2^53 lose precision in parsers that use IEEE 754 doubles.
  2. Binary strings: there is no support for raw byte sequences; workarounds like Base64 encoding inflate the data size.
  3. Schemas: schema support is optional and rarely used, so producers and consumers must agree on the structure by convention.
  4. Verbosity: field names are repeated in every record, wasting space compared to binary formats.

XML (Extensible Markup Language)

The same record in XML:

```xml
<userProfile>
  <name>John Doe</name>
  <email>john.doe@example.com</email>
  <interests>
    <interest>reading</interest>
    <interest>traveling</interest>
    <interest>coding</interest>
  </interests>
</userProfile>
```

XML is designed to be both human- and machine-readable, making it easy to understand and debug. It can represent complex data relationships and hierarchies, and schema languages like DTD and XSD enable rigorous validation of document structure and content.

Limitations:

  1. Verbosity: opening and closing tags make documents much larger than binary encodings of the same data.
  2. Like JSON, XML cannot distinguish numbers from strings without an external schema.
  3. Parsing is comparatively complex and slow.

Binary Encoding Formats: Why Use Them?

While JSON and XML are widely used for their readability and flexibility, binary encoding formats like Protocol Buffers (Protobuf), Thrift, and Avro offer significant advantages in terms of efficiency, performance, and data consistency. Here’s why binary encoding formats are often preferred:

  1. Compactness: field names are replaced by numeric tags (or dropped entirely), so encoded records are much smaller.
  2. Speed: parsing binary data is faster than parsing text, with no string-to-number conversions.
  3. Schema enforcement: a required schema acts as documentation and guarantees that producers and consumers agree on the structure.
  4. Schema evolution: explicit rules for adding and removing fields give well-defined backward and forward compatibility.

Use Cases in Modern Databases

Binary encodings appear throughout modern data systems: gRPC services exchange Protobuf messages, Kafka pipelines commonly pair Avro with a schema registry, and Hadoop tooling reads and writes Avro container files.

Apache Thrift Encoding

Apache Thrift is a binary serialization format developed by Facebook for efficient data interchange between different programming languages. Thrift requires an Interface Definition Language (IDL) file to define the structure of the data.

```thrift
struct UserProfile {
  1: required string name,
  2: required string email,
  3: optional list<string> interests
}
```

Thrift Binary Protocol

Apache Thrift serializes data according to the Thrift IDL definition and the Thrift Binary Protocol. Here's the binary protocol output for the above example:

```
/* in hexadecimal */
0b 00 01 00 00 00 08 4a 6f 68 6e 20 44 6f 65
0b 00 02 00 00 00 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c
65 2e 63 6f 6d
0f 00 03 0b 00 00 00 03
00 00 00 07 72 65 61 64 69 6e 67
00 00 00 09 74 72 61 76 65 6c 69 6e 67
00 00 00 06 63 6f 64 69 6e 67
00

/* Let's decode the binary output piece by piece. */
0b          - type: string (11)
00 01       - field ID: 1
00 00 00 08 - length: 8
4a 6f 68 6e 20 44 6f 65 - "John Doe" (value)
0b          - type: string (11)
00 02       - field ID: 2
00 00 00 14 - length: 20 (0x14)
6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d - "john.doe@example.com" (value)
0f          - type: list (15)
00 03       - field ID: 3
0b          - element type: string (11)
00 00 00 03 - number of elements: 3
00 00 00 07 - length: 7
72 65 61 64 69 6e 67 - "reading" (value 1)
00 00 00 09 - length: 9
74 72 61 76 65 6c 69 6e 67 - "traveling" (value 2)
00 00 00 06 - length: 6
63 6f 64 69 6e 67 - "coding" (value 3)
00          - stop field (end of struct)

# Total bytes = 85
```

To deserialize the Thrift-encoded binary data back into a structured object, the reader walks the byte stream, using each field's type and ID to interpret the bytes that follow, until it reaches the stop byte. Both directions are handled by generated code, as the sketch below shows.
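Here is a minimal round-trip sketch using the official thrift Python package; the generated module name (user_profile) is an assumption based on the IDL above:

```python
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
# Generated by: thrift --gen py user_profile.thrift
# (file and module names are hypothetical)
from user_profile.ttypes import UserProfile

profile = UserProfile(name="John Doe",
                      email="john.doe@example.com",
                      interests=["reading", "traveling", "coding"])

# Serialize: write the struct into an in-memory buffer
write_buf = TTransport.TMemoryBuffer()
profile.write(TBinaryProtocol.TBinaryProtocol(write_buf))
data = write_buf.getvalue()  # 85 bytes for this example

# Deserialize: read the bytes back into a fresh struct
read_buf = TTransport.TMemoryBuffer(data)
restored = UserProfile()
restored.read(TBinaryProtocol.TBinaryProtocol(read_buf))
assert restored == profile
```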

At 85 bytes, Thrift's Binary Protocol output is slightly larger than the MessagePack encoding (78 bytes). The overhead comes from the fixed-size field headers: one type byte plus 2 bytes for each field ID, and 4 bytes for every length. In return, Thrift never stores the key strings themselves, only compact field IDs, so as records grow and repeat, the binary protocol becomes relatively more efficient than formats that embed every key.

Thrift Compact Protocol

The Thrift Compact Protocol is an encoding scheme designed to reduce the size of serialized data for efficient network transmission and storage. It achieves this by using variable-length encoding and other space-saving techniques to represent data more compactly compared to the standard Thrift Binary Protocol.
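Switching protocols is a one-line change in the generated-code workflow; a sketch, reusing the hypothetical generated module from the earlier example:

```python
from thrift.protocol import TCompactProtocol
from thrift.transport import TTransport
from user_profile.ttypes import UserProfile  # hypothetical generated module

profile = UserProfile(name="John Doe",
                      email="john.doe@example.com",
                      interests=["reading", "traveling", "coding"])

buf = TTransport.TMemoryBuffer()
profile.write(TCompactProtocol.TCompactProtocol(buf))  # compact instead of binary
print(len(buf.getvalue()))  # 60 bytes, versus 85 with the binary protocol
```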

Here's the compact protocol output for the above example:

```
/* in hexadecimal */
18 08 4a 6f 68 6e 20 44 6f 65
18 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
19 38
07 72 65 61 64 69 6e 67
09 74 72 61 76 65 6c 69 6e 67
06 63 6f 64 69 6e 67
00

/* Let's decode the binary output piece by piece. */
18 - field ID delta (1) and type (string, 8) packed into one byte
08 - length of string (8)
4a 6f 68 6e 20 44 6f 65 - "John Doe" (value)
18 - field ID delta (1, so field 2) and type (string, 8)
14 - length of string: 20 (0x14)
6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d - "john.doe@example.com" (value)
19 - field ID delta (1, so field 3) and type (list, 9)
38 - element count (3) and element type (string, 8) packed into one byte
07 - length of string (7)
72 65 61 64 69 6e 67 - "reading" (value 1)
09 - length of string (9)
74 72 61 76 65 6c 69 6e 67 - "traveling" (value 2)
06 - length of string (6)
63 6f 64 69 6e 67 - "coding" (value 3)
00 - stop field (end of struct)

# Total bytes = 60
```

These optimizations make the Compact Protocol particularly useful in scenarios where bandwidth or storage is limited, such as mobile applications or networked services with high data throughput requirements.

Backward Compatibility in Thrift

Backward compatibility means that new code can read data written by old code. Thrift achieves this through stable field IDs, optional fields, and default values:

  1. Fields are identified by numeric IDs rather than names, so new code can still locate every field written by old code.
  2. New fields are added as optional, so records written before the field existed remain readable.
  3. Default values fill in any field that is missing from old data.

Field IDs must never be renumbered or reused once they have been deployed.

Forward Compatibility in Thrift

Forward compatibility ensures that older versions of services can read data written by newer versions. Thrift supports this by allowing fields to be added or removed in a way that doesn’t break older clients or servers.

```thrift
struct User {
  1: required string name;
  2: optional i32 age = 0;
  3: optional string email;  // new field: old clients simply skip it
}
```

Protobuf (Protocol Buffers)

Protocol Buffers (Protobuf) is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It was developed by Google and is widely used for data interchange in microservices and other applications.

Similar to Thrift, you define the structure of your data in a .proto file. Protobuf generates code in multiple languages based on the schema definition. You use the generated code to serialize and deserialize data to/from a compact binary format. Here's how you would define the same structure in a Protobuf .proto file for the previous example data:

syntax = "proto3"; message UserProfile

Protocol Buffers use a combination of field numbers and wire types to encode data. The key points are:

  1. Field Number: identifies the field within the message.
  2. Wire Type: indicates the format of the data for that field (e.g., length-delimited, varint, 32-bit).
  3. The field number and wire type are packed into a single byte (or more, for field numbers above 15) as a varint: tag = (field_number << 3) | wire_type, as the sketch below demonstrates.
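A small Python sketch of that tag computation, checked against the tag bytes that appear in the dump below:

```python
def tag_byte(field_number: int, wire_type: int) -> int:
    """Pack a field number and wire type into a Protobuf tag
    (a single byte for field numbers up to 15)."""
    return (field_number << 3) | wire_type

LENGTH_DELIMITED = 2  # wire type for strings, bytes, embedded messages

assert tag_byte(1, LENGTH_DELIMITED) == 0x0A  # name
assert tag_byte(2, LENGTH_DELIMITED) == 0x12  # email
assert tag_byte(3, LENGTH_DELIMITED) == 0x1A  # interests
```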

The ProtoBuf output for the above example:

```
/* in hexadecimal */
0a 08 4a 6f 68 6e 20 44 6f 65
12 14 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
1a 07 72 65 61 64 69 6e 67
1a 09 74 72 61 76 65 6c 69 6e 67
1a 06 63 6f 64 69 6e 67

/* Let's break it down */
0a - 0000 1010 in binary
     - the low three bits (010) are the wire type (2, length-delimited)
     - the remaining bits (0000 1) are the field number (1)
08 - length of string (8)
4a 6f 68 6e 20 44 6f 65 - "John Doe" (value)
12 - 0001 0010: wire type 2, field number 2
14 - length of string: 20 (0x14)
6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d - "john.doe@example.com" (value)
1a - 0001 1010: wire type 2, field number 3
07 - length of string (7)
72 65 61 64 69 6e 67 - "reading" (value 1)
1a - field number 3, wire type 2 (repeated fields repeat the tag)
09 - length of string (9)
74 72 61 76 65 6c 69 6e 67 - "traveling" (value 2)
1a - field number 3, wire type 2
06 - length of string (6)
63 6f 64 69 6e 67 - "coding" (value 3)

# Protobuf messages have no end-of-struct marker; the message simply ends
# when the bytes run out. Total bytes = 60
```

Backward Compatibility in Protobuf

Backward compatibility in Protobuf is achieved by using field numbers and sensible defaults:

  1. Every field is identified by its number, which must never change once deployed.
  2. In proto3, fields missing from old data simply take their zero/default values, so new code reading old data sees sensible defaults for fields that didn't exist yet.
  3. Numbers of removed fields should be marked reserved so they are never accidentally reused with a different meaning.

Forward Compatibility in Protobuf

Forward compatibility in Protobuf allows older services to understand data written by newer versions:

  1. When old code encounters an unknown field number, the wire type tells it exactly how many bytes to skip, so the field is ignored without breaking parsing.
  2. As long as new fields always use fresh field numbers, old binaries keep working unmodified.

Avro

Apache Avro is a data serialization system that provides rich data structures, a compact and fast binary data format, and a container file format for persistent data. It was designed for use with Hadoop, but it can be used independently of Hadoop as well. Avro uses JSON to define schemas and a binary format to encode data. Data is always written and read with a schema, and Avro container files embed the writer's schema, making them self-describing. First we have to create an Avro schema (.avsc) for our data.
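A schema matching the example record might look like this (record and field names are chosen to mirror the earlier examples):

```json
{
  "type": "record",
  "name": "UserProfile",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
```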

Use the Avro library to serialize the data according to the schema. Avro uses zigzag encoding for integers to efficiently encode negative numbers. Zigzag encoding maps signed integers to unsigned integers in such a way that small magnitude signed integers (both positive and negative) have small magnitude unsigned integer representations. This is beneficial because Avro uses variable-length encoding for integers, so smaller numbers use fewer bytes.
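A serialization sketch with the avro Python package (function names vary slightly across versions, e.g. parse vs. Parse; the .avsc file name is hypothetical):

```python
import io
import avro.schema  # pip install avro
from avro.io import DatumWriter, BinaryEncoder

schema = avro.schema.parse(open("user_profile.avsc").read())

user = {"name": "John Doe", "email": "john.doe@example.com",
        "interests": ["reading", "traveling", "coding"]}

buf = io.BytesIO()
DatumWriter(schema).write(user, BinaryEncoder(buf))
print(len(buf.getvalue()))  # 57 bytes: no field names or tags in the output
```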

Zigzag Encoding

In zigzag encoding, the least significant bit (LSB) of the encoded value stores the sign of the integer, and the remaining bits store the magnitude, shifted left by one (for a negative number n, the remaining bits hold |n| - 1). The encoding works as follows:

\(\text{zigzag}(n) = (n \ll 1) \oplus (n \gg 31)\)

(for 32-bit integers; 64-bit values use a 63-bit shift, and \(\gg\) is an arithmetic shift that replicates the sign bit)

```
+5 : 0000 0101 -> ( 5 << 1) ^ ( 5 >> 31) = 0000 1010 = 0x0a (decimal 10)
-5 : 1111 1011 -> (-5 << 1) ^ (-5 >> 31) = 0000 1001 = 0x09 (decimal 9)
```
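The same mapping as a pair of Python functions, checked against the worked example above:

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    # (n << 1) XOR (n >> bits-1); the arithmetic right shift smears the
    # sign bit, so small negative numbers become small positive ones.
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

assert zigzag_encode(5) == 10   # 0x0a
assert zigzag_encode(-5) == 9   # 0x09
assert zigzag_decode(9) == -5
```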

Avro encoded output for the above example is:

```
/* in hexadecimal */
10 4a 6f 68 6e 20 44 6f 65
28 6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d
06
0e 72 65 61 64 69 6e 67
12 74 72 61 76 65 6c 69 6e 67
0c 63 6f 64 69 6e 67
00

/* Let's break it down */
10 - zigzag varint: 0x10 = 16, decoded as 16 >> 1 = 8 - length of string
4a 6f 68 6e 20 44 6f 65 - "John Doe" (value)
28 - zigzag varint: 0x28 = 40, decoded as 40 >> 1 = 20 - length of string
6a 6f 68 6e 2e 64 6f 65 40 65 78 61 6d 70 6c 65 2e 63 6f 6d - "john.doe@example.com" (value)
06 - zigzag varint: 6 >> 1 = 3 - number of items in the first array block
0e - zigzag varint: 14 >> 1 = 7 - length of string
72 65 61 64 69 6e 67 - "reading" (value 1)
12 - zigzag varint: 18 >> 1 = 9 - length of string
74 72 61 76 65 6c 69 6e 67 - "traveling" (value 2)
0c - zigzag varint: 12 >> 1 = 6 - length of string
63 6f 64 69 6e 67 - "coding" (value 3)
00 - a block count of 0 marks the end of the array

# There are no field names or IDs at all: fields are identified purely by
# their position in the writer's schema. Total bytes = 57
```

Forward and backward compatibility

In Apache Avro, the concept of forward and backward compatibility is achieved through the use of writer and reader schemas. This compatibility allows systems to evolve over time without breaking existing data. Here’s a detailed explanation of how Avro ensures forward and backward compatibility:

Compatibility Rules

Avro does not care about the order of fields in the schema; it matches fields by name. In Avro, specific rules determine whether a schema change is backward or forward compatible:

  1. Adding Fields: a new field must have a default value. New readers fill in the default when reading old data (backward compatible), and old readers simply ignore the field (forward compatible). See the sketch after this list.
  2. Removing Fields: only a field that has a default value may be removed; old readers then fall back to the default when reading new data.
  3. Changing Field Types: allowed only where Avro can promote the writer's type to the reader's type (e.g., int → long → float → double).
  4. Renaming Fields: handled via aliases in the reader's schema; this is backward compatible but not forward compatible, since old readers don't know the new name.
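A sketch of rule 1 with the avro Python package (API names vary slightly across versions): a record written with the old schema is read with a newer reader schema that adds a defaulted field.

```python
import io
import json
import avro.schema
from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder

V1 = avro.schema.parse(json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}))
V2 = avro.schema.parse(json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int", "default": 0}],  # new field
}))

# Write with the old (writer's) schema...
buf = io.BytesIO()
DatumWriter(V1).write({"name": "John Doe"}, BinaryEncoder(buf))
buf.seek(0)

# ...and read with the new (reader's) schema: the default fills the gap.
record = DatumReader(V1, V2).read(BinaryDecoder(buf))
assert record == {"name": "John Doe", "age": 0}
```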

Maintaining backward and forward compatibility is essential in distributed systems to ensure smooth communication and service evolution. Apache Thrift, Protocol Buffers, and Apache Avro provide robust mechanisms to handle compatibility: Thrift and Protobuf pin every field to a numeric tag that must never change, while Avro resolves differences between the writer's and reader's schemas at read time.

Understanding these mechanisms allows developers to design resilient systems capable of evolving without disrupting existing services. Each framework has its strengths and use cases, and the choice of which to use depends on the specific requirements of your project.

Dataflow Through Services

In our journey through various encoding methods and their applications, we have explored how different data formats and protocols enable efficient communication between systems. Now, let's delve deeper into how these encoding methods play a crucial role in dataflow through services, particularly focusing on REST (Representational State Transfer) and RPC (Remote Procedure Call). Understanding these concepts is vital as they form the backbone of modern web services and microservices architecture, influencing how systems are designed, deployed, and maintained.

Clients and Servers: The Foundation of Network Communication

The client-server model is the most common arrangement for network communication. In this model, servers expose APIs over the network, and clients make requests to these APIs. This interaction is fundamental to how the web operates, with web browsers acting as clients making HTTP requests to web servers to fetch resources such as HTML, CSS, and JavaScript.

But web browsers are not the only clients. Native apps on mobile devices and desktop computers also make network requests, as do client-side JavaScript applications using techniques like Ajax. In these cases, the server's response is often in a format like JSON, which is convenient for further processing by the client-side application.

Service-Oriented Architecture (SOA) and Microservices

To manage the complexity of large applications, the service-oriented architecture (SOA) approach decomposes applications into smaller, function-specific services. This approach has evolved into what we now call microservices architecture. Microservices promote independently deployable and evolvable services, allowing teams to release new versions without extensive coordination.

In microservices architecture, services provide an application-specific API that controls what clients can do. This encapsulation ensures that services impose fine-grained restrictions, enhancing security and maintainability.

REST and Web Services

When HTTP is used as the protocol for communication, it is termed a web service. Web services are prevalent in various contexts, such as:

  1. Client Applications: native apps or JavaScript web apps making HTTP requests to services over the internet.
  2. Inter-service Communication: services within the same organization communicating over HTTP, often within the same datacenter.
  3. Cross-Organizational Communication: services exchanging data over the internet, such as public APIs for payment processing or OAuth.

REST is a design philosophy built on HTTP principles, emphasizing simple data formats and the use of URLs to identify resources. It leverages HTTP features for cache control, authentication, and content type negotiation, making it a popular choice for microservices and cross-organizational service integration.
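For instance, a client fetching a resource identified by a URL, with the content type negotiated via headers (the endpoint here is hypothetical):

```python
import requests  # pip install requests

resp = requests.get(
    "https://api.example.com/users/123",      # hypothetical resource URL
    headers={"Accept": "application/json"},   # content-type negotiation
)
resp.raise_for_status()
user = resp.json()  # the resource representation, typically JSON
```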

SOAP: The Alternative to REST

SOAP (Simple Object Access Protocol) is an XML-based protocol for making network API requests. Unlike REST, SOAP aims to be independent of HTTP, using a complex set of standards known as WS-*. SOAP's API is described using the Web Services Description Language (WSDL), which enables code generation for accessing remote services.

Despite its structured approach, SOAP has fallen out of favor in many organizations due to its complexity and interoperability issues. RESTful APIs, with their simpler, more flexible design, have become the predominant style for public APIs.

Remote Procedure Calls (RPC)

Remote Procedure Calls (RPC) are a powerful mechanism for enabling communication between different services or components over a network. RPC allows a program to cause a procedure (subroutine) to execute in another address space (commonly on another physical machine). This is done in a way that is intended to make the remote procedure call look and behave like a local call within the same program. The concept of RPC has been around since the 1970s and remains a cornerstone of distributed computing.
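Python's standard library includes a simple RPC flavor (XML-RPC) that makes the "remote call looks local" idea concrete; the server address and the add method are hypothetical:

```python
from xmlrpc.client import ServerProxy

proxy = ServerProxy("http://localhost:8000/")  # hypothetical RPC server
result = proxy.add(2, 3)  # reads like a local call, but runs on the server
```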

Key Characteristics:

  1. Location transparency: a remote call is written to look like a local function call.
  2. Interface definitions: an IDL describes the available procedures, and client and server stubs are generated from it.
  3. Request/response semantics: the caller typically blocks (or awaits) until the result arrives.

Challenges with RPC:

  1. A network request is fundamentally different from a local call: it can be lost, time out, or return after an arbitrary delay.
  2. Retrying after a timeout risks executing the action twice unless operations are made idempotent.
  3. Latency is highly variable and far worse than a local function call.
  4. All parameters must be encoded and sent over the network; there is no pass-by-reference, which is awkward for large objects.
  5. Client and server may be written in different languages, so datatypes must be translated between type systems.

Modern RPC Frameworks

Modern RPC frameworks have evolved to address some of these challenges:

  1. gRPC: built on Protocol Buffers; supports streams of requests and responses, not just single values.
  2. Finagle: built on Thrift; uses futures to manage and compose asynchronous calls.
  3. Rest.li: uses JSON over HTTP and promises for asynchronous operations.

Rather than pretending a remote call is local, these frameworks make its remote nature explicit, for example by returning futures or promises instead of blocking.

These frameworks provide advanced features like service discovery, load balancing, and built-in support for retries and idempotence, making them suitable for building scalable and resilient distributed systems.

Understanding the intricacies of dataflow through services, particularly REST and RPC, is crucial for designing scalable and maintainable systems. By leveraging the right encoding techniques and ensuring compatibility, developers can build robust services that support independent evolution and seamless communication across different platforms and organizational boundaries. As we continue to explore encoding methods and their applications, the principles of REST and RPC will remain foundational to modern web services and microservices architecture.

Message-Passing Dataflow

Previously, we discussed how encoded data flows from one process to another using REST and RPC, where one process sends a request over the network and expects a response quickly. In this final section, we'll explore asynchronous message-passing systems. These systems blend characteristics of both RPC and databases, using an intermediary called a message broker to temporarily store and forward messages. This method provides several benefits, including improved system reliability and logical decoupling of processes.

Asynchronous Communication

In asynchronous message-passing systems, a client sends a message to another process without waiting for an immediate response. This communication pattern is one-way, meaning the sender does not expect a reply. If a response is needed, it is typically sent on a separate channel. This approach contrasts with RPC, where a response is expected promptly.

Role of Message Brokers

A message broker (also known as a message queue or message-oriented middleware) acts as an intermediary between the sender and recipient. Instead of sending the message directly, the client sends it to the broker, which then delivers it to the appropriate recipient. This intermediary role provides several advantages:

  1. Buffering: the broker can absorb messages when the recipient is unavailable or overloaded, improving reliability.
  2. Redelivery: messages can be redelivered to a process that has crashed, preventing message loss.
  3. Decoupled addressing: the sender doesn't need to know the recipient's IP address or port, which is useful in cloud environments where machines come and go.
  4. Fan-out: one message can be delivered to several recipients.
  5. Logical decoupling: the sender simply publishes messages and doesn't care who consumes them.

Historically, message brokers were dominated by commercial enterprise software from companies like TIBCO, IBM WebSphere, and webMethods. Today, open-source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka are widely used. These brokers vary in delivery semantics and configuration, but generally they work as follows: a producer sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of, or subscribers to, that queue or topic. There can be many producers and many consumers on the same topic. Communication is one-way, but a consumer may itself publish messages to another topic, or to a reply queue that the original sender consumes, enabling a request/response flow when needed.
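A minimal publishing sketch with the pika RabbitMQ client (the broker address and queue name are assumptions):

```python
import json
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="user_events")  # hypothetical queue name

# The broker treats the body as opaque bytes; the encoding is up to us.
message = json.dumps({"name": "John Doe", "event": "signup"}).encode("utf-8")
channel.basic_publish(exchange="", routing_key="user_events", body=message)
connection.close()
```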

Data Encoding in Message-Passing Systems

Data encoding in message-passing systems refers to the process of converting data into a format that can be efficiently transmitted over a network and interpreted correctly by the receiving process. This involves serializing the data into a byte stream and potentially compressing it to optimize network usage.

Importance of Data Encoding

Message brokers typically don't enforce any particular data model: a message is just a sequence of bytes with some metadata, so any encoding format can be used. If the encoding is backward and forward compatible, publishers and consumers can be upgraded independently and in any order, which is exactly the flexibility message-passing systems aim to provide.

Common Data Encoding Formats: JSON, XML, Protobuf, Avro, MessagePack

Real-World Message-Passing Systems

RabbitMQ