Saturday, December 22, 2018

Data Encoding


In this article, we will look at several data encoding formats, including JSON, XML, Thrift and Avro, and discuss the pros and cons of each.

The encoding formats mentioned above are standard formats. However, most high-level languages also come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable and Python has pickle. These language-specific encodings have a number of problems (illustrated by the short pickle sketch after this list):


  1. The encoding is tied to a particular programming language. If the two communicating systems are written in different languages, this approach will not work.
  2. To restore the data as objects of the same types, the decoding process needs to be able to instantiate arbitrary classes. This is a security problem: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes.
  3. Serializing and deserializing an object this way is not very efficient in terms of CPU utilisation.
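
As a minimal illustration of the first two points, here is a Python sketch using pickle (the Employee class and its fields are hypothetical, chosen to match the examples later in this article). The pickled byte stream records which Python class to instantiate, which is exactly why it is both language-specific and unsafe for untrusted input.

import pickle

class Employee:
    def __init__(self, name, employee_id):
        self.name = name
        self.employee_id = employee_id

# Encode an in-memory object into a Python-specific byte sequence.
data = pickle.dumps(Employee("peter", 1234))

# The byte stream names the class to construct, so decoding instantiates
# Employee again -- decoding an attacker-controlled byte sequence could
# instantiate arbitrary classes instead.
restored = pickle.loads(data)
print(type(restored).__name__, restored.name, restored.employee_id)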

JSON and XML encoding

  • JSON and XML are textual formats, so they are human readable. However, XML cannot distinguish between a number and a string that happens to contain only digits, and JSON does not distinguish between integers and floating-point numbers.

  • Both JSON and XML have optional schema support, but it is quite complicated. Use of XML schemas is quite common, though.

  • There is no support for binary strings in JSON and XML. Of course, you can encode binary data using Base64 and transmit it that way, but Base64 encoding increases the size of the data by about 33% (every 3 bytes become 4 characters).
  • Both formats are quite verbose. JSON is less verbose than XML, but both still use a lot of space compared to binary formats.
  • When data is transferred, both the field names and the values are transferred in every document.
    For example, consider the JSON below.
    {
      "name": "peter",
      "employeeId": 1234,
      "interests": ["blogging", "hacking"]
    }
    If a million such documents are transferred, the field names (name, employeeId and interests) are repeated in every one of them. A binary format carries neither the field names nor the surrounding whitespace, whereas the textual format transmits both. The sketch below makes this overhead concrete.
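
Here is a minimal Python sketch of the two size problems above (the exact byte counts depend on the data, so treat them as illustrative):

import base64
import json

record = {
    "name": "peter",
    "employeeId": 1234,
    "interests": ["blogging", "hacking"],
}

# Compact JSON (no extra whitespace) still repeats the field names in every document.
compact = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(compact))                          # 69 bytes for this record

# The quoted field names alone account for a noticeable share of those bytes.
print(sum(len(key) + 2 for key in record))   # 29 bytes of quoted field names

# Base64 grows binary data by about a third (every 3 bytes become 4 characters).
blob = bytes(range(30))
print(len(blob), len(base64.b64encode(blob)))   # 30 -> 40
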
Apache Thrift
   
Apache Thrift was originally developed at Facebook and was later open-sourced. It requires a schema for any data that is encoded. For example:

struct Employee {
  1: required string name,
  2: required i64 employeeId,
  3: optional list<string> interests
}

Thrift comes with a code generation tool that produces classes implementing the schema, so application code can use the generated classes to encode and decode records.

The big difference compared to JSON and XML is that there are no field names in the encoded data. Instead, each field is identified by its field tag (1, 2, 3) from the schema. A tag is a compact way of saying which field we are talking about, without having to spell out the field name.

For example, the JSON document above takes 69 bytes when encoded compactly in JSON. The same record encoded with Thrift's BinaryProtocol takes 55 bytes.
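
Here is a sketch of that workflow in Python. It assumes the schema above is saved as employee.thrift and compiled with the Thrift compiler (thrift --gen py employee.thrift); the employee.ttypes module path follows the compiler's default output layout and may differ in your project.

# Generated once, outside the program:
#   thrift --gen py employee.thrift
from thrift.TSerialization import serialize, deserialize
from thrift.protocol.TBinaryProtocol import TBinaryProtocolFactory
from employee.ttypes import Employee   # generated class (assumed module path)

emp = Employee(name="peter", employeeId=1234,
               interests=["blogging", "hacking"])

# Encode with the binary protocol; only field tags, not field names, go on the wire.
encoded = serialize(emp, TBinaryProtocolFactory())
print(len(encoded))   # 55 bytes for this record

# Decode back into a fresh instance of the generated class.
decoded = deserialize(Employee(), encoded, TBinaryProtocolFactory())
print(decoded.name, decoded.employeeId, decoded.interests)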

Apache Avro

Apache Avro is another binary encoding format that is interestingly different from Thrift.  Avro also uses a schema to specify the structure of the data being encoded.

Example schema in Avro IDL:

record Employee {
  string name;
  long employeeId;
  union { null, array<string> } interests = null;
}

Note that there are no tag numbers in the schema (in the Thrift example, the tag numbers were 1, 2, 3). As a result, the encoded data is the most compact of all the encodings we have discussed so far.

When an application wants to encode data, it encodes the data using whatever version of the schema it knows about; this is known as the writer's schema. When an application wants to decode data, it expects the data to be in some schema, which is known as the reader's schema.
The key idea here is that the reader's and writer's schema don't need to be the same - they only need to be compatible.
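
Here is a minimal sketch of this idea in Python. It assumes the fastavro package and restates the schema above in Avro's JSON schema notation; the extra department field in the reader's schema is purely hypothetical, added to show that a compatible (rather than identical) schema is enough.

import io
import fastavro

# Writer's schema: whatever version the encoding application knows about.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "employeeId", "type": "long"},
        {"name": "interests",
         "type": ["null", {"type": "array", "items": "string"}], "default": None},
    ],
})

# Reader's schema: a newer, compatible version with an extra field and a default.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "employeeId", "type": "long"},
        {"name": "interests",
         "type": ["null", {"type": "array", "items": "string"}], "default": None},
        {"name": "department", "type": "string", "default": "unknown"},
    ],
})

record = {"name": "peter", "employeeId": 1234,
          "interests": ["blogging", "hacking"]}

# Encode with the writer's schema; neither field names nor tags end up in the bytes.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, record)
print(len(buf.getvalue()))   # a few dozen bytes for this record

# Decode with a different but compatible reader's schema.
buf.seek(0)
decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)   # department is filled in from its default value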

Merits of Schemas

Although textual data formats such as JSON and XML are widespread, binary encodings based on schemas are also a viable option. They have a number of nice properties:


  1. They can be much more compact than the various binary JSON variants, since they omit the field names from the encoded data.
  2. The schema is a valuable form of documentation. Because the schema is required for decoding, you can be sure that it is up to date.
  3. Keeping a history of schemas allows you to check forward and backward compatibility of schema changes.
  4. For statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.