Sunday 20 July 2008

Alternative EDI Formats Part II – JSON & Protocol Buffers

In the previous post I wrote how a large amount of EDI (that is Electronic Data Interchangein the widest sense) is done, without using a strict formalised standard, using CSV formats. Now Google has released details of how they execute server-to-server/program-to-program message interchange using Protocol Buffers. You won’t see the term EDI any where on Google but then the term doesn’t have a sexy web 2.0 image.

Google rejected the use of XML. I am all for that. To be fair, I think this is more to do with the desire for a binary format for super fast, supper scalable encoding and decoding. Inter-company EDI is universally text based. I can’t see that changing.

The first thing I noticed about the .proto files is their similarity to JSON. Their use seems to have pre-dated the popularisation of JSON. In other areas I have seen Google use YAML for similar definition purposes.

The .proto files are not message files. They are not sent as part of a message, ever. They are used to automatically compile programs to handle messages in the format defined by these files.

Now this struck me, because I think this is one area where CSV beats traditional EDI standards. That first row, of column headings, is like the file definition. If a trading partner adds new columns (or removes columns, or moves columns) the next time he sends the same type of message, it doesn’t matter. We don’t need to agree beforehand. The reciever can identify which cell is which piece of information by locating the column heading position.

Stripping the .proto example down to equate it with our first simplified JSON message data from the previous post, we get the following,

[[‘Jodie Foster’,1,’jfoster@silence.com’,’555-1234’],
[‘Sigourney Weaver’,2,’sweaver@alien.org’,’555-9876’],
[‘Drew Barrymore’,3,’dbarrymore@angel.net’,’555-2468’]]

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
repeated PhoneNumber phone = 4;
}


The field list is in a [Modifier – Type – Field Name – Sequence] format. Modifier and Type wouldn’t make much sense in JSON which is not restrictive in its type usage. Incorporating the sequence number into our JSON definition section gives us a useful ability.

{‘definition’:{‘name’:0,’id’:1,’email’:2}}

MessageObject.definition[‘name’] returns 0
Or,
MessageObject.definition.name returns 0

MessageObject.data[0][MessageObject.definition[‘name’]] returns Jodie Foster

Now we have the same ability to cope with our trading partners adding, moving and removing fields without the format losing its meaning.

<aside> Did you notice Google started numbering at 1 and not 0? What is that about? That is Muggle thinking! </aside>

What happens when we expand the pone field into a sub-table like before? On its own this sub-table would have a definition of,

{‘phonenumber’:0,’type’:1}

but we can't just slot this in and replace the existing phone field definition becuase we would lose the positional data. What Protocol Buffers does is list the definitions separately.

{‘definition’:{‘person’:{‘name’:0,’id’:1,’email’:2,’phone’:3},
'phone':{’phonenumber’:0,'type':1}},
‘data’:[[‘Jodie Foster’,1,’jfoster@silence.com’,
[[’555-1234’,'home'],
['555-777','mobile'],
['555-1235','fax']]],
[‘Sigourney Weaver’,2,’sweaver@alien.org’,
[[’555-9876’,'home'],
['555-0101','office']]],
[‘Drew Barrymore’,3,’dbarrymore@angel.net’,
[[’555-2468’,'home']]]]}

a=MessageObject.definition['person']['phone']
b=MessageObject.definition['phone']['phonenumber']
c=MessageObject.definition['phone']['phonetype']
MessageObject.data[0][a][2][b]
returns 555-1235
MessageObject.data[0][a][2][c]
returns fax

In this way the sender can omit any fields they like and the field sequence is no longer important. The receiver can still parse the message and extract the data segments. The message file size is kept to a minimum. returns

This is not JSONML (althogh that is intresting in it's own right) . This is about efficiently transporting a (potentially large) list of data objects of the same type.