Thursday, 17 July 2008

Alternative EDI Formats Part I – CSV & JSON

I have been meaning to make this post for a long time, then Google came along with Protocol Buffers and the world moves on. So in this post I am going to outline how CSV files are used and how I thought JSON would be an improvement. In another post I will write about what I think can be learnt from Protocol Buffers.

A lot of data is communicated from machine to machine by CSV file format. It might not be strict EDI but it is electronic data interchange. It almost feels like an uncomfortable little secret no one likes to talk about (OK I admit it. I am trying to avoid the Elephant cliché).

To show what I mean, look at the number of responses to these keyword searches on Google. I know it isn't an accurate measure (compare Tradacom with Tradacom & EDI !?!?) but this is just for indicative purposes.













KeywordsNumber of Google Links
X12 46,400,000
X12 & EDI 295,000
EDIFACT 802,000
EDIFACT & EDI 241,000
Tradacom 4,410
Tradacom & EDI 5,350
XML 650,000,000
XML & EDI 451,000
JSON 8,680,000
JSON & EDI 21,000
CSV 52,400,000
CSV & EDI 1,040,000


Note that CSV out ranks all the other terms when combined with EDI. It even out ranks the unqualified EDIFACT search - the ‘UN’ standard for EDI.

Why? Well CSV is easy. It is human readable. It can be output from spreadsheet programs. Most of all, the columns and rows closely resemble the way data is stored in RDBMS tables which is the destination of most EDI data.

Taking inspiration from Google’s Protocol Buffer example, an address book could be represented as follows…

name,id,email,phone
Jodie Foster,1,jfoster@silence.com,555-1234
Sigourney Weaver,2,sweaver@alien.org,555-9876
Drew Barrymore,3,dbarrymore@angel.net,555-2468

All the programmer needs is a ‘splitting’ function to slice the file up, first by carriage returns, then by commas. In JSON format this same data may be represented as follows…

[{‘name‘:’Jodie Foster’,’id’:1,’email’:’jfoster@silence.com’,’phone’:’555-1234’},
{‘name‘:’Sigourney Weaver’, ‘id’:2, ‘email’:’sweaver@alien.org’, ‘phone’:’555-9876’},
{'name‘:’Drew Barrymore’, ‘id’:3, ‘email’:’dbarrymore@angel.net’, ‘phone’:’555-2468’}]

MessageObject[0].name returns Jodie Foster

However the file size has just ballooned. To overcome this, it could be represented in JSON another way to produce a much smaller file…

{‘definition’:[‘name’,’id’,’email’,’phone’],
‘data’:[[‘Jodie Foster’,1,’jfoster@silence.com’,’555-1234’],
[‘Sigourney Weaver’,2,’sweaver@alien.org’,’555-9876’],
[‘Drew Barrymore’,3,’dbarrymore@angel.net’,’555-2468’]]}

MessageObject.definition[0] returns name
MessageObject.data[0][0] returns Jodie Foster

Now suppose Ms Foster is good enough to give us her mobile and fax number in addition. The ‘phone’ field becomes a list. For the CSV file, another delimiter is needed.

name,id,email,phone
Jodie Foster,1,jfoster@silence.com,555-1234/555-777/555-1235
Sigourney Weaver,2,sweaver@alien.org,555-9876
Drew Barrymore,3,dbarrymore@angel.net,555-2468

But what if we want to hold phone number type as well (home, mobile, office, fax etc.)? We have 3 options…
1. add another field, also sub-delimited, where the sequencing matches the other field. 555-1234/555-777/555-1235,home/mobile/fax
2. turn the ‘phone’ field into a compound field. 555-1234]home/555-777]mobile/555-1235,home. The column heading becomes phone/type.
3. create a separate table for the fields. Rows in this new table need a unique identifier to rows in the original table.

At this point the CSV format is beginning to creek. Beyond 1 nested table, options 1 & 2 will require ever more different delimiters. So let us concentrate on option 3. In isolation this new sub-table would look like this,

id,phone,type
1,555-1234,home
1,555-777,mobile
1,555-1235,fax
2,555-9876,home
2,555-0101,office
3,555-2468,home

These files can be sent separately. If they are to be combined into 1 message then we need to indicate in some way what table each row is part of. Typically this is done by reserving the first column. In this example it could contain phoneheader-definition, phoneheader-data, phonedetail-definition, phonedetail-data.

How would we represent this in our JSON format?

{‘definition’:[‘name’,’id’,’email’,[’phone’,'type']],
‘data’:[[‘Jodie Foster’,1,’jfoster@silence.com’,
[[’555-1234’,'home'],
['555-777','mobile'],
['555-1235','fax']]],
[‘Sigourney Weaver’,2,’sweaver@alien.org’,
[[’555-9876’,'home'],
['555-0101','office']]],
[‘Drew Barrymore’,3,’dbarrymore@angel.net’,
[[’555-2468’,'home']]]]}

MessageObject.definition[3][0] returns phone
MessageObject.data[0][3][2][0] returns 555-1235
MessageObject.data[0][3][2][1] returns fax

While this encodes and represents the same message, is it better than CSV?
It is more extendable, it is slightly bigger, it is probably equally as human readable, and probably equally as machine readable. I already thought JSON was a good candidate for being the next CSV for EDI. In the next post I will write about how taking inspiration from Google’s Protocol Buffers, I think it can be improved further.