Sunday 20 July 2008

Alternative EDI Formats Part II – JSON & Protocol Buffers

In the previous post I wrote about how a large amount of EDI (that is, Electronic Data Interchange in the widest sense) is done using CSV formats rather than a strict, formalised standard. Now Google has released details of how they handle server-to-server/program-to-program message interchange using Protocol Buffers. You won't see the term EDI anywhere on Google, but then the term doesn't have a sexy web 2.0 image.

Google rejected the use of XML. I am all for that. To be fair, I think this has more to do with the desire for a binary format for super fast, super scalable encoding and decoding. Inter-company EDI is universally text based. I can't see that changing.

The first thing I noticed about the .proto files is their similarity to JSON. Their use seems to have pre-dated the popularisation of JSON. In other areas I have seen Google use YAML for similar definition purposes.

The .proto files are not message files. They are never sent as part of a message. They are compiled into the code that handles messages in the format they define.
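For completeness, that compile step is a single invocation of Google's protoc compiler. The file name and output language below are just an example; something like

protoc --java_out=. addressbook.proto

would generate Java classes for every message defined in addressbook.proto.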

Now this struck me, because I think this is one area where CSV beats traditional EDI standards. That first row of column headings acts as the file definition. If a trading partner adds new columns (or removes columns, or moves columns) the next time he sends the same type of message, it doesn't matter. We don't need to agree anything beforehand. The receiver can identify which cell holds which piece of information by locating the column heading position.

Stripping the .proto example down to equate it with our first simplified JSON message data from the previous post, we get the following,

[['Jodie Foster',1,'jfoster@silence.com','555-1234'],
['Sigourney Weaver',2,'sweaver@alien.org','555-9876'],
['Drew Barrymore',3,'dbarrymore@angel.net','555-2468']]

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
repeated PhoneNumber phone = 4;
}


The field list is in a [Modifier – Type – Field Name – Sequence] format. Modifier and Type wouldn't make much sense in JSON, which is not restrictive in its type usage. Incorporating the sequence number into our JSON definition section, though, gives us a useful ability.

{'definition':{'name':0,'id':1,'email':2}}

MessageObject.definition['name'] returns 0
Or,
MessageObject.definition.name returns 0

MessageObject.data[0][MessageObject.definition['name']] returns Jodie Foster

Now we have the same ability to cope with our trading partners adding, moving and removing fields without the format losing its meaning.
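To see what that buys us, here is a sketch using a hypothetical message in which a trading partner has dropped the id field and moved email to the front. Because the lookup goes through the definition section carried in the message, the receiving code does not change.

// Hypothetical message: this sender reordered the fields and dropped 'id'.
var ReceivedMessage = {'definition':{'email':0,'name':1},
'data':[['jfoster@silence.com','Jodie Foster']]};

ReceivedMessage.data[0][ReceivedMessage.definition['name']]   // returns Jodie Foster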

<aside> Did you notice Google started numbering at 1 and not 0? What is that about? That is Muggle thinking! </aside>

What happens when we expand the phone field into a sub-table like before? On its own this sub-table would have a definition of,

{'phonenumber':0,'type':1}

but we can't just slot this in to replace the existing phone field definition because we would lose the positional data. What Protocol Buffers does is list the definitions separately.
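In the .proto file itself this is handled with a nested message definition. Stripped down in the same way as before (Google's published example actually declares the phone type as an enum; a plain string is used here to keep the sketch simple), it looks roughly like this,

message PhoneNumber {
required string number = 1;
optional string type = 2;
}

Mirroring that in our JSON format, with each definition listed separately, gives,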

{'definition':{'person':{'name':0,'id':1,'email':2,'phone':3},
'phone':{'phonenumber':0,'type':1}},
'data':[['Jodie Foster',1,'jfoster@silence.com',
[['555-1234','home'],
['555-777','mobile'],
['555-1235','fax']]],
['Sigourney Weaver',2,'sweaver@alien.org',
[['555-9876','home'],
['555-0101','office']]],
['Drew Barrymore',3,'dbarrymore@angel.net',
[['555-2468','home']]]]}

a=MessageObject.definition['person']['phone']
b=MessageObject.definition['phone']['phonenumber']
c=MessageObject.definition['phone']['type']
MessageObject.data[0][a][2][b]
returns 555-1235
MessageObject.data[0][a][2][c]
returns fax

In this way the sender can omit any fields they like, and the field sequence is no longer important. The receiver can still parse the message and extract the data segments. The message file size is kept to a minimum.
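A small generic lookup helper makes that concrete. This is only a sketch of my own (the field function is not part of Protocol Buffers or of JSON); it works against the message object shown above and simply returns undefined for any field the sender's definition omits.

function field(message, table, row, name) {
  // Look the named field up via the definition section carried in the message itself.
  var position = message.definition[table][name];
  return position === undefined ? undefined : row[position];
}

var person = MessageObject.data[0];
var phones = person[MessageObject.definition['person']['phone']];
field(MessageObject, 'person', person, 'email');   // returns jfoster@silence.com
field(MessageObject, 'phone', phones[2], 'type');  // returns fax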

This is not JSONML (although that is interesting in its own right). This is about efficiently transporting a (potentially large) list of data objects of the same type.

Thursday 17 July 2008

Alternative EDI Formats Part I – CSV & JSON

I have been meaning to make this post for a long time; then Google came along with Protocol Buffers and the world moved on. So in this post I am going to outline how CSV files are used and how I thought JSON would be an improvement. In another post I will write about what I think can be learnt from Protocol Buffers.

A lot of data is communicated from machine to machine in CSV format. It might not be strict EDI, but it is electronic data interchange. It almost feels like an uncomfortable little secret no one likes to talk about (OK, I admit it. I am trying to avoid the Elephant cliché).

To show what I mean, look at the number of results Google returns for these keyword searches. I know it isn't an accurate measure (compare Tradacom with Tradacom & EDI !?!?) but it is indicative.

Keywords          Number of Google Links
X12               46,400,000
X12 & EDI         295,000
EDIFACT           802,000
EDIFACT & EDI     241,000
Tradacom          4,410
Tradacom & EDI    5,350
XML               650,000,000
XML & EDI         451,000
JSON              8,680,000
JSON & EDI        21,000
CSV               52,400,000
CSV & EDI         1,040,000


Note that CSV outranks all the other terms when combined with EDI. It even outranks the unqualified EDIFACT search - EDIFACT being the UN standard for EDI.

Why? Well, CSV is easy. It is human readable. It can be output from spreadsheet programs. Most of all, its columns and rows closely resemble the way data is stored in RDBMS tables, which is where most EDI data ends up.

Taking inspiration from Google's Protocol Buffers example, an address book could be represented as follows…

name,id,email,phone
Jodie Foster,1,jfoster@silence.com,555-1234
Sigourney Weaver,2,sweaver@alien.org,555-9876
Drew Barrymore,3,dbarrymore@angel.net,555-2468

All the programmer needs is a ‘splitting’ function to slice the file up, first by carriage returns, then by commas.
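A minimal sketch of such a function, written in JavaScript and assuming a well-behaved file with no quoted fields or embedded commas, might be:

function splitCsv(text) {
  // Split into rows on line breaks, then split each row on commas.
  return text.trim().split(/\r?\n/).map(function (line) {
    return line.split(',');
  });
}

var rows = splitCsv('name,id,email\nJodie Foster,1,jfoster@silence.com');
// rows[0] is the heading row ['name','id','email']; rows[1][0] is 'Jodie Foster'.

In JSON format this same data may be represented as follows…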

[{'name':'Jodie Foster','id':1,'email':'jfoster@silence.com','phone':'555-1234'},
{'name':'Sigourney Weaver','id':2,'email':'sweaver@alien.org','phone':'555-9876'},
{'name':'Drew Barrymore','id':3,'email':'dbarrymore@angel.net','phone':'555-2468'}]

MessageObject[0].name returns Jodie Foster

However the file size has just ballooned. To overcome this, it could be represented in JSON another way to produce a much smaller file…

{'definition':['name','id','email','phone'],
'data':[['Jodie Foster',1,'jfoster@silence.com','555-1234'],
['Sigourney Weaver',2,'sweaver@alien.org','555-9876'],
['Drew Barrymore',3,'dbarrymore@angel.net','555-2468']]}

MessageObject.definition[0] returns name
MessageObject.data[0][0] returns Jodie Foster
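If the receiver would rather work with the first, name-keyed form, rebuilding it from this compact form takes only a few lines. A JavaScript sketch, assuming the message text has already been parsed into MessageObject:

var records = MessageObject.data.map(function (row) {
  var record = {};
  MessageObject.definition.forEach(function (fieldName, position) {
    record[fieldName] = row[position];
  });
  return record;
});
// records[0].name is 'Jodie Foster', just as in the first representation.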

Now suppose Ms Foster is good enough to give us her mobile and fax numbers as well. The ‘phone’ field becomes a list. For the CSV file, another delimiter is needed.

name,id,email,phone
Jodie Foster,1,jfoster@silence.com,555-1234/555-777/555-1235
Sigourney Weaver,2,sweaver@alien.org,555-9876
Drew Barrymore,3,dbarrymore@angel.net,555-2468

But what if we want to hold phone number type as well (home, mobile, office, fax etc.)? We have 3 options…
1. add another field, also sub-delimited, where the sequencing matches the other field. 555-1234/555-777/555-1235,home/mobile/fax
2. turn the ‘phone’ field into a compound field. 555-1234]home/555-777]mobile/555-1235]fax. The column heading becomes phone/type.
3. create a separate table for the phone fields. Rows in this new table need an identifier linking them back to rows in the original table.

At this point the CSV format is beginning to creak. Beyond one nested table, options 1 & 2 will require ever more delimiters. So let us concentrate on option 3. In isolation this new sub-table would look like this,

id,phone,type
1,555-1234,home
1,555-777,mobile
1,555-1235,fax
2,555-9876,home
2,555-0101,office
3,555-2468,home

These files can be sent separately. If they are to be combined into one message then we need some way of indicating which table each row belongs to. Typically this is done by reserving the first column. In this example it could contain phoneheader-definition, phoneheader-data, phonedetail-definition or phonedetail-data.
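To make that concrete, a single combined message using the reserved first column might look something like this (purely an illustration of the idea; the tags are just the ones suggested above):

phoneheader-definition,name,id,email
phoneheader-data,Jodie Foster,1,jfoster@silence.com
phoneheader-data,Sigourney Weaver,2,sweaver@alien.org
phoneheader-data,Drew Barrymore,3,dbarrymore@angel.net
phonedetail-definition,id,phone,type
phonedetail-data,1,555-1234,home
phonedetail-data,1,555-777,mobile
phonedetail-data,1,555-1235,fax
phonedetail-data,2,555-9876,home
phonedetail-data,2,555-0101,office
phonedetail-data,3,555-2468,home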

How would we represent this in our JSON format?

{'definition':['name','id','email',['phone','type']],
'data':[['Jodie Foster',1,'jfoster@silence.com',
[['555-1234','home'],
['555-777','mobile'],
['555-1235','fax']]],
['Sigourney Weaver',2,'sweaver@alien.org',
[['555-9876','home'],
['555-0101','office']]],
['Drew Barrymore',3,'dbarrymore@angel.net',
[['555-2468','home']]]]}

MessageObject.definition[3][0] returns phone
MessageObject.data[0][3][2][0] returns 555-1235
MessageObject.data[0][3][2][1] returns fax

While this encodes and represents the same message, is it better than CSV?
It is more extensible, slightly bigger, probably just as human readable, and probably just as machine readable. I already thought JSON was a good candidate to be the next CSV for EDI. In the next post I will write about how, taking inspiration from Google's Protocol Buffers, I think it can be improved further.

Wednesday 9 July 2008

Green Coffee XML

I am not kidding (pdf). Some might think this is great. Some might think it shows how wonderful XML is. I don't. To me it represents a lot of what is mixed up about EDI (Electronic Data Interchange). I want to make 2 points...



What is so special about Green Coffee that it needs its own schema?

  • Well, reading the docs, it seems coffee dealers are a bit fussy about defining when ownership of the product and ownership of the risk (associated with product delivery) are transferred. So they have 9 different order types.
  • As well as the buyer and seller, they need to be precise about the Broker and the Shipper.
  • The quality of the product is defined by a standard and is reflected in the product codes.
  • Pricing can be by formula.
  • Unit of measure is usually Kgs, but when it comes to weighing coffee it seems to be important who does the weighing, when, and who pays for it. I count 8 weighing types.
  • The journey coffee makes can be long and the value of the coffee at different stages changes so it seems the "place of tender" is important. A simple "delivery date" is not precise enough and must be qualified.

Phew! Complicated. But excuse me, is any one of these points unique to coffee? Maybe the combination is unique. Maybe it is more sophisticated than Acme retail EDI. But what does it gain us to reject all that has gone before in 60 years of EDI and create new EDI ghettos?


I hope they didn't. I hope they just defined some extra tags and specified some extra attribute values, and added them on to some existing, already utilised and proven XML order standard. Which brings me to my next point.


How (for the love of coffee!) can I implement this?


I went in search of the technical details. The PDF document listed 4 XML appendices on the contents page. They seem to be missing from the web. I went to the root URL and clicked around. I couldn't even find my way back to the document. I used Google to search the site - zilch. I used Google to search the web for "Green Coffee XML" - no luck.

How can you expect a schema to be used if you won't tell anybody the details? If you want it to succeed, make it freely available! Have you not heard of Peer Review?