Readable and Open File Formats

Here you will find a vivid and enthusiastic plea for Open and (human) Readable file formats by Robert Brown (the discussion spawned from one on XML). I ripped this (with his permission) from the Beowulf mailing list archives.

On Tue, 14 Oct 2003, Dale Harris wrote:

> On Wed, Oct 01, 2003 at 10:33:29AM -0400, Robert G. Brown elucidated:
> > 
> > <?xml version="1.0"?>
> > <sensors>
> >   <cpu_temperature id="0" units="C">54.2</cpu_temperature>
> 
> You know... one problem I see with this, assuming this information is
> going to pass across the net (or did I miss something), is that instead
> of passing something like four bytes (i.e. "54.2"), you are going to be
> passing 56 bytes (just counting the cpu_temp line).  So the XML blows up
> a little bit of data 14 times.  I can't see this being a particularly
> efficient way of using a network.  Sure, it looks pretty, but it seems
> like a waste of bandwidth.

Ah, an open invitation to waste a little more:-)

Permit me to rant (the following can be freely skipped by the
rant-averse:-).  Note that this is not a flame, merely an impassioned
assertion of an admittedly personal religious viewpoint.  Like similar
rants concerning the virtues of C vs C++ vs Fortran vs Java or Python vs
Perl, it is intended to amuse or possibly educate, but doubtless won't
change many human minds.

<rant> 

This is an interesting question and one I kicked around a long time when
designing xmlsysd.  Of course it is also a very longstanding issue -- as
old as computers or just about.  Binary formats (with need for endian
etc translation) are obviously the most efficient but are impossible to
read casually and difficult to maintain or modify.  Compressed binary
(or binary that only uses e.g. one bit where one bit will do) is the
most impossible to read and the most difficult to maintain.  Back in the
old days, memory and
bandwidth on all computers was a precious and rare thing.  ALL programs
tended to use one bit where one bit was enough.  Entire formats with
headers and metadata and all were created where every bit was
parsimoniously allocated out of a limited pool.  Naturally, those
allocations proved to be inadequate in the long run so that only a few
years ago lilo would complain if the boot partition extended past
cylinder 1023, because once upon a time somebody decided that 10 bits
was all this particular field was ever going to get.
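
As a minimal sketch of the failure mode (the struct and field names
here are invented for illustration, not lilo's actual on-disk layout),
consider what a parsimoniously allocated 10-bit field looks like in C,
and what happens when the world outgrows it:

#include <stdio.h>

/* Hypothetical packed record: somebody decided that 10 bits of
   cylinder was all anyone would ever need. */
struct boot_rec {
    unsigned int cylinder : 10;   /* 0..1023 and not one more */
    unsigned int head     :  8;
    unsigned int sector   :  6;
};

int main(void)
{
    struct boot_rec r;
    r.cylinder = 1500;  /* silently truncated: 1500 mod 1024 = 476 */
    printf("stored cylinder: %u\n", r.cylinder);
    return 0;
}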

In order to parse such a binary stream, it is almost essential to use a
single library to both format and write the stream and to read and parse
it, and to maintain both ends at the same time.  Accessing the data ONLY
occurs through the library calls.

This is a PITA.  Cosmically.  Seriously.  Yes, there are many computer
subsystems that do just this, but they are nightmarish to use even via
the library (which from a practical point of view becomes an API, a
language definition of its own, with its own objects and tools for
creating them and extracting them, and the need to be FULLY DOCUMENTED
at each step as one goes along) and require someone with a high level of
devotion and skill to keep them roughly bugfree.  For example, if you
write your code for single CPU systems, it becomes a major problem to
add support for duals, and then becomes a major problem again to add
support for N-CPU SMPs.  Debugging becomes a multistep problem -- is the
problem in the unit that assembles and provides the data, the encoding
library, the decoding library (both of which are one-offs,
written/maintained just for the base application) or is it in the client
application seeking access to the data?

Fortunately, in the old days, nearly all programming was done by
professional programmers working for a wage for giant (or not so giant)
companies.  Binary interfaces were ideal -- they became Intellectual
Property >>because<< they were opaque and required a special library
whose source was hidden to access the actual binary, which might be
entirely undocumented (except via its API library calls).  BECAUSE they
were so bloomin' hidden and difficult/expensive to modify, software
evolved very, very slowly, breaking like all hell every time e.g. MS
Word went from revision 1 to 2 to 3 to... because of binary
incompatibility.

ASCII, OTOH, has the advantage of being (in principle) easy to read.
However, it is easy to make it as obscure and difficult to read as
binary.  Examples abound, but let's pull one from /proc, since the
entire /proc interface is designed around the premise that ASCII is good
relative to binary (although that seems to be the sole thing that the
many designers of different subsystems agree on).  When parsing the
basic status data of an application, one can work through:

rgb@lilith|T:105>cat /proc/1214/stat
1214 (pine) S 1205 1214 1205 34816 1214 0 767 0 872 0 22 15 0 0 15 0 0 0
14510 12034048 1413 4294967295 134512640 137380700 3221217248 3221190168
4294959106 0 0 134221827 1073835100 3222429229 0 0 17 0 0 0 22 15 0 0

(which, as you can see, contains the information on the pine application
within which I am currently working on my laptop).

What?  You find that hard to read?  Surely it is obvious that the first
field is the PID, the second the application name (inside parens,
introducing a second, fairly arbitrary delimiter to parse), the third
the runtime status (which is actually NOT a single character; it can
vary) and
then... ooo, my.  Time to check out man proc, kernel source
(/usr/src/linux/fs/proc/array.c) and maybe the procps sources.
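
As a sketch of what parsing even the first three fields takes in C (the
last-')' trick is the standard defensive move, since the comm field can
itself contain the space delimiter, or even parens):

#include <stdio.h>
#include <string.h>

/* Sketch: pull pid, comm and state out of a /proc/PID/stat line.
   The comm field sits inside parentheses and may contain spaces or
   parens itself, so one scans for the LAST ')' before splitting on
   whitespace. */
int main(void)
{
    char buf[1024], comm[256];
    int pid;
    FILE *fp = fopen("/proc/self/stat", "r");

    if (!fp || !fgets(buf, sizeof(buf), fp))
        return 1;
    fclose(fp);

    char *lp = strchr(buf, '(');    /* start of comm */
    char *rp = strrchr(buf, ')');   /* end of comm -- the last ')' */
    if (!lp || !rp || rp < lp)
        return 1;

    sscanf(buf, "%d", &pid);
    snprintf(comm, sizeof(comm), "%.*s", (int)(rp - lp - 1), lp + 1);

    printf("pid=%d comm=%s state=%c\n", pid, comm, rp[2]);
    return 0;
}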

One does better with:

rgb@lilith|T:106>cat /proc/1214/status
Name:   pine
State:  S (sleeping)
Tgid:   1214
Pid:    1214
PPid:   1205
TracerPid:      0
Uid:    1337    1337    1337    1337
Gid:    1337    1337    1337    1337
FDSize: 32
Groups: 1337 0 
VmSize:    11752 kB
VmLck:         0 kB
VmRSS:      5652 kB
VmData:     2496 kB
VmStk:        52 kB
VmExe:      2804 kB
VmLib:      3708 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 8000000008001003
SigCgt: 0000000040016c5c
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

This is an almost human readable view of MUCH of the same data that is
in /proc/PID/stat.  Of course there is the little ASCII-encoded hexadecimal
garbage at the bottom that could make strong coders weep (again, without
a fairly explicit guide into what every byte or even BIT in this array
does, as one sort of expects that there are binary masked values stuck
in here).  In this case man proc doesn't help -- because this is
supposedly "human readable" they don't provide a reference there.
Still, some of the stuff that is output by ps aux is clearly there in a
fairly easily parseable form.
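
Compare the parsing burden: a sketch of pulling one value out of
/proc/PID/status in C, where a single "Name: value" rule covers every
line of the file:

#include <stdio.h>

/* Sketch: scan /proc/self/status for VmRSS.  One key per line,
   colon separated -- the same simple rule for the whole file. */
int main(void)
{
    char line[256];
    long rss_kb = -1;
    FILE *fp = fopen("/proc/self/status", "r");

    if (!fp)
        return 1;
    while (fgets(line, sizeof(line), fp))
        if (sscanf(line, "VmRSS: %ld kB", &rss_kb) == 1)
            break;
    fclose(fp);

    printf("VmRSS = %ld kB\n", rss_kb);
    return 0;
}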

Mind you, there are still mysteries.  What are the four UID entries?
What is the resolution on the memory, and is a kB x1000 or x1024 bytes?
What about the rest of the data in /proc/PID/stat (as there are a lot
more fields there)?  What about the contents of /proc/PID/statm?  (Or
heavens preserve us, /proc/PID/maps)?

Finally, what about other things in /proc, e.g.:

rgb@lilith|T:119>cat /proc/stat
cpu  3498 0 2122 239197
cpu0 3498 0 2122 239197
page 128909 55007
swap 1 0
intr 279199 244817 13604 0 3427 6 0 4 4 1 3 2 2 1436 0 15893 0
disk_io: (3,0):(15946,11130,257194,4816,109992) 
ctxt 335774
btime 1066170139
processes 1261

Again, ASCII yes, but now (count them) there are whitespace, :, (, and
',' separators, and one piece of data (the CPU's index) is a part of a
field value (cpu0) so that the entire string "cpu" becomes a sort of
separator (but only in one of the lines).  An impressive ratio of
separators to field labels.  I won't even begin to address the LIVE VILE
EVIL of overloading nested data structures, wrapped in sequential,
arbitrary separators, inside the "values" for a single field, disk_io
(or is that disk_io:?).
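
To make the separator zoo concrete, here is roughly the sscanf
contortion that one disk_io entry demands (a sketch against the 2.4-era
format shown above; the field names are my guesses, which is rather the
point):

#include <stdio.h>

/* Sketch: one 2.4-era /proc/stat disk_io entry.  Note the format
   string: parens, commas and colons -- three separator styles for
   a single field's "value". */
int main(void)
{
    const char *line =
        "disk_io: (3,0):(15946,11130,257194,4816,109992)";
    unsigned major, minor, f1, f2, f3, f4, f5;

    if (sscanf(line, "disk_io: (%u,%u):(%u,%u,%u,%u,%u)",
               &major, &minor, &f1, &f2, &f3, &f4, &f5) == 7)
        printf("device (%u,%u): first field %u\n", major, minor, f1);
    return 0;
}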

If this isn't enough for you, consider /proc/net/dev, which has two
separators (: and whitespace) but is in COLUMNS, /proc/bus/pci/devices (which I
still haven't figured out) and yes, the aforementioned sensors interface
in /proc.

I offer all of the above as evidence of a fairly evil (did you ever
notice how evil, live, vile, veil and elvi are all anagrams of one
another, he asks in a mindless parenthetical insertion to see if you're
still awake:-) middle ground between a true binary interface accessible
only through library calls (which can actually be fairly clean, if one
creates objects/structs with enough mojo to hold the requisite data
types so that one can then create a relatively simple set of methods for
accessing them) and xml.

XML is the opposite end of the binary spectrum.  It asserts as its
primary design principle that the objects/structs with the right kind of
mojo share certain features -- precisely those that constitute the
rigorous design requirements of XML (nesting, attributes, values, etc).
There is a fairly obvious mapping between a C struct, a C++ object, and
an XMLified table.  It also asserts implicitly that whether or not the
object tags are chosen to be human readable (nobody insists that the
tags encapsulating CPU temperature readings be named <cpu_temperature>
-- they could have been just <t>) there MUST be some sort of dictionary
created at the same time as the XML implementation.  If (very) human
readable tags are chosen they are nearly self-documenting, but whole
layers of DTD and CSS and so forth treatment of XML compliant markup are
predicated upon a clear definition of the tag rules and hierarchy.
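
The mapping is direct enough to sketch (reusing the sensors example
from the top of the thread; the struct is invented for illustration):

#include <stdio.h>

/* Sketch: struct fields map onto attributes and element content;
   an array of these maps onto a list of sibling elements under a
   single <sensors> parent. */
struct cpu_temp {
    int    id;
    char   units;
    double value;
};

int main(void)
{
    struct cpu_temp t = { 0, 'C', 54.2 };

    printf("<sensors>\n");
    printf("  <cpu_temperature id=\"%d\" units=\"%c\">%.1f"
           "</cpu_temperature>\n", t.id, t.units, t.value);
    printf("</sensors>\n");
    return 0;
}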

Oh, and by its very design XML is highly scalable and extensible.  Just
as one can easily enough add fields into a struct without breaking code
that uses existing fields, one can often add tags into an XML document
description without breaking existing tags or tag processing code
(compare with adding a field anywhere into /proc/stat -- ooo, disaster).
This isn't always possible in either case -- sometimes one converts a
field in a struct into a struct in its own right, for example, which can
do violence to both the struct and an XML realization of it.  Still,
one often can, and when one can't it is usually because you've had a
serious insight into the "right" way to structure your data and the
earlier encoding was just plain wrong in some deep way.  This happens, but
generally only fairly early in the design and implementation process.
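
A sketch of why tag-based parsing survives extension (here with
libxml2; the <fan_speed> element stands in for some later addition to
the format):

#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Sketch: extract <cpu_temperature> and ignore everything else.
   Adding <fan_speed> to the document later breaks nothing here --
   unknown sibling elements are simply skipped. */
int main(void)
{
    const char *msg =
        "<sensors>"
        "<cpu_temperature id=\"0\" units=\"C\">54.2</cpu_temperature>"
        "<fan_speed id=\"0\">4500</fan_speed>"
        "</sensors>";
    xmlDocPtr doc = xmlReadMemory(msg, strlen(msg), "sensors.xml",
                                  NULL, 0);
    if (!doc)
        return 1;

    for (xmlNodePtr n = xmlDocGetRootElement(doc)->children;
         n != NULL; n = n->next) {
        if (n->type == XML_ELEMENT_NODE &&
            !xmlStrcmp(n->name, (const xmlChar *) "cpu_temperature")) {
            xmlChar *v = xmlNodeGetContent(n);
            printf("cpu temperature: %s\n", v);
            xmlFree(v);
        }
    }
    xmlFreeDoc(doc);
    return 0;
}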

Note that XML need not be inefficient in transit.  BECAUSE it is so
highly structured, it compresses very efficiently.  Library calls exist
to squeeze out insignificant whitespace, for example (ignored by the
parser anyway).  I haven't checked recently to see whether compression
is making its way into the library, but either way one can certainly
compress/decompress and/or encrypt/decrypt the assembled XML messages
before/after transmission, if CPU is cheaper to you than network or
security is an issue.
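
For instance, a sketch with zlib (not any particular XML library's
built-in support), compressing an assembled message before it goes on
the wire; for a message this short the win is small, but for a full
/proc-sized dump the repetitive tag structure pays off:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Sketch: deflate the XML before transmission; the receiver
   inflates and parses as usual. */
int main(void)
{
    const char *xml =
        "<sensors><cpu_temperature id=\"0\" units=\"C\">54.2"
        "</cpu_temperature></sensors>";
    uLong  srclen  = strlen(xml) + 1;
    uLongf destlen = compressBound(srclen);
    Bytef  dest[1024];

    if (compress(dest, &destlen, (const Bytef *) xml, srclen) != Z_OK)
        return 1;
    printf("%lu bytes -> %lu bytes on the wire\n",
           (unsigned long) srclen, (unsigned long) destlen);
    return 0;
}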

I think that it then comes down to the following.  XML may or may not be
perfect, but it does form the basis for a highly consistent
representation of data structures that is NOT OPAQUE and is EASILY
CREATED AND EASILY PARSED with STANDARD TOOLS AND LIBRARIES.  When
designing an XMLish "language" for your data, you can make the same kind
of choices that you face in any program. Do you document your code or
not?  Do you use lots of variable names like egrp1 or do you write out
something roughly human readable like extra_group_1?  Do you write your
loops so that they correspond to the actual formulae or basic algorithm
(and let the compiler do as well as it can with them) or do you block
them out to be cache-friendly, insert inline assembler, and so forth to
make them much faster but impossible to read or remember even yourself
six months after you write them?  Some choices make the code run fast
and short but hard to maintain.  Other choices make it run slower but be
more readable and easier to maintain.

In the long run, I think most programmers eventually come to a sort of
state of natural economy in most of these decisions; one that expresses
their personal style, the requirements of their job, the requirements of
the task, and a reflection of their experience(s) coding.  It is a
cost/benefit problem, after all (as is so much in computing).  You have
to ask how much it costs you to do something X way instead of Y way, and
what the payoff/benefits are, in the long run.

For myself only, years of experience have convinced me that as far as
things like /proc or task/hardware monitoring are concerned, the
bandwidth vs ease of development and maintenance question comes down
solidly in favor of ease of development and maintenance.  Huge amounts
of human time are wasted writing parsers and extracting de facto data
dictionaries from raw source (the only place where they apparently
reside).  Tools that are built to collect data from a more or less
arbitrary interface have to be almost completely rewritten when that
interface changes significantly (or break horribly in the meantime).

So the cost is this human time (programmers'), more human time (the time
and productivity lost by people who lack the many tools a better
interface would doubtless spawn), and the human time and productivity
lost due to the bugs the more complex and opaque and multilayered
interface generates.  The benefit is that you save (as you note)
anywhere from a factor of 3-4 to 10 or more in the total volume of data
delivered by the interface.  Data organization and human readability
come at a price.

But what is the REAL cost of this extra data?  Data on computers is
typically manipulated in pages of memory, and a page is what, 4096
bytes?  Data movement (especially of contiguous data) is also very rapid
on modern computers -- you are talking about saving a very tiny fraction
of a second indeed when you reduce the message from 54 bytes to 4 bytes.
Even on the network, on a 100BT connection one is empirically limited by
LATENCY on messages less than about 1000 bytes in length.  So if you ask
how long it takes to send a 4 byte packet or a 54 byte packet (either
one of which is TCP encapsulated inside a header that is longer than the
data) the answer is that they take exactly the same amount of time
(within a few tens of nanoseconds).
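
To put numbers on it: 50 extra bytes is 400 bits, and at 100 Mbps that
is 400/10^8 s = 4 microseconds of extra wire time, buried under a
round-trip latency that on 100BT is of order 100 microseconds.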

If the data in question is truly a data stream -- a more or less
continuous flow of data going through a channel that represents a true
bottleneck, then one should probably use a true binary representation to
send the data (as e.g. PVM or MPI generally do), handling endian
translation and data integrity and all that.  If the data in question is
a relatively short (no matter how it is wrapped and encoded) and
intermittent source -- as most things like a sensors interface, the proc
interface(s) in general, the configuration file of your choice, and most
net/web services are, arguably -- then working hard to compress or
minimally encapsulate the data in an opaque form is hard to justify in
terms of the time (if any) that it saves, especially on networks, CPUs,
memory that are ever FASTER.  If it doesn't introduce any
human-noticeable delay, and the overall load on the system(s) in
question remains unmeasurably low (as was generally the case with e.g.
the top command ten Moore's Law years or more ago) then why bother?

I think (again noting that this is my own humble opinion:-) that there
is no point.  /proc should be completely rewritten, probably by being
ghosted in e.g. /xmlproc as it is ported a little at a time, to a
single, consistent, well documented xmlish format.  procps should
similarly be rewritten in parallel with this process, as should the
other tools that extract data from /proc and process it for human or
software consumption.  Perhaps experimentation will determine that there
are a FEW places in /proc where the extra overhead of parsing xml isn't
acceptable for SOME applications -- /proc/pid/stat for example.  In
those few cases it may be worthwhile to make the ghosting permanent --
to provide an xmlish view AND a binary or minimal ASCII view, as is
done now, badly, with /proc/pid/stat and /proc/pid/status.

This is especially true, BTW, in open source software, where a major
component of the labor that creates and maintains both low level/back
end service software and high level/front end client software is unpaid,
volunteer, part time, and of a wide range of skill and experience.  Here
the benefits of having a documented, rigorously organized,
straightforwardly parsed API layer between tools are the greatest.

Finally, to give the rotting horse one last kick, xmlified documents
(deviating slightly from API's per se) are ideal for archival storage
purposes.  Microsoft is being scrutinized now by many agencies concerned
about the risks associated with having 90% of our vital services
provided by an operating system that has proven in practice to be
appallingly vulnerable.  Their problem has barely begun.  The REAL
expense associated with using Microsoft-based documents is going to
prove in the long run to be the expense of de-archiving old
proprietary-binary-format documents long after the tools that created
them have gone away.  This is a problem worthy of a rant all by itself
(and I've written one or two in other venues) but it hasn't quite
reached maturity as it requires enough years of document accumulation
and toplevel drift in the binary "standard" before it jumps out and
slaps you in the face with six and seven figure expenses.  XMLish
documents (especially when accompanied by a suitable DTD and/or data
dictionary) simply cannot cost that much to convert because their
formats are intrinsically open.

</rant>

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu




