Find here a vivid and enthusiastic plea for Open and (human) Readable file formats (the discussion spawned from one on XML) by Robert Brown. I ripped this (with his permission) from the Beowulf mailing list archives.
On Tue, 14 Oct 2003, Dale Harris wrote:

> On Wed, Oct 01, 2003 at 10:33:29AM -0400, Robert G. Brown elucidated:
> >
> > <?xml version="1.0"?>
> > <sensors>
> > <cpu_temperature id="0" units="C">54.2</cpu_temperature>
> >
> >
> You know... one problem I see with this, assuming this information is
> going to pass across the net (or did I miss something). Is that instead
> of passing something like four bytes (ie "54.2"), you are going to be
> passing 56 bytes (just counting the cpu_temp line). So the XML blows up
> a little bit of data 14 times. I can't see this being particularly
> efficient way of using a network. Sure, it looks pretty, but seems like
> a waste of bandwidth.

Ah, an open invitation to waste a little more:-)

Permit me to rant (the following can be freely skipped by the rant-averse:-). Note that this is not a flame, merely an impassioned assertion of an admittedly personal religious viewpoint. Like similar rants concerning the virtues of C vs C++ vs Fortran vs Java or Python vs Perl, it is intended to amuse or possibly educate, but doubtless won't change many human minds.

<rant>

This is an interesting question and one I kicked around a long time when designing xmlsysd. Of course it is also a very longstanding issue -- as old as computers or just about. Binary formats (with need for endian etc translation) are obviously the most efficient but are impossible to read casually and difficult to maintain or modify. Compressed binary (or binary that only uses e.g. one bit where one bit will do) the most impossible and most difficult.

Back in the old days, memory and bandwidth on all computers were precious and rare things. ALL programs tended to use one bit where one bit was enough. Entire formats with headers and metadata and all were created where every bit was parsimoniously allocated out of a limited pool.
Naturally, those allocations proved to be inadequate in the long run so that only a few years ago lilo would complain if the boot partition lay beyond cylinder 1023 because once upon a time somebody decided that 10 bits was all this particular field was ever going to get.

In order to parse such a binary stream, it is almost essential to use a single library to both format and write the stream and to read and parse it, and to maintain both ends at the same time. Accessing the data ONLY occurs through the library calls. This is a PITA. Cosmically. Seriously. Yes, there are many computer subsystems that do just this, but they are nightmarish to use even via the library (which from a practical point of view becomes an API, a language definition of its own, with its own objects and tools for creating them and extracting them, and the need to be FULLY DOCUMENTED at each step as one goes along) and require someone with a high level of devotion and skill to keep them roughly bugfree. For example, if you write your code for single CPU systems, it becomes a major problem to add support for duals, and then becomes a major problem again to add support for N-CPU SMPs. Debugging becomes a multistep problem -- is the problem in the unit that assembles and provides the data, the encoding library, the decoding library (both of which are one-offs, written/maintained just for the base application) or is it in the client application seeking access to the data?

Fortunately, in the old days, nearly all programming was done by professional programmers working for a wage for giant (or not so giant) companies. Binary interfaces were ideal -- they became Intellectual Property >>because<< they were opaque and required a special library whose source was hidden to access the actual binary, which might be entirely undocumented (except via its API library calls).
BECAUSE they were so bloomin' hidden and difficult/expensive to modify, software evolved very, very slowly, breaking like all hell every time e.g. MS Word went from revision 1 to 2 to 3 to... because of broken binary compatibility.

ASCII, OTOH, has the advantage of being (in principle) easy to read. However, it is easy to make it as obscure and difficult to read as binary. Examples abound, but let's pull one from /proc, since the entire /proc interface is designed around the premise that ascii is good relative to binary (although that seems to be the sole thing that the many designers of different subsystems agree on). When parsing the basic status data of an application, one can work through:

  rgb@lilith|T:105>cat /proc/1214/stat
  1214 (pine) S 1205 1214 1205 34816 1214 0 767 0 872 0 22 15 0 0 15 0
  0 0 14510 12034048 1413 4294967295 134512640 137380700 3221217248
  3221190168 4294959106 0 0 134221827 1073835100 3222429229 0 0 17 0 0
  0 22 15 0 0

(which, as you can see, contains the information on the pine application within which I am currently working on my laptop).

What? You find that hard to read? Surely it is obvious that the first field is the PID, the second the application name (inside parens, introducing a second, fairly arbitrary delimiter to parse), the runtime status (which is actually NOT a single character, it can vary) and then... ooo, my. Time to check out man proc, kernel source (/usr/src/linux/fs/proc/array.c) and maybe the procps sources.
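
To make the parsing burden concrete, here is a minimal Python sketch of what a reader of that line has to do. The function name parse_stat_line and the returned dictionary keys are made up for this illustration, and only the first four fields (pid, command name, state, parent pid, as listed in man proc) are named; everything else is left as raw strings rather than guessed at. The one real trick is that the command name may itself contain spaces or parentheses, so the sketch splits on the first "(" and the last ")".

```python
def parse_stat_line(line: str) -> dict:
    """Split one /proc/PID/stat line into its leading fields.

    Only pid, comm, state and ppid are named; the remaining fields are
    returned as a list of raw strings.
    """
    # The command name is wrapped in parentheses and may contain spaces
    # or parentheses itself, so use the first "(" and the LAST ")".
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx])
    comm = line[open_idx + 1:close_idx]
    rest = line[close_idx + 1:].split()
    return {
        "pid": pid,
        "comm": comm,
        "state": rest[0],
        "ppid": int(rest[1]),
        "rest": rest[2:],
    }


# The sample line quoted above, joined back into a single line.
SAMPLE = (
    "1214 (pine) S 1205 1214 1205 34816 1214 0 767 0 872 0 22 15 0 0 15 0 "
    "0 0 14510 12034048 1413 4294967295 134512640 137380700 3221217248 "
    "3221190168 4294959106 0 0 134221827 1073835100 3222429229 0 0 17 0 0 "
    "0 22 15 0 0"
)

if __name__ == "__main__":
    info = parse_stat_line(SAMPLE)
    print(info["pid"], info["comm"], info["state"], info["ppid"])
```

Nothing in the line itself says which number is which; the field names in the sketch come from the documentation, not from the data.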
One does better with:

  rgb@lilith|T:106>cat /proc/1214/status
  Name:   pine
  State:  S (sleeping)
  Tgid:   1214
  Pid:    1214
  PPid:   1205
  TracerPid:      0
  Uid:    1337    1337    1337    1337
  Gid:    1337    1337    1337    1337
  FDSize: 32
  Groups: 1337 0
  VmSize:    11752 kB
  VmLck:         0 kB
  VmRSS:      5652 kB
  VmData:     2496 kB
  VmStk:        52 kB
  VmExe:      2804 kB
  VmLib:      3708 kB
  SigPnd: 0000000000000000
  SigBlk: 0000000000000000
  SigIgn: 8000000008001003
  SigCgt: 0000000040016c5c
  CapInh: 0000000000000000
  CapPrm: 0000000000000000
  CapEff: 0000000000000000

This is an almost human readable view of MUCH of the same data that is in /proc/1214/stat. Of course there is the little ASCII encoded hexadecimal garbage at the bottom that could make strong coders weep (again, without a fairly explicit guide into what every byte or even BIT in this array does, as one sort of expects that there are binary masked values stuck in here). In this case man proc doesn't help -- because this is supposedly "human readable" they don't provide a reference there. Still, some of the stuff that is output by ps aux is clearly there in a fairly easily parseable form.

Mind you, there are still mysteries. What are the four UID entries? What is the resolution on the memory, and are kB x1000 or x1024? What about the rest of the data in /proc/PID/stat (as there are a lot more fields there). What about the contents of /proc/PID/statm? (Or heavens preserve us, /proc/PID/maps)?

Finally, what about other things in /proc, e.g.:

  rgb@lilith|T:119>cat /proc/stat
  cpu  3498 0 2122 239197
  cpu0 3498 0 2122 239197
  page 128909 55007
  swap 1 0
  intr 279199 244817 13604 0 3427 6 0 4 4 1 3 2 2 1436 0 15893 0
  disk_io: (3,0):(15946,11130,257194,4816,109992)
  ctxt 335774
  btime 1066170139
  processes 1261

Again, ASCII yes, but now (count them) there are whitespace, :, (, and ',' separators, and one piece of data (the CPU's index) is a part of a field value (cpu0) so that the entire string "cpu" becomes a sort of separator (but only in one of the lines).
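
As a rough illustration of how many separators one line can demand, here is a Python sketch that takes apart just the disk_io line from the output above. The function name parse_disk_io is invented for the example, and the five numbers in the inner group are kept as an anonymous list because nothing in the line says what they mean.

```python
import re


def parse_disk_io(line: str) -> dict:
    """Parse a 'disk_io: (maj,min):(a,b,c,d,e) ...' line from /proc/stat.

    Returns {(major, minor): [numbers...]}. The meaning of the numbers is
    not encoded in the line, so they stay an unnamed list here.
    """
    # Separator 1: the colon after the label.
    _label, _, body = line.partition(":")
    devices = {}
    # Separators 2-4: parentheses, commas, and a second colon, all nested.
    pattern = r"\((\d+),(\d+)\):\(([\d,]+)\)"
    for major, minor, stats in re.findall(pattern, body):
        devices[(int(major), int(minor))] = [int(x) for x in stats.split(",")]
    return devices


if __name__ == "__main__":
    sample = "disk_io: (3,0):(15946,11130,257194,4816,109992)"
    print(parse_disk_io(sample))
```

A regular expression is needed for this one line, while the neighbouring lines split on whitespace alone.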
An impressive ratio of separators used to field labels. I won't even begin to address the LIVE VILE EVIL of overloading nested data structures nested in sequential, arbitrary separators inside the "values" for a single field, disk_io (or is that disk_io:?)

If this isn't enough for you, consider /proc/net/dev, which has two separators (: and ws) but is in COLUMNS, /proc/bus/pci/devices (which I still haven't figured out) and yes, the aforementioned sensors interface in /proc.

I offer all of the above as evidence of a fairly evil (did you ever notice how evil, live, vile, veil and elvi are all anagrams of one another he asks in a mindless parenthetical insertion to see if you're still awake:-) middle ground between a true binary interface accessible only through library calls (which can actually be fairly clean, if one creates objects/structs with enough mojo to hold the requisite data types so that one can then create a relatively simple set of methods for accessing them) and xml.

XML is the opposite end of the binary spectrum. It asserts as its primary design principle that the objects/structs with the right kind of mojo share certain features -- precisely those that constitute the rigorous design requirements of XML (nesting, attributes, values, etc). There is a fairly obvious mapping between a C struct, a C++ object, and an XMLified table. It also asserts implicitly that whether or not the object tags are chosen to be human readable (nobody insists that the tags encapsulating CPU temperature readings be named <cpu_temperature> -- they could have been just <t>) there MUST be some sort of dictionary created at the same time as the XML implementation. If (very) human readable tags are chosen they are nearly self-documenting, but whole layers of DTD and CSS and so forth treatment of XML compliant markup are predicated upon a clear definition of the tag rules and hierarchy.

Oh, and by its very design XML is highly scalable and extensible.
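
A small Python sketch of that extensibility, using the standard xml.etree.ElementTree module. The two documents are hypothetical: the first is the sensors fragment from the top of this message with a closing </sensors> tag supplied so that it is well-formed, and the second adds a second CPU and an invented <fan_speed> tag. The reader function read_cpu_temperatures is likewise made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical version 1 of a sensors document (closing tag added here).
DOC_V1 = """<?xml version="1.0"?>
<sensors>
  <cpu_temperature id="0" units="C">54.2</cpu_temperature>
</sensors>"""

# Hypothetical version 2: a second CPU and a new, invented fan_speed tag.
DOC_V2 = """<?xml version="1.0"?>
<sensors>
  <cpu_temperature id="0" units="C">54.2</cpu_temperature>
  <cpu_temperature id="1" units="C">51.0</cpu_temperature>
  <fan_speed id="0" units="rpm">3100</fan_speed>
</sensors>"""


def read_cpu_temperatures(xml_text: str) -> dict:
    """Return {cpu id: temperature} for every cpu_temperature element."""
    root = ET.fromstring(xml_text)
    # Elements are looked up by tag name, so tags this reader does not
    # know about are simply never visited.
    return {
        int(node.get("id")): float(node.text)
        for node in root.iter("cpu_temperature")
    }


if __name__ == "__main__":
    print(read_cpu_temperatures(DOC_V1))
    print(read_cpu_temperatures(DOC_V2))
```

The same reader handles both documents unchanged; a positional parser for a whitespace-separated line would have needed editing for either addition.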
Just as one can easily enough add fields into a struct without breaking code that uses existing fields, one can often add tags into an XML document description without breaking existing tags or tag processing code (compare with adding a field anywhere into /proc/stat -- ooo, disaster). This isn't always the case in either case -- sometimes one converts a field in a struct into a struct in its own right, for example, which can do violence to both the struct and an XML realization of it. Still, often one can and when one can't it is usually because you've had a serious insight into the "right" way to structure your data and before the encoding was just plain wrong in some deep way. This happens, but generally only fairly early in the design and implementation process.

Note that XML need not be inefficient in transit. BECAUSE it is so highly structured, it compresses very efficiently. Library calls exist to squeeze out insignificant whitespace, for example (ignored by the parser anyway). I haven't checked recently to see whether compression is making its way into the library, but either way one can certainly compress/decompress and/or encrypt/decrypt the assembled XML messages before/after transmission, if CPU is cheaper to you than network or security is an issue.

I think that it then comes down to the following. XML may or may not be perfect, but it does form the basis for a highly consistent representation of data structures that is NOT OPAQUE and is EASILY CREATED AND EASILY PARSED with STANDARD TOOLS AND LIBRARIES. When designing an XMLish "language" for your data, you can make the same kind of choices that you face in any program. Do you document your code or not? Do you use lots of variable names like egrp1 or do you write out something roughly human readable like extra_group_1?
Do you write your loops so that they correspond to the actual formulae or basic algorithm (and let the compiler do as well as it can with them) or do you block them out to be cache-friendly, insert inline assembler, and so forth to make them much faster but impossible to read or remember even yourself six months after you write them?

Some choices make the code run fast and short but hard to maintain. Other choices make it run slower but be more readable and easier to maintain. In the long run, I think most programmers eventually come to a sort of state of natural economy in most of these decisions; one that expresses their personal style, the requirements of their job, the requirements of the task, and a reflection of their experience(s) coding. It is a cost/benefit problem, after all (as is so much in computing). You have to ask how much it costs you to do something X way instead of Y way, and what the payoff/benefits are, in the long run.

For myself only, years of experience have convinced me that as far as things like /proc or task/hardware monitoring are concerned, the bandwidth vs ease of development and maintenance question comes down solidly in favor of ease of development and maintenance. Huge amounts of human time are wasted writing parsers and extracting de facto data dictionaries from raw source (the only place where they apparently reside). Tools that are built to collect data from a more or less arbitrary interface have to be almost completely rewritten when that interface changes significantly (or break horribly in the meantime).

So the cost is this human time (programmers'), more human time (the time and productivity lost by people who lack the many tools a better interface would doubtless spawn), and the human time and productivity lost due to the bugs the more complex and opaque and multilayered interface generates.
The benefit is that you save (as you note) anywhere from a factor of 3-4 to 10 or more in the total volume of data delivered by the interface. Data organization and human readability come at a price.

But what is the REAL cost of this extra data? Data on computers is typically manipulated in pages of memory, and a page is what, 4096 bytes? Data movement (especially of contiguous data) is also very rapid on modern computers -- you are talking about saving a very tiny fraction of a second indeed when you reduce the message from 56 bytes to 4 bytes. Even on the network, on a 100BT connection one is empirically limited by LATENCY on messages less than about 1000 bytes in length. So if you ask how long it takes to send a 4 byte packet or a 56 byte packet (either one of which is TCP encapsulated inside a header that is comparable to or longer than the data) the answer is that they take essentially the same amount of time (the extra 52 bytes cost about four microseconds on the wire, well below the latency).

If the data in question is truly a data stream -- a more or less continuous flow of data going through a channel that represents a true bottleneck, then one should probably use a true binary representation to send the data (as e.g. PVM or MPI generally do), handling endian translation and data integrity and all that. If the data in question is a relatively short (no matter how it is wrapped and encoded) and intermittent source -- as most things like a sensors interface, the proc interface(s) in general, the configuration file of your choice, and most net/web services are, arguably -- then working hard to compress or minimally encapsulate the data in an opaque form is hard to justify in terms of the time (if any) that it saves, especially on networks, CPUs, and memory that are ever FASTER.

If it doesn't introduce any human-noticeable delay, and the overall load on the system(s) in question remains unmeasurably low (as was generally the case with e.g. the top command ten Moore's Law years or more ago) then why bother?
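
The arithmetic behind that comparison can be sketched in a few lines of Python. Everything below is a simplifying assumption rather than a measurement: one payload per TCP segment, IPv4 and TCP headers without options (20 bytes each), a plain Ethernet frame (14-byte header, 4-byte frame check sequence, 64-byte minimum frame), and a 100 Mbit/s link. Preamble, inter-frame gap, ACKs and host latency are ignored.

```python
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
ETH_FCS = 4           # frame check sequence
ETH_MIN_FRAME = 64    # minimum frame size, header and FCS included
IPV4_HEADER = 20      # assumes no IP options
TCP_HEADER = 20       # assumes no TCP options
LINK_BITS_PER_SEC = 100_000_000  # 100BT


def frame_bytes(payload: int) -> int:
    """Bytes in the Ethernet frame carrying `payload` bytes of TCP data."""
    raw = ETH_HEADER + IPV4_HEADER + TCP_HEADER + payload + ETH_FCS
    # Short frames are padded up to the Ethernet minimum.
    return max(raw, ETH_MIN_FRAME)


def wire_time_us(payload: int) -> float:
    """Microseconds needed to clock the frame onto the link."""
    return frame_bytes(payload) * 8 / LINK_BITS_PER_SEC * 1e6


if __name__ == "__main__":
    for payload in (4, 56):
        print(payload, frame_bytes(payload), round(wire_time_us(payload), 2))
```

Under these assumptions the 4-byte reading travels in a 64-byte frame and the 56-byte XML line in a 114-byte frame, so a factor of 14 in payload shrinks to less than a factor of 2 on the wire. The difference in wire time is a few microseconds, to be compared with whatever per-message latency you actually measure on your own network.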
I think (again noting that this is my own humble opinion:-) that there is no point. /proc should be completely rewritten, probably by being ghosted in e.g. /xmlproc as it is ported a little at a time, to a single, consistent, well documented xmlish format. procps should similarly be rewritten in parallel with this process, as should the other tools that extract data from /proc and process it for human or software consumption. Perhaps experimentation will determine that there are a FEW places in /proc where the extra overhead of parsing xml isn't acceptable for SOME applications -- /proc/pid/stat for example. In those few cases it may be worthwhile to make the ghosting permanent -- to provide an xmlish view AND a binary or minimal ASCII view, as is done now, badly, with /proc/pid/stat and /proc/pid/status.

This is especially true, BTW, in open source software, where a major component of the labor that creates and maintains both low level/back end service software and high level/front end client software is unpaid, volunteer, part time, and of a wide range of skill and experience. Here the benefits of having a documented, rigorously organized, straightforwardly parsed API layer between tools are the greatest.

Finally, to give the rotting horse one last kick, xmlified documents (deviating slightly from APIs per se) are ideal for archival storage purposes. Microsoft is being scrutinized now by many agencies concerned about the risks associated with having 90% of our vital services provided by an operating system that has proven in practice to be appallingly vulnerable. Their problem has barely begun. The REAL expense associated with using Microsoft-based documents is going to prove in the long run to be the expense of de-archiving old proprietary-binary-format documents long after the tools that created them have gone away.
This is a problem worthy of a rant all by itself (and I've written one or two in other venues) but it hasn't quite reached maturity as it requires enough years of document accumulation and toplevel drift in the binary "standard" before it jumps out and slaps you in the face with six and seven figure expenses. XMLish documents (especially when accompanied by a suitable DTD and/or data dictionary) simply cannot cost that much to convert because their formats are intrinsically open.

</rant>

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:email@example.com
Last modified: 25 April, 2003