
from Leonardo Electronic Almanac volume 11, number 5, May 2003


SOFTWARE DEVELOPMENT PLATFORMS FOR LARGE DATASETS: ARTISTS AT THE API

By Brett Stalbaum, beestal@cadre.sjsu.edu

In 1998, C5 had a problem; two problems, actually. That year, we had organized as a business without a model to do a data collection and analysis project at SIGGRAPH 98, called the Remote Control Surveillance Probe project [1]. The impetus for the founding of C5 was to see what kinds of business opportunities were available to a collaborative group of artists and theorists who had already worked for many years with information as our primary medium. The expertise of C5 members was brought under one umbrella to tackle problems in domains relevant to our collective experience, which included autopoietic theory, artificial intelligence, information systems design and programming, public relations, emergent behavioral systems, semiotics, literary criticism, military studies, library science and fine art.

Shortly after organizing, we were invited by Steve Dietz of the Walker Art Center in Minneapolis to do a net-art project related to a work by C5's president, Joel Slayton - *Not to See a Thing.* The project had been exhibited as part of the 1997-98 exhibition, "Alternating Currents: American Art in the Age of Technology," at the San Jose Museum of Art, in collaboration with the Whitney Museum of American Art [2]. The *Not to See a Thing* project collected about 10 gigabytes of information about audience participation with the work during the time it was installed in the SJMA. What Steve Dietz was interested in was how we might hybridize the *Not to See a Thing* data with the infrastructure of the Internet itself to create a net-art project. This in essence created our two problems.

On the one hand, we had a fairly large but still manageable set of biometric data from Slayton's installation, which we had to mingle with the tremendous infrastructure of the Internet itself. And of course we had to find a way to make the manifestation of that data-mingling visible and navigable to the user. Thus the first problem was related to the size of the datasets and the need to develop a strategy for exploring them and exposing something about them. The second problem was that we were faced with two large sets of data that were superficially unrelated to one another. Our efforts culminated in the *16 Sessions* project [3] and the realization of the C5 IP [4] database that Lisa Jevbratt developed to facilitate the mingling between the *Not to See a Thing* data and IP space [5]. This article focuses on the strategies that emerged from these projects and how they inform the question of how artists can and should contribute solutions to these kinds of problems.

I will begin with the scale problem first, because it is the less interesting of the two, and the solution is more obvious. The question is "How do you create a context in which information artists with different experiences and different sets of IT skills can participate in the exploration of and experimentation with large data sets?" We believe it is important to create a context that is amenable to both collaboration and independent endeavor at a variety of interface levels.

Technically, this requires the development of multiple interfaces to the data that are congruent with the experience of the various groups of people who will be working with it. To ensure this, whenever possible, artists should be involved with or completely responsible for the development of the various interfaces. Given that artists today are also computer programmers, database administrators, information architects, engineers and theorists, it is important that the data to be worked with be arranged for maximum access; access that ranges from the raw data (files or a database interface) all the way up to standard user interfaces that heavily mediate access to the data through visualizations at the presentation layer. In between these extremes, artists should have access to all of the APIs [6] and middleware layers and preferably be responsible for the development of these layers.
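The layering described above can be sketched in miniature. Everything here (the data, the function names) is invented for illustration, but it shows the same span of access: raw records, a mediating API, and a highly mediated presentation view of the same dataset.

```python
# Illustrative sketch (all names and data hypothetical): three access
# layers over the same dataset, from raw records up to a presentation view.
import csv
import io

RAW = "site,visits\nA,120\nB,45\nC,300\n"  # stand-in for a raw data file

# Layer 1: raw access -- the artist reads the records directly.
def raw_records(text):
    return list(csv.DictReader(io.StringIO(text)))

# Layer 2: API/middleware -- a query function that mediates access.
def visits_above(records, threshold):
    return [r["site"] for r in records if int(r["visits"]) > threshold]

# Layer 3: presentation -- a heavily mediated, human-readable summary.
def summary(records):
    total = sum(int(r["visits"]) for r in records)
    return f"{len(records)} sites, {total} visits total"

records = raw_records(RAW)
print(visits_above(records, 100))  # ['A', 'C']
print(summary(records))            # 3 sites, 465 visits total
```

An artist comfortable with code can work at layers 1 and 2; one who is not can still explore the data through layer 3 - the point being that all three are exposed, not just the last.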

Working on *16 Sessions* and in subsequent software projects such as *SoftSub* [7], C5 had in place people with experience in all of these layers of software development and, importantly, experience working with each other, so the process was relatively smooth. Of course, this is not the situation with larger sets of institutionally collected data, where the standards, data formats and APIs can often be quite obtuse [8].

Different challenges exist with the emergence of large collections of public data, such as those available from the United States Geological Survey, NASA, NOAA and the Human Genome Project. Such challenges are not only presented by the technical sophistication of the data and the tremendous size of the data, but in strategizing appropriate interfaces to the data that allow users of very diverse backgrounds to participate in the process of consuming the data and generating new knowledge from it.

C5 has been active in this area. For example, the C5 Landscape database is a relational database, Perl API and set of sample interfaces designed specifically to help users in creating their own programs that can easily access, analyze and display information about the shape of the earth [9]. The database is designed to eliminate much of the complexity in acquisition, database interface, processing and imaging common in the manipulation of geo-data, so that artists have a manageable platform in which to write their own software and perform mapping experiments. Artists using the software can work with the database from various levels of technical sophistication. These levels range from a web-based GUI to browse the dataset to the ability to write their own code to access the database directly through SQL, Perl DBI and Java JDBC programming techniques. An API also provides a variety of features and capabilities through easy-to-use Perl modules.
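As an illustration of the kind of direct SQL access described above, here is a minimal sketch using Python's built-in `sqlite3` module. The table schema, column names and coordinates are invented for the example and are not the actual C5 Landscape database schema (which is a relational database with a Perl API); the point is only the shape of the interaction.

```python
# Hypothetical sketch of direct SQL access to elevation data, in the spirit
# of the C5 Landscape database (schema and values invented for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elevation (lat REAL, lon REAL, meters REAL)")
conn.executemany(
    "INSERT INTO elevation VALUES (?, ?, ?)",
    [(37.33, -121.88, 26.0), (37.34, -121.89, 31.0), (36.58, -118.29, 4418.0)],
)

# Ask for the highest point inside a bounding box -- the kind of question
# an artist might pose directly through SQL rather than through a GUI.
row = conn.execute(
    "SELECT lat, lon, MAX(meters) FROM elevation "
    "WHERE lat BETWEEN 36.0 AND 38.0 AND lon BETWEEN -122.0 AND -118.0"
).fetchone()
print(row)  # (36.58, -118.29, 4418.0)
```

The same question could be answered through a web GUI or a higher-level API call; direct SQL is simply the least mediated of the access levels described above.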

There are, of course, many projects that incorporate the idea of artists working with data at all levels. Especially notable are Lisa Jevbratt's *Mapping the Web Infome* [10] and Rhizome's *alt.interface* projects [11]. The *alt.interface* project involves exposing (to artists) the database API of the Rhizome website and its large text object collection, such that they can create alternative interfaces. Jevbratt's web-crawling project is especially notable because of the way that she worked with the invited artists to create both an interface for the "alternative" technical artists involved, as well as working at the database and API levels with many of the artists to collaboratively implement features suggested by artists.

It is appropriate for artists to be involved in the development of the public APIs and application layer interfaces through which the public at large will have access to large data, because in many cases artists working collaboratively already have experience in working out the inherent interface issues that are involved in making data available to "technically diverse" or even non-technical users. Artists in both new-media academia and fine-art practice have been involved in this kind of work for many years.

The second issue is a deeper one, involving how artists have contributed and can contribute to dealing with inter-relations between very different datasets, as well as unexplored intra-relations within single large datasets of considerable complexity. The exploration of large datasets is one of the most provocative and interesting issues for artists today because of the explosion in the number of such datasets being made available to the public.

Why? Artists as cultural workers have always sought to contribute to the state of our knowledge near the edges of human understanding. Among the new cultural problems we face today are the problems of big data. And lest you assume that this is exclusively the domain of computer science, the large datasets of today present new kinds of problems that computers and networks are not traditionally used to solve, and are perhaps even unable to solve.

The familiar notion of the "information-processing life cycle" is the basis of contemporary data-processing. This is the very colonial idea that data is something raw and primitive that needs to be tamed in order to become useful. The notion holds that data must be processed into useful information; to accomplish this, you normally start by considering the output you want and the available input, and then determine the algorithm that will take your raw and untreated data and turn it into a manageable, cognizable, useful thing we call information. The entire field of data mining and knowledge management, as we know it today, is predicated on the pre-existence of semantic models that allow data to be algorithmically mined for meaning. This basic philosophy and approach to data and information is, of course, profoundly successful, but its application reaches severe limitations in dealing with contemporary data and the new kinds of problems it presents.

For example, traditional problem-solving is not at all applicable to the situation C5 faced with *16 Sessions.* We had two very different data sets, and although we had some preconceptions of what they meant, we had no idea how they were related or even if they were related and no clear idea of what kind of question to ask. Neither set of data was collected with a protocol that was designed to facilitate the type of endeavor that we were charged with performing. Again, standard information-processing techniques are not useful for all problems, especially when you do not have a question, when you have a poorly formed question, or when the dataset itself is not entirely understood or contains information potentials that were unplanned at the time it was collected. Data may have non-transparent semantics, or may be so complicated that you do not know where to begin to search, or it may take on new roles as new needs emerge after the data is collected. These issues are of course also related to the problem of what questions to ask. When you don't understand your data, you will naturally have poorly formed questions about it.

Why is this a critical problem? The answer is that there is ever more data being collected in various endeavors about which we do not know what questions to ask. For example, the Human Genome Project has sequenced and published the entire human genome, but that tremendous data set is largely unexplored, because in part, scientists have not sought the answers to questions not yet raised. While this may seem quite tautologically obvious, it is simultaneously a tremendous and real problem. As put by Lisa Jevbratt, the process of exploring genomic data can be "described as that of a group of people in a dark room fumbling around not knowing what is in the room, how the room looks or what they are looking for." Genomic data is not unique in this respect. There are, for example, vast datasets available from the United States and other governments regarding all kinds of interesting things that we do not yet fully understand, or things that we think we understand but which have behavior and relations that have been overlooked. Furthermore, artists, who do not always participate in the scientific method, may well make discoveries or observations in their aesthetic and conceptual pursuits with such data that lead to such questions, even if the artists are participating as blind probe-heads in data space.

The exploration of such data, I argue, is the most productive and culturally useful position from which to perform as an artist in the twenty-first century. It is hard now to make interesting art without pursuing the solution to an interesting problem, and being faced with large sets of data with neither a map nor a clearly defined problem is one of the most interesting and provocative problem-types we face in an era where our ability to collect data outpaces our ability to generate knowledge from it. Asking questions and exploring spaces in poorly defined problem domains consisting of huge datasets is the natural, useful and potentially highly productive cultural role in which artists should play a part.

C5's approach to these types of problems is to explore the application of autopoiesis as a conceptual framework for understanding the behavior of data and information. Autopoiesis takes place in systems that differentiate themselves from other systems on a continual basis through operational closure, and that produce and replace their own components in the process of interaction with their environment (structural coupling). This process occurs via a membrane containing the organization of the unity in question, thus allowing distinction between it and its environment.

A basic question for any analysis of the autopoietic potentials of data involves distinguishing a membrane, or the interface, where operational closure (inside) and structural coupling with an environment (outside) are expressed. It is in patterns of structural coupling that relations between complex data can be analyzed. If you can find a membrane, you have revealed a relation between or within data sets. To find membranes, you need to mingle data. For example, there are contemporary explorations within the social sciences that demonstrate that relations exist between data sets collected for quite disparate reasons. Geographic information systems containing information about the landscape (for example, drainage, land cover or topography) can reveal insights when mingled with historical data [12]. C5 views these types of data-processing explorations as very interesting instances of structural coupling [13] between data sets, even those as superficially different as geological and historical data.
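A toy sketch of such data-mingling, with all values invented for illustration: two superficially unrelated measurements - one landscape variable, one historical variable - are paired by region, and a simple correlation serves as a crude flag that a relation (a candidate "membrane") may be worth exploring further.

```python
# Toy sketch (all data invented): mingling a geo dataset with a historical
# dataset, keyed by region, and using correlation as a crude coupling signal.
from math import sqrt

drainage = {"r1": 0.2, "r2": 0.5, "r3": 0.8, "r4": 0.9}   # landscape variable
settlements = {"r1": 3, "r2": 7, "r3": 12, "r4": 14}       # historical variable

# Mingle: pair the two measurements for every region present in both sets.
keys = sorted(drainage.keys() & settlements.keys())
xs = [drainage[k] for k in keys]
ys = [settlements[k] for k in keys]

# Pearson correlation: a strong value is not a semantic model, only a flag
# that a relation between the sets may be present and worth investigating.
n = len(keys)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
print(round(r, 3))
```

Nothing in this computation knows what "drainage" or "settlements" mean; the coupling, if any, is found in the form of the data alone, which is precisely the point.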

Most of C5's approach to autopoietic frameworks for the understanding of large data has been developed by Joel Slayton and Geri Wittig. Perhaps the key idea that emerges from their work is the notion of a composibility of relations [14], in that composibility indicates the potential for autopoietic membranes existing as data relations via third-order structural coupling in a coded environment. This allows for the analysis of data sets where the semantic relationships are uncertain. In a sense, this idea can be described as the search for algorithms in which superficially different data sets might be shown to couple based on their subject-less form through inherent sans-semantic or pre-semantic models, and to seek these relations specifically to flag the potential for the presence of immanent, unplanned or otherwise unrecognized semantics flowing from mingled relations, thus revealing something about the ontology of the sets that produces new knowledge about them. It is unlikely that there is a universal algorithm for this (such as a universal visualization system for all data), but if there is, it is likely to be accidentally discovered by researchers searching for inter-relations between data sets. Obviously, artists should be involved in this endeavor.

This is only one approach, undertaken by a small, self-funded organization that believes that a very particular theoretical framework can be expressed in coded relations that deliver their own answers. To explore this, we of course need a lot of data. It is important that science organizations create the circumstances that will allow a diversity of independently theorized approaches to emerge based on public interest in and public access to the data [15]. Casting large sets of scientific data into the realm of artists and, indeed, the public at large will allow a multitude of self-organized modes of discovery to develop.


REFERENCES


