from Leonardo Electronic Almanac volume 11, number 5, May 2003
SOFTWARE DEVELOPMENT PLATFORMS FOR LARGE DATASETS: ARTISTS AT THE API
By Brett Stalbaum, beestal@cadre.sjsu.edu
In 1998, C5 had a problem; two problems, actually.
That year, we had organized as a business without a model in order to do a
data collection and analysis project at SIGGRAPH 98 called the Remote
Control Surveillance Probe project [1]. The impetus for the founding of C5
was to see what kinds of business opportunities were available to a
collaborative group of artists and theorists who had already been working
for many years with information as our primary medium. The expertise of C5
members was brought under one umbrella to tackle problems in domains
relevant to our collective experience, including autopoietic theory,
artificial intelligence, information systems design and programming, public
relations, emergent behavioral systems, semiotics, literary criticism,
military studies, library science and fine art.
Shortly after organizing, we were invited by Steve Dietz of the Walker
Art
Center in Minneapolis to do a net-art project related to a work by
C5's
president, Joel Slayton - *Not to See a Thing.* The project had been
exhibited as part of the 1997-98 exhibition, "Alternating Currents:
American
Art in the Age of Technology," at the San Jose Museum of Art, in
collaboration with the Whitney Museum of American Art [2]. The *Not to
See a
Thing* project collected about 10 gigabytes of information about
audience
participation with the work during the time it was installed in the
SJMA.
What Steve Dietz was interested in was how we might hybridize the *Not to
See a Thing* data with the infrastructure of the Internet itself to create
a net-art project. This, in essence, created our two problems.
On the one hand, we had a fairly large but still manageable set of
biometric
data from Slayton's installation, which we had to mingle with the
tremendous
infrastructure of the Internet itself. And of course we had to find a
way
to make the manifestation of that data-mingling visible and navigable
to the
user. Thus the first problem was related to the size of the datasets
and the
need to develop a strategy for exploring them and exposing something
about
them. The second problem was that we were faced with two large sets of
data
that were superficially unrelated to one another. Our efforts
culminated in
the *16 Sessions* project [3] and the realization of the C5 IP [4]
database
that Lisa Jevbratt developed to facilitate the mingling between the
*Not to
See a Thing* data and IP space [5]. This article focuses on the
strategies
that emerged from these projects and how they inform the matter of how
artists can and should contribute solutions to these kinds of
problems.
I will begin with the scale problem first, because it is the less
interesting of the two, and the solution is more obvious. The question
is
"How do you create a context in which information artists with
different
experiences and different sets of IT skills can participate in the
exploration of and experimentation with large data sets?" We believe it is
important to create a context that is amenable to both collaboration and
independent endeavor at a variety of interface levels.
Technically, this requires the development of multiple interfaces to
the
data that are congruent with the experience of the various groups of
people
who will be working with it. To ensure this, whenever possible,
artists
should be involved with or completely responsible for the development
of the
various interfaces. Given that artists today are also computer
programmers,
database administrators, information architects, engineers and
theorists, it
is important that the data to be worked with be arranged for maximum
access;
access that ranges from the raw data (files or database interface) all
the
way through standard user interfaces that highly mediate access to the
data
through end visualizations at the presentation layer. In between these
extremes, artists should have access to all of the APIs [6] and
middleware
layers and preferably be responsible for the development of these
layers.
Working on *16 Sessions* and in subsequent software projects such as
*SoftSub* [7], C5 had in place people with experience in all of these
layers
of software development and, importantly, experience working with each
other, so the process was relatively smooth. Of course, this is not
the
situation with larger sets of institutionally collected data, where
the
standards, data formats and APIs can often be quite obtuse [8].
Different challenges exist with the emergence of large collections of
public
data, such as those available from the United States Geological
Survey,
NASA, NOAA and the Human Genome Project. Such challenges arise not only
from the technical sophistication and tremendous size of the data, but
also from the need to strategize appropriate interfaces that allow users
of very diverse backgrounds to participate in the process of consuming the
data and generating new knowledge from it.
C5 has been active in this area. For example, the C5 Landscape
database is a
relational database, Perl API and set of sample interfaces designed
specifically to help users in creating their own programs that can
easily
access, analyze and display information about the shape of the earth
[9].
The database is designed to eliminate much of the complexity in
acquisition,
database interface, processing and imaging common in the manipulation
of
geo-data, so that artists have a manageable platform in which to write
their
own software and perform mapping experiments. Artists using the
software can
work with the database from various levels of technical
sophistication.
These levels range from a web-based GUI to browse the dataset to the
ability
to write their own code to access the database directly through SQL,
Perl
DBI and Java JDBC programming techniques. An API also provides a
variety of
features and capabilities through easy-to-use Perl modules.
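The layering described above can be sketched in miniature. The actual C5 Landscape database exposes a Perl API over a relational store; the Python/sqlite3 sketch below, with an invented `elevation` table and an invented `elevation_at` helper, illustrates only the idea of multiple access levels, from raw SQL up through a small API that hides the query language.

```python
# Illustrative sketch only: table name, columns, sample values and the
# helper function are invented; they are not the C5 schema or API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elevation (lat REAL, lon REAL, meters REAL)")
conn.executemany(
    "INSERT INTO elevation VALUES (?, ?, ?)",
    [(37.0, -122.0, 112.0), (37.0, -121.9, 340.5), (37.1, -122.0, 85.2)],
)

# Bottom layer: raw SQL, for artists comfortable with the database itself.
rows = conn.execute(
    "SELECT meters FROM elevation WHERE lat = 37.0 ORDER BY lon"
).fetchall()

# Middle layer: an API call that hides the SQL behind a named function.
def elevation_at(lat, lon):
    """Return the stored elevation nearest the requested coordinate."""
    row = conn.execute(
        "SELECT meters FROM elevation "
        "ORDER BY (lat - ?) * (lat - ?) + (lon - ?) * (lon - ?) LIMIT 1",
        (lat, lat, lon, lon),
    ).fetchone()
    return row[0]

print(rows)                        # [(112.0,), (340.5,)]
print(elevation_at(37.09, -122.0)) # 85.2 (nearest stored point)
```

A GUI or visualization sits above both layers; the point is that no single layer is the "correct" entry point, and artists can work at whichever one matches their skills.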
There are, of course, many projects that incorporate the idea of
artists
working with data at all levels. Especially notable are Lisa
Jevbratt's
*Mapping the Web Infome* [10] and Rhizome's *alt.interface* projects
[11].
The *alt.interface* project exposes the database API of the Rhizome website
and its large collection of text objects to artists so that they can create
alternative interfaces. Jevbratt's web-crawling project stands out for the
way she worked with the invited artists: she created an interface for the
"alternative" technical artists involved, and she also worked at the
database and API levels with many of them to collaboratively implement
features they suggested.
It is appropriate for artists to be involved in the development of the
public APIs and application layer interfaces through which the public
at
large will have access to large data, because in many cases artists
working
collaboratively already have experience in working out the inherent
interface issues that are involved in making data available to
"technically
diverse" or even non-technical users. Artists in both new-media
academia and
fine-art practice have been involved in this kind of work for many
years.
The second issue is a deeper one, involving how artists have
contributed and
can contribute to dealing with inter-relations between very different
datasets, as well as unexplored intra-relations within single large
datasets
of considerable complexity. The exploration of large datasets is one of the
most provocative and interesting issues for artists today because of the
explosion of such datasets being made available to the public.
Why? Artists as cultural workers have always sought to contribute to
the
state of our knowledge near the edges of human understanding. Among
the new
cultural problems we face today are the problems of big data. And lest
you
assume that this is exclusively the domain of computer science, the
large
datasets of today present new kinds of problems that computers and
networks
are not traditionally used to solve, and are perhaps even unable to
solve.
The familiar notion of the "information-processing life cycle" is the
basis
of contemporary data-processing. This is the very colonial idea that
data is
something raw and primitive that needs to be tamed in order to become
useful. The notion holds that data must be processed into useful
information, and to accomplish this you normally start by considering
the
output you want, the available input, and then determine the algorithm
that
will take your raw and untreated data and turn it into a manageable,
cognizable, useful thing we call information. The entire field of data
mining and knowledge management, as we know it today, is predicated on
the
pre-existence of semantic models that allow data to be algorithmically
mined
for meaning. This is the basic philosophy and approach to data and
information and is, of course, profoundly successful, but its
application
reaches severe limitations in dealing with contemporary data and the
new
kinds of problems it presents.
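The life cycle described above can be reduced to a few lines. In this minimal sketch the semantic model, the desired output and the input format are all fixed in advance, so the algorithm is just the mapping between them; the records and field layout are invented for illustration.

```python
# Classic information-processing life cycle: known input format, known
# desired output, algorithm chosen to bridge them. Data is fabricated.
raw = ["2003-05-01,12.4", "2003-05-02,14.1", "2003-05-03,13.0"]

# The pre-existing semantic model: each record is "date,temperature".
readings = [float(line.split(",")[1]) for line in raw]

# The predetermined output: a single summary statistic ("information").
mean_temp = sum(readings) / len(readings)
print(round(mean_temp, 2))  # 13.17
```

Everything here depends on knowing in advance what the data means and what question is being asked; it is exactly this precondition that fails for the problems discussed next.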
For example, traditional problem-solving is not at all applicable to
the
situation C5 faced with *16 Sessions.* We had two very different data
sets,
and although we had some preconceptions of what they meant, we had no
idea
how they were related or even if they were related and no clear idea
of what
kind of question to ask. Neither set of data was collected with a
protocol
that was designed to facilitate the type of endeavor that we were
charged
with performing. Again, standard information-processing techniques are
not
useful for all problems, especially when you do not have a question,
when
you have a poorly formed question, or when the dataset itself is not
entirely understood or contains information potentials that were
unplanned
at the time it was collected. Data may have non-transparent semantics,
or
may be so complicated that you do not know where to begin to search,
or it
may take on new roles as new needs emerge after the data is collected.
These issues are of course also related to the problem of what
questions to
ask. When you don't understand your data, you will naturally have
poorly
formed questions about it.
Why is this a critical problem? The answer is that there is ever more
data
being collected in various endeavors about which we do not know what
questions to ask. For example, the Human Genome Project has sequenced
and
published the entire human genome, but that tremendous data set is
largely
unexplored, in part because scientists have not sought answers to questions
that have not yet been raised. While this may seem tautologically obvious,
it is simultaneously a tremendous and real problem. As put by Lisa
Jevbratt,
the process of exploring genomic data can be "described as that of a
group
of people in a dark room fumbling around not knowing what is in the
room,
how the room looks or what they are looking for." Genomic data is not
unique
in this respect. There are, for example, vast datasets available from
the
United States and other governments regarding all kinds of interesting
things that we do not yet fully understand, or things that we think we
understand but which have behavior and relations that have been
overlooked.
Furthermore, artists, who do not always participate in the scientific
method, may well make discoveries or observations in their aesthetic
and
conceptual pursuits with such data that lead to such questions, even
if the
artists are participating as blind probe-heads in data space.
The exploration of such data, I argue, is the most productive and
culturally
useful position from which to perform as an artist in the twenty-first
century. It is hard now to make interesting art without pursuing the
solution to an interesting problem, and being faced with large sets of data
with neither a map nor a clearly defined problem definition is one of the
most interesting and provocative problem-types we face in an era where our
ability to collect data outpaces our ability to generate knowledge from it.
Asking questions and exploring spaces in poorly defined problem domains
consisting of huge datasets is a natural, useful and potentially highly
productive cultural role for artists to play.
C5's approach to these types of problems is to explore the application
of
autopoiesis as a conceptual framework for understanding the behavior
of data
and information. Autopoiesis takes place in systems that differentiate
themselves from other systems on a continual basis through operational
closure, and that produce and replace their own components in the
process of
interaction with their environment (structural coupling). This process
occurs via a membrane containing the organization of the unity in
question,
thus allowing distinction between it and its environment.
A basic question for any analysis of the autopoietic potentials of
data
involves distinguishing a membrane, or the interface, where
operational
closure (inside) and structural coupling with an environment (outside)
are
expressed. It is in patterns of structural coupling that relations
between
complex data can be analyzed. If you can find a membrane, you have
revealed
a relation between or within data sets. To find membranes, you need to
mingle data. For example, there are contemporary explorations within
the
social sciences that demonstrate that relations exist between data
sets
collected for quite disparate reasons. Global information systems
containing
information about the landscape (for example drainage, land cover or
topography) can reveal insights when mingled with historical data
[12]. C5
views these types of data-processing explorations as very interesting
instances of structural coupling [13] between data sets, even those as
superficially different as geological and historical data.
Most of C5's approach to autopoietic frameworks for the understanding
of
large data has been developed by Joel Slayton and Geri Wittig. Perhaps
the
key idea that emerges from their work is the notion of a composibility
of
relations [14], in that composibility indicates the potential for
autopoietic membranes existing as data relations via third-order
structural
coupling in a coded environment. This allows for the analysis of data
sets
where the semantic relationships are uncertain. In a sense, this idea
can be
described as the search for algorithms in which superficially
different data
sets might be shown to couple based on their subject-less form through
inherent sans-semantic or pre-semantic models, and to seek these
relations
specifically to flag the potential for the presence of immanent,
unplanned
or otherwise unrecognized semantics flowing from mingled relations,
thus
revealing something about the ontology of the sets that produces new
knowledge about them. It is unlikely that there is a universal algorithm
for this (such as a universal visualization system for all data), but if
there is, it is likely to be discovered accidentally by researchers
searching for inter-relations between data sets. Obviously, artists should
be involved in this endeavor.
This is only one approach, undertaken by a small, self-funded
organization
that believes that a very particular theoretical framework can be
expressed
in coded relations that deliver their own answers. To explore this, we
of
course need a lot of data. It is important that science organizations
create
the circumstances that will allow a diversity of independently
theorized
approaches to emerge based on public interest in and public access to
the
data [15]. It is in casting large sets of scientific data into the realm of
artists and, indeed, the public at large that a multitude of self-organized
modes of discovery will develop.
REFERENCES
- 1. www.c5corp.com/projects/rcsp/index.shtml
- 2. www.c5corp.com/walker/gateway.html
- 3. www.c5corp.com/projects/16sessions/index.shtml
- 4. The Internet protocol is the numerical addressing scheme used to
identify
devices on the Internet.
- 5. This later became the technical basis for 1:1,
www.c5corp.com/projects/1to1/index.shtml
- 6. API is the acronym for "application programming interface," which
is a
group of public functions that exist in a library of code that other
programmers can make use of to implement their own code. Artists
should
design APIs as well as use them.
- 7. www.c5corp.com/softsub/index.shtml
- 8. A good example of this is the Spatial Data Transfer Standard.
According
to computer scientist Gregg Townsend, "The adoption of SDTS was a
giant step
backwards. While previous DEM files could be read by relatively simple
programs, SDTS files are difficult to read even with the help of a large
external library." www.cs.arizona.edu/topovista/sdts2dem/
- 9. cadre.sjsu.edu/~gis
- 10. dma.sjsu.edu/jevbratt/lifelike/
- 11. rhizome.org/interface/
- 12. For an example, see fisher.lib.virginia.edu/projects/salem/
*The
GIS of "Salem Village in 1692"* is part of an electronic research
archive of
primary source materials related to the Salem witch trials of 1692.
- 13. Wittig, Geri, *Expansive Order Situated and Distributed Knowledge
Production in Network Space*,
www.c5corp.com/research/situated_distributed.shtml
- 14. Slayton, Joel and Wittig, Geri, "Ontology of Organization as
System," in
*Switch* - the new media journal of the CADRE digital media
laboratory, Fall
1999, Vol. 5, No. 3, switch.sjsu.edu/web/v5n3/F-1.html
- 15. cse.ssl.berkeley.edu/nvo/nvo.htm