Public Domain Image Databases for Taxonomy

Join Forum: Taxonomy Initiatives for Biodiversity Conservation in an IT Era
January 13-14, 2001

Public Domain Image Databases for Taxonomic Research and Education: A Case Study, Protist Image Database

*Yuuji Tsukii¹, Akira Kihara¹, Yoshihiro Ugawa² (*Contact Person)
¹Laboratory of Biology, Science Research Center, Hosei University, Fujimi, Chiyoda-ku, Tokyo 102-8160, Japan; ²Environmental Education Center, Miyagi University of Education, Aramaki-Aoba, Aoba-ku, Sendai 980-0845, Japan

Abstract

DNA databases that compile research resources (i.e., sequence data) are now indispensable to the genome sciences. Likewise, databasing the vast amount of taxonomic resources, most of which are images of specimens and their descriptions, is highly promising for taxonomic research and education.
In contrast to the DNA databases, however, it is almost impossible to centralize such varied taxonomic resources into a few databases. Taxonomy databases will therefore grow as a distributed public domain database (Green, 1994), in which each database is maintained as a "volunteer database" by individual researchers.
The Protist Image Database has been constructed as one such volunteer database, aiming to provide protist images and related information as research and educational resources via the Internet. Currently, more than 18,000 images and their taxonomic descriptions, covering 401 genera and 1398 species, can be downloaded from our web site (URL: http://protist.i.hosei.ac.jp/index.html). In addition, we have been advising or assisting other volunteer databases on the taxonomy of various organisms.
In the course of these activities, we found that the Internet lacks both the quality-control and the preservation systems essential for scholarly communication. To establish volunteer databases as academic resources, they must be qualified and preserved by public organizations, just as journals are stored in public libraries.

Keywords: distributed public domain database, volunteer database, biological images, taxonomy, protists

Introduction

Biology, especially taxonomy, relies on compiling a large body of observations, most of which are recorded as images (photographs or drawings). However, only a small fraction of those images has been published in academic journals because of page limits; the remainder go unused and are eventually discarded when the researchers retire. If such images (and their descriptions) were databased and opened to the public on the Internet, those databases would promote biological research and education, especially in taxonomy, just as present-day DNA databases do.

In contrast to the DNA databases, however, it is practically impossible to centralize such a large number of images, with their various sizes and formats, into a few databases. Biological image databases will therefore inevitably grow as a distributed public domain database (Green, 1994), in which each database is maintained as a "volunteer database" by individual researchers.

Needs for image databases in protist taxonomy

We have been constructing the "Protist Image Database" as one such volunteer database since 1995 (Fig. 1, URL: http://protist.i.hosei.ac.jp/index.html). After making the database public on the Internet, we found that image databases can play an important role in protist taxonomy. Because many protist species lack "type specimens" owing to the difficulty of preserving them, protist taxonomy has relied mostly on drawings and their descriptions, which causes confusion and inefficiency in species identification. In this situation, databases that compile images of many protists can compensate, at least in part, for the lack of type specimens by helping with species identification and by serving as resources for taxonomic research on this group.

Furthermore, many protist species are cosmopolitan, i.e., distributed worldwide. This means that many species can be found in samples collected from even a small local area. To identify protist species, therefore, one would always need to carry a large taxonomic guidebook describing all known protist species, which is in practice impossible.

Thus, databasing protist images and their taxonomic descriptions and making such databases public on the Internet will help protist taxonomists themselves as well as anyone who wants to know the name of a protist species.

Basic features of Protist image database

The Protist Image Database was originally intended to collect images (photographs) and other research resources that we and other researchers (mostly protozoologists) had kept but left unused. At first, therefore, our database consisted of images of a limited number of species used in laboratories, such as Paramecium and Amoeba.

However, after making the database public, we found that most users are not protistologists but researchers in other fields or non-researchers, e.g., school teachers and people working at waterworks or at environmental assessment companies, all of whom want to know the basic taxonomy of protists.

We have therefore shifted the main purpose of our database from collecting research resources alone to compiling resources on protist taxonomy (i.e., images of many protists collected in the field and their descriptions drawn from many references). At present, more than 18,000 images and their taxonomic descriptions, covering 401 genera and 1398 species, can be downloaded from our web site.


Fig. 1 Protist Image Database, English menu. URL; http://protist.i.hosei.ac.jp/index.html

Fig. 2 A sample of "genus" webpages. In this webpage, the names of species belonging to the genus Euglena are listed with their representative thumbnail images.

All images are presented specimen by specimen (cell by cell) and classified by taxonomic rank (species, genus, family, etc.). For example, a "genus" webpage lists the names of the species belonging to that genus with their thumbnail images (Fig. 2), and each species name is linked to the corresponding "species" webpage. The "species" webpage lists thumbnails of the specimens belonging to that species (Fig. 3), which in turn lead to the basic "specimen" webpages, each containing many images of one specimen (Fig. 4).
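The genus, species, and specimen page hierarchy described above can be sketched as a small data structure. This is only an illustration: the class names, the `specimen_id` field, and the rule of using the first image as the representative thumbnail are assumptions, not a description of how the database itself is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Specimen:
    specimen_id: str                              # hypothetical identifier
    images: list = field(default_factory=list)    # image file names for one cell

    def thumbnail(self):
        # Assumption: the first image represents the specimen.
        return self.images[0] if self.images else None


@dataclass
class Species:
    name: str
    specimens: list = field(default_factory=list)

    def thumbnail(self):
        # Representative thumbnail: first specimen that has any image.
        for specimen in self.specimens:
            thumb = specimen.thumbnail()
            if thumb is not None:
                return thumb
        return None

    def species_page(self):
        # A "species" page lists each specimen with its thumbnail.
        return [(s.specimen_id, s.thumbnail()) for s in self.specimens]


@dataclass
class Genus:
    name: str
    species: list = field(default_factory=list)

    def genus_page(self):
        # A "genus" page lists each species name with a thumbnail.
        return [(sp.name, sp.thumbnail()) for sp in self.species]
```

Each level only needs to know its children, so a "genus" page can be generated by walking one level down, and a "specimen" page is simply the list of images held by one `Specimen`.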


Fig. 3 A sample of "species" webpages. In this webpage, specimens belonging to the species Euglena spirogyra are listed with their thumbnail images. The species names under the thumbnails are linked to each specimen's webpage (Fig. 4).

Fig. 4 A sample of "specimen" webpages. By clicking on the thumbnail images listed, users can select four types of images with different magnifications (see Fig. 5).

In our database, most images are digitized from 35 mm reversal films using the PhotoCD service. Each image consists of five graphics files (JPEG files) at different magnifications: a thumbnail file of 96 x 64 pixels and enlarged-image files of 192 x 128, 384 x 256, 768 x 512, and 1536 x 1024 pixels. The smaller files are intended for viewing on monitors, which have low resolution (usually 72 dpi), and the larger ones for printing, which usually requires higher resolution (300 or 600 dpi or more) (Fig. 5). Such sets of image files at various magnifications are an essential requirement for academic resource databases.
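The five sizes listed above each double the width and height of the previous one, starting from the 96 x 64 thumbnail. The following sketch reproduces that series and converts pixel dimensions to physical size at a given resolution; the function names are hypothetical and chosen only for this illustration.

```python
def image_pyramid(base=(96, 64), levels=5):
    """Return the list of (width, height) sizes, doubling at each level."""
    width, height = base
    return [(width * 2 ** i, height * 2 ** i) for i in range(levels)]


def size_in_inches(pixels, dpi):
    """Physical (width, height) in inches of an image shown or printed at `dpi`."""
    width, height = pixels
    return (width / dpi, height / dpi)
```

For example, `size_in_inches((768, 512), 72)` gives roughly 10.7 x 7.1 inches on a 72 dpi monitor, while `size_in_inches((1536, 1024), 300)` gives about 5.1 x 3.4 inches in print, which is why the largest file is the one suited to printing.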

Fig. 5 A webpage showing an enlarged image of Euglena spirogyra
This image is 768 x 512 pixels in size. By clicking the thumbnail images, users can select images of other sizes: 192 x 128, 384 x 256, and 1536 x 1024 pixels.

Our database presently consists of about 110,000 files in total (90,000 graphics files and 20,000 text files in HTML format), which occupy about 3.2 gigabytes on our server.

Contributors and Collaborators

We not only provide our own images through the Internet but also accept contributions of images from users who work on protists (URL, http://protist.i.hosei.ac.jp/PDB/contributors_E.html). We also welcome other kinds of collaboration from users, e.g., comments on or corrections to our webpages, and help in identifying the species in images that we have been unable to identify ourselves. Such cooperation between database managers like us and their users can serve as a quality-control system for volunteer databases.

How people are using our database

In addition to the contributions and collaborations mentioned above, we accept various requests and questions from users via e-mail ([email protected]). Many requests are for permission to use images on web pages, in print (mostly textbooks) or on CD-ROM, in poster presentations, or in papers (e.g., master's theses). Others are requests for species identification or for strains of protists, or questions about culturing protists, etc.

We have also been distributing CD-ROMs containing part of our database, free of charge, on request (in Japan only; 1,000 copies of version 1 were pressed in 1995, 5,000 copies of version 2 in 1996, and 10,000 copies of version 3 in 1998). Our CD-ROM contains only one size of enlarged-image file (768 x 512 pixels) because of the limited capacity of a CD-ROM (640 megabytes). As the number of images has increased, however, a single CD-ROM has become insufficient to hold all the images even with the enlarged-image files limited to one size. We are therefore now planning to distribute our database contents on DVD-ROM.

Our CD-ROM distribution service benefits both the researchers who build databases and their users. For researchers, because a CD-ROM can be treated like printed matter in a library, CD-ROM publishing is probably the best way to preserve database contents under present conditions, in which there is still no public organization for preserving voluntarily delivered (self-published) information, as discussed later. It may also lead to our databasing activity being evaluated as part of a scientific career. For users, because databases containing many image files tend to be large, off-line access using a CD-ROM is better than on-line access over a low-speed network.

Table 1 The state of CD-ROM distribution

Group	No. of Users	No. of CDs
Researchers at	263	1598
    University	167
    Other institutes	56
    Private company	40
Teachers at	168	4372
    High school	98
    Middle school	40
    Primary school	9
    Others	20
Company, etc.	46	105
Misc.	65	556
    Undergraduate students	29
    High school students	5
    Others	31
Unknown	87	216
Total	629	6847

(2001.1.29)

On receiving requests by e-mail, we asked users about their occupations and the purposes for which they wanted to use our CD-ROM.

Currently, our CD-ROM (version 3) has been distributed to more than 600 people, and the number of CD-ROMs distributed has reached about 7000 (Table 1). To save distribution costs, we ask users to cooperate in secondary distribution, so that these 600 users have helped us distribute the 7000 CD-ROMs to other users. Of the 7000 CD-ROMs, about 4400 were distributed by 168 school teachers to colleagues in their communities at the prefecture or city level.
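The rounded figures in this paragraph can be checked against the top-level rows of Table 1. The short sketch below, whose variable and function names are hypothetical, recomputes the totals, the number of CD-ROMs per requesting user, and each group's share of all CD-ROMs.

```python
# Top-level rows of Table 1: group -> (number of users, number of CDs)
TABLE_1 = {
    "Researchers": (263, 1598),
    "Teachers": (168, 4372),
    "Company, etc.": (46, 105),
    "Misc.": (65, 556),
    "Unknown": (87, 216),
}


def totals(table):
    """Sum users and CDs over all groups."""
    users = sum(u for u, _ in table.values())
    cds = sum(c for _, c in table.values())
    return users, cds


def cds_per_user(table):
    """Average number of CDs handed to each requesting user, per group."""
    return {group: cds / users for group, (users, cds) in table.items()}


def share_of_cds(table):
    """Fraction of all distributed CDs that went through each group."""
    _, total_cds = totals(table)
    return {group: cds / total_cds for group, (_, cds) in table.items()}
```

On these numbers, teachers received about 26 CD-ROMs each and account for roughly 64% of all copies, which is consistent with the secondary distribution described above.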

Consulting other volunteer databases

Besides constructing our own database, we have been advising other volunteer databases (Tsukii et al., 1995). Since 1997, we have been working on a project, "Construction of Biological Image Databases" (in short, the "Soken-Taxa project", URL; http://taxa.soken.ac.jp/), at the Graduate University for Advanced Studies, in which we advise or assist in the construction of image databases on various organisms, as follows:

1. Japanese Ant Color Image Database
URL; http://taxa.soken.ac.jp/Ant.WWW/INDEXE.HTML
2. Marine Mammal Stranding Database
URL; http://svrsh1.kahaku.go.jp/index.htm
3. Mammalian Crania Picture Archive
URL; http://1kai.dokkyomed.ac.jp/mammal/en/mammal.html
4. Mouse Image Database
URL; http://mouse.miyazaki-med.ac.jp:591/mouse1/
5. Morning Glories Database
URL; http://taxa.soken.ac.jp/Asagao/Yoneda/menu.html
6. Makino Type Specimen Database
URL; http://wwwmakino.shizen.metro-u.ac.jp/database.htm

Recently, many other volunteer image databases on the taxonomy of various organisms have been appearing in Japan.

Research on support systems for volunteer "bio-resource" databases

In addition to advising other volunteer databases one by one, we are now developing more general support systems for databasing and publicizing biological research resources. Since 1997, we have taken part in another project, "Fundamental research and development for databasing and networking culture collection information" (in short, the "Bio-Resource project", URL; http://bio.tokyo.jst.go.jp/biores/index.htm), at JST (Japan Science and Technology Corporation).

In this project, we have developed various support systems that enable biologists to build their own databases accessible through the Internet. These systems are:
1) Optimized procedures for digitizing still images and assembling them into a database.
2) Systems for making on-line movie databases, including optimized techniques for digitizing movies, methods for compressing and decompressing them, and construction of servers for the movie databases.
3) A method for making WWW-browsable "digital image books", which will make rare but important books and papers easier to read.
4) Manuals for the maintenance and management of databases, to be published in print and as web pages (Tsukii & Kihara, 1999).

In the course of these activities, we gradually became aware of an important defect of the Internet with regard to scholarly communication, one that makes researchers hesitate to publicize their own resources on the net. The defect is the lack of public systems for qualifying and preserving information voluntarily delivered (self-published) by researchers on the Internet.

Printing vs Internet

One of the basic features of academic information is that it is permanently preserved by public organizations such as university libraries. Before the Internet era, information produced by researchers was written up as manuscripts and submitted to journals, where it was qualified through the peer review system. After qualification, the information was publicized worldwide in print. Although most journal copies bought by individuals will eventually be lost, those bought by public institutions (e.g., university libraries) will be kept for a long time to serve as references for future researchers and others.

In other words, academic information publicized in print has been part of a well-established social system: 1) production of information by researchers, 2) its publication by publishers after quality control, and 3) its preservation by librarians (Fig. 6).

On the other hand, the Internet as a "new medium" differs fundamentally from printing as a "mass medium": researchers, or indeed anyone, can be both the "producers" and the "publishers" of their information on the Internet. This will promote the exchange of information not only within the scientific community but also between scientists and other people. At present, however, there are neither quality-control nor preservation systems for information publicized through the Internet, except for genome information.


Fig. 6 Printing vs Internet
In printing, academic information is part of a well-established system: production of the information by researchers, quality control and publication by journal publishers, and permanent preservation by university librarians. On the Internet, in contrast, researchers are able not only to produce but also to publicize their information. However, there are still no public organizations ("Support centers" in this figure) for qualifying and preserving such information, so information publicized on the net cannot at present be used as an academic resource.

In the genome sciences, all sequence data are centralized on a few computers and maintained by specialists, and quality control is ensured by cooperation between the DNA database centers and the publishers of the journals to which researchers submit papers analyzing their sequences. The sequence data are preserved by the DNA database centers with government support. This quality-control and preservation system for genome information functions like that of printed journals and allows the sequencing work of genome researchers to be evaluated as part of their scientific careers.

In contrast, centralizing other biological resources such as images is practically impossible, as already mentioned, and they should therefore be databased and publicized on the Internet by the researchers themselves. However, the Internet has no systems for qualifying and preserving such voluntarily delivered information. This situation makes researchers unwilling to publicize their own resources via the Internet, because such work cannot be evaluated as part of a scientific career.

Public organizations for quality control and preservation of voluntarily delivered biodiversity resources

To establish volunteer databases as academic resources, therefore, they must be qualified and permanently preserved by public organizations, just as journals are stored in public libraries (Fig. 6). For example, databases or any other information publicized on the Internet by individuals or groups (mostly researchers) could be qualified by a committee authorized by an academic society or an equivalent body. If the contents are judged to have enough value as research and/or educational resources, the committee should issue "accession codes" for them and, at the same time, the contents should be backed up by public organizations (e.g., the "Support centers" in Fig. 6). When the original databases are updated, the centers should add only the updated files to the backup. Under this system, the "accession codes" can be cited in papers in the same way as accession numbers in genome databases, and by looking up the codes, users can access the backed-up contents even after the original databases (or web sites) have disappeared from the net because of the researchers' retirement or for other reasons.
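The proposed workflow (issue an accession code, back up the contents, then add only updated files) can be illustrated with a minimal in-memory sketch. Everything specific here is an assumption made for the example: the `SupportCenter` class, the `VDB` code prefix and six-digit numbering, and the use of SHA-256 digests to detect changed files are not part of the proposal itself.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    """Content fingerprint used to detect changed files."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class SupportCenter:
    prefix: str = "VDB"                           # hypothetical code prefix
    backups: dict = field(default_factory=dict)   # code -> {path: [versions]}
    issued: int = 0                               # number of codes issued so far

    def register(self, files: dict) -> str:
        """Issue an accession code for qualified contents and back them up."""
        self.issued += 1
        code = f"{self.prefix}{self.issued:06d}"
        self.backups[code] = {path: [data] for path, data in files.items()}
        return code

    def update(self, code: str, files: dict) -> list:
        """Add only new or changed files to the backup; return their paths."""
        stored = self.backups[code]
        changed = []
        for path, data in files.items():
            versions = stored.setdefault(path, [])
            if not versions or digest(versions[-1]) != digest(data):
                versions.append(data)   # earlier versions are kept, not overwritten
                changed.append(path)
        return sorted(changed)

    def fetch(self, code: str) -> dict:
        """Return the latest backed-up version of every file under a code."""
        return {path: versions[-1] for path, versions in self.backups[code].items()}
```

Because `fetch` depends only on the code and the center's own copy, the contents stay retrievable after the original web site is gone, and keeping earlier versions means a citation made before an update can still be resolved.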

These quality-control and preservation systems will be needed for all kinds of academic resources voluntarily delivered on the Internet. Establishing them is especially urgent in taxonomy, where a vast amount of biodiversity resources needs to be databased and opened to the public.

Acknowledgements

Our research on databasing and publicizing of biological resources through the Internet is supported by the Bio-Resource project, "Fundamental research and development for databasing and networking culture collection information" (1997-2001) at JST (Japan Science and Technology Corporation) and by the Soken-Taxa project, "Construction of Biological Image Databases" (1997-1999) at The Graduate University for Advanced Studies. This work was also supported by a grant 07558052 (1995-1996) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

1. Green, D.G. (1994) Databasing diversity - a distributed, public-domain approach, Taxon 43: 51-62. URL; http://life.csu.edu.au/~dgreen/papers/taxon.html
2. Tsukii, Y., Kihara, A., and Ugawa, Y. (1995) Distributed public domain databases (DPDD) of biological information on Internet: An introduction of a color image database for Japanese ants, Japanese Journal of Computer Science, 2: 5-13 (in Japanese). URL; http://protist.i.hosei.ac.jp/ProtistInfo/JJCS/E/index.htm (in English)