The following rare book was made available to the public electronically
through a unique collaboration between the
UC Berkeley Digital Library Project,
BSCIT,
XEROX PARC, and
The Jepson Herbarium.
A Flora of California
Jepson's A Flora of California is an important, insightful,
scholarly milestone still much in demand by scholars. This richly
illustrated botanical masterpiece was published in several
volumes spanning decades, but sadly was never completed before Jepson's death.
The volumes published are a cornucopia of information gathered over Jepson's lifetime
on the plants of California in the early 20th century.
Systematic botany differs from other sciences in its persistent reliance on its early
literature, going back as far as the eighteenth century. Each new study of a
species must be based on a thorough analysis of all previous published studies.
Unlike The Jepson Manual
(1993), the Flora contains information
with specific references to individual specimens, flowering times, original descriptions,
and other detailed information that could not fit into the field-portable Jepson Manual.
Inasmuch as many botanical locations originally described by Jepson have
been disturbed or even obliterated (the Los Angeles basin, the dunes of San
Francisco, etc.), any botanical information on now urbanized or disturbed
sites becomes treasured; thus the original Flora's importance may be
expected to increase over time. However, access to the existing volumes is
limited because the volumes have been out of print for many years (a
limited number is still available from the Jepson Herbarium). Past
attempts to use commercial OCR systems on similar books have failed,
presenting a rigorous challenge for current research.
The Jepson Herbarium
is currently working on a new Jepson Herbarium
Initiative which will use the digitized version of the older Flora as a
template to be revised, updated, and completed for all species in
California. The new Flora will be distributed online. This effort depends
on access to the original Jepson Flora, which has not previously been
available online either as image or text, and is difficult to find in
print. The availability of the Flora in digital format with indexes would
not only greatly facilitate the production of a complete and updated A
Flora of California, but would also provide a valuable resource for researchers
of California plants around the world.
THE SCANNING PROCESS
The Bookscanner
Initial imaging of these volumes was completed using an experimental
prototype scanner developed by XEROX PARC.
The PARC Bookscanner
is designed for use on rare and fragile books, such as
A Flora of California, with minimal impact or damage. Traditional flatbed scanners
can break the spine, and delicate pages can tear from too much operator handling.
The unique design of the bookscanner carefully cradles the book, supporting the spine
while both open pages are scanned at high resolution simultaneously. Image files in TIFF format are
then automatically cataloged in sequential order in a digital archive. With this system, scanning
rates of up to 280 pages an hour are possible.
Q & A about our Experience
Q: What was your actual throughput per hour?
A: Approximately 165 pages per hour. (Initial scanning took 2 long days for 2200+ pages.) After sanity checking, approximately 120 of the 2200+ page images had to be rescanned, mostly as a result of inadvertently cropped margins and other operator error.
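As a rough check on these figures, the arithmetic can be sketched as follows. This is only an illustration: it treats "2200+" as exactly 2200 pages (a lower bound) and compares the observed rate with the 280 pages per hour quoted for the scanner. The variable names are ours.

```python
# Figures quoted in the text; 2200 is a lower bound ("2200+").
pages = 2200
observed_rate = 165   # pages per hour actually achieved
rated_rate = 280      # pages per hour quoted as possible
rescanned = 120       # page images that had to be redone

hours_observed = pages / observed_rate   # scanning time at the observed rate
hours_rated = pages / rated_rate         # scanning time at the quoted maximum
rescan_percent = 100 * rescanned / pages

print(f"Observed rate: about {hours_observed:.1f} hours of scanning")
print(f"Quoted maximum: about {hours_rated:.1f} hours of scanning")
print(f"Rescanned: about {rescan_percent:.1f}% of page images")
```

At the observed rate this is a little over 13 hours of scanning, which is consistent with two long days.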
Q: What resolution did you scan at and what format were the files stored in?
A: Images were acquired at 300 dpi as grayscale TIFF (the native format for the scanner).
They were converted to JPEG for web viewing.
Q: Did you find that the scanner gave you a quality image without damaging the book?
A: Yes. The Jepson Flora volumes were in generally good shape, and common enough that we were not worried if minor damage occurred. Throughput might be significantly reduced for a rarer, more fragile, or more brittle text that requires delicate handling. The cradle with drop-down scanner seemed to be a good design for valuable texts.
We did find some difference in exposure between left-side and right-side pages, which was never fully explained and which led to some problems later when binarizing the images for OCR. One side would come out more exposed, which caused problems when trying to batch process all files at once.
Tweaking of settings during scanning may correct this. Quality of the originals was generally good for our purposes, which were to show the full detail of the page (including scientific illustrations).
Ultimately, the aforementioned left/right exposure problems prevented us from uniformly correcting the exposure for all pages. That, combined with a highly specialized botanical vocabulary, led to OCR error rates that were unacceptable for text recognition and indexing. Because we lack additional staff time to devote to this project, the electronic Jepson Flora is currently available as an image-only product.
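To illustrate why a single batch setting can fail when one side is more exposed, here is a minimal sketch that picks a separate binarization threshold for each page. It is not the method used in this project, which the text does not describe. It assumes Otsu's method as a stand-in, and the pixel values, the fixed threshold of 128, and all function names are hypothetical. Each page is represented as a flat list of 8-bit gray values, whereas a real workflow would load the TIFF files with an imaging library.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates dark from light.

    Pixels with value <= t are treated as ink, the rest as paper.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue          # no dark pixels yet at this level
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """Map each gray value to 0 (ink) or 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Made-up sample data: 20 ink pixels and 80 paper pixels per page.
left_page = [40] * 20 + [180] * 80     # normally exposed side
right_page = [140] * 20 + [250] * 80   # more exposed side: ink looks lighter

FIXED = 128                            # one hypothetical batch-wide threshold
right_fixed = binarize(right_page, FIXED)

left_t = otsu_threshold(left_page)
right_t = otsu_threshold(right_page)
left_bits = binarize(left_page, left_t)
right_bits = binarize(right_page, right_t)

print("ink pixels on right page, fixed threshold:", right_fixed.count(0))
print("ink pixels on right page, per-page threshold:", right_bits.count(0))
```

In this toy data the fixed threshold erases all of the ink on the more exposed page, while a per-page threshold keeps it. Whether this would have lowered the OCR error rate for the Flora is untested.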