Digitization

The data included in the Florida Soils Characterization Retrieval System were collected by the Environmental Pedology Laboratory of the Soil Science Department of the Institute of Food and Agricultural Sciences of the University of Florida, in cooperation with the Soil Conservation Service of the U.S. Department of Agriculture. The data were originally presented as seven printed books. Material for several additional books was generated but never compiled or published. The research was partially supported by State Legislative appropriations (administered by the Department of Agriculture and Consumer Services) and by supplemental funds contributed by the participating counties in support of the Florida Cooperative Soil Survey. The purpose was to “compile and preserve laboratory and morphological data resulting from soil survey and selected research activities in the state of Florida” (Characterization Data for Selected Florida Soils, Book 1, p. iii).

The material can be characterized as:

Volume             Published in     Covering             Pages
Book 1             1974             1965 to mid-1973     294
Book 2             June 1978        1973 to mid-1975     336
Book 3             June 1981        mid-1975 to 1977     306
Book 4             December 1985    1977 to 1979         306
Book 5             June 1988        1979 to 1982         291
Book 6             November 1989    1982 to 1984         308
Book 7             November 1990    1984 to 1986         318
Reference Guide*   1992             1965 to 1986         43
Loose Leafs        unpublished      1986 to 1996         ~300


*Soil Series and County Indices

The cast of professors editing these books changed from year to year and book to book, as did the cast of technicians, graduate students, student assistants, and secretaries. It was a huge undertaking, and transferring all that information to a form usable on the internet was the work of almost a year by many people (see the Acknowledgements section). All the data had to be scanned, proofread several times, and converted to an XML format. Soils personnel then had to update much of the classification that had changed over the years, and the HTML version had to be proofread. Location data were sometimes excellent, often confusing or erroneous, and sometimes nonexistent, so many students had to labor over ArcGIS to find all the sites. Most of our data sources were composed of two parts: metadata and lab data. Metadata give the physical description of the sample location and other information related to it. Lab data are the analytical results from the soil testing lab.

To be used in the online database, these data needed to be digitized. None of the metadata and only approximately 75% of the lab data were available in a digital format. Besides the books, there were hundreds of loose pages (some without lab data, some with incomplete or missing metadata, and some so faded they had to be digitized in order to be read at all).

Each profile was to be assigned an integer identifier that was unique within its county. Thus the combination of the county number (derived from the order in the list of alphabetized county names) and the profile number should have been a unique identification key. As an example, the first profile from Alachua County would be assigned the identifier S01_001. Unfortunately, identification errors of every kind were detected. Individual profiles were repeated in different volumes and given different identification codes. Different profiles were given the same identification code. The profile numbers within a county might not be sequential or might not start at "1", and there were some discrepancies between the county code number and the identity of the county given in the metadata. All such errors and the resulting corrections are documented in the 'Comment' tags of each profile.
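
The intended key scheme can be sketched in a few lines of Python. This is only an illustration, not the utility actually used in the project: the function names are hypothetical, the county list is a three-name subset, and the zero-padding widths (two digits for the county, three for the profile) are inferred from the single example S01_001.

```python
from collections import Counter

# Illustrative subset only; the real list would hold every Florida county.
COUNTIES = sorted(["Baker", "Alachua", "Bay"])


def profile_id(county: str, profile_number: int) -> str:
    """Build an identifier such as S01_001 (padding widths are assumed)."""
    # County number = 1-based position in the alphabetized county list.
    county_number = COUNTIES.index(county) + 1
    return f"S{county_number:02d}_{profile_number:03d}"


def duplicate_ids(ids):
    """Return, in sorted order, every identifier that occurs more than once."""
    counts = Counter(ids)
    return sorted(key for key, n in counts.items() if n > 1)
```

A check such as duplicate_ids would flag one of the problems described above (different profiles sharing a code); the other problems, such as one profile appearing under two codes, need a comparison of the profile contents and are not covered by this sketch.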

Digitization Process

The process started with scanning the pages for each profile using OmniPage Pro 9. “Scan Image” was chosen to scan the original paper document, then “OCR and Proof” was chosen to capture the text from the scanned image. The obvious spelling and format errors were corrected and the document was saved as a Word file. To expedite the work, a utility program was run to coordinate OmniPage Pro and Word and to auto-generate appropriate file names and folders for each profile scanned. If the lab data were not already in digital format, they were entered manually. Subsequently, a person other than the original proofreader would proofread the documents so generated.

After the second proofreading, the data were transferred into a set of XML fields using another utility program. A series of scans of the data was then run to detect problems or ambiguities (such as confusion over the units used), and such errors were rectified. Simultaneously, Mr. Wade Hurt (of the USGS, assigned to UF) hand-annotated the original volumes for changes in soil classifications and any other errors that he detected. Those annotations were then added to the XML fields. In an effort to 'leave fingerprints', we have opted to preserve the original entries along with the corrections. In the event of a correction, the original material was enclosed in curly brackets and the new material was inserted in square brackets. As an example:
     ....grayish-brown (10YR 4/2) {loamy sand}[fine sand]; weak fine granular...
in which "loamy sand" is to be replaced by "fine sand".
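
A reader of the XML fields can recover either version of the text from this notation. The Python sketch below is an assumption-laden illustration rather than part of the original tool chain: the function names are hypothetical, and it assumes every correction is written as a curly-bracket group immediately followed by a square-bracket group, with no nesting.

```python
import re

# {original}[replacement]: braces hold the superseded text,
# square brackets hold the correction that replaces it.
CORRECTION = re.compile(r"\{([^{}]*)\}\[([^\[\]]*)\]")


def corrected(text: str) -> str:
    """Return the text with every correction applied."""
    return CORRECTION.sub(lambda m: m.group(2), text)


def original(text: str) -> str:
    """Return the text as it was printed, before correction."""
    return CORRECTION.sub(lambda m: m.group(1), text)


def corrections(text: str):
    """List the (original, replacement) pairs found in the text."""
    return CORRECTION.findall(text)
```

Applied to the example above, corrected yields "...(10YR 4/2) fine sand; weak fine granular..." and original yields the same phrase with "loamy sand".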

A number of fields were created to reflect the original location of the data. Some of these are:

  • BOOK_NUMBER = the volume as listed above
  • META_PAGE = the page on which the metadata were found
  • TABLE_PAGE = the page on which the lab data were found
  • COMMENT = any comment concerning the profile in general
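
As a rough picture of how these provenance fields might sit together in one record, the following Python sketch builds a small XML fragment with the standard library. Only the four field names come from the list above; the PROFILE wrapper element, its id attribute, the function name, and all values are hypothetical and do not describe the actual schema.

```python
import xml.etree.ElementTree as ET


def provenance_record(profile_id, book_number, meta_page, table_page, comment=""):
    """Build a PROFILE element (hypothetical wrapper) holding the provenance fields."""
    profile = ET.Element("PROFILE", {"id": profile_id})
    fields = (
        ("BOOK_NUMBER", book_number),  # the volume the profile came from
        ("META_PAGE", meta_page),      # page holding the metadata
        ("TABLE_PAGE", table_page),    # page holding the lab data
        ("COMMENT", comment),          # general remarks on the profile
    )
    for tag, value in fields:
        ET.SubElement(profile, tag).text = str(value)
    return profile


# Illustrative values only; they are not taken from the books.
record = provenance_record("S01_001", 1, 12, 13, "illustrative values only")
print(ET.tostring(record, encoding="unicode"))
```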