There are a variety of decisions to be made about how to present our digital collections online.  One of these decisions is what format to use to upload the digital items to our hosted CONTENTdm server.

For single items such as photographs and slides, uploading jpgs of the individual images is an obvious decision.  For multipage items such as letters, diaries, books, and serial volumes, there are other options.  For multipage items with typed text, we combine all the scanned images of each page into one PDF and then use Optical Character Recognition (OCR) to read that text and make it searchable online.  CONTENTdm automatically pulls the text from OCR-ed PDFs and puts it in searchable metadata fields for each page of the item.  For handwritten items such as letters and diaries, OCR does not work.  These items need to be read and transcribed by a person, who converts the handwritten content into typed text so that it can be searched online.  When we first created a few of our digital collections of multipage handwritten items, we used a very handy process available in CONTENTdm that automatically matched the jpg image of each page with the text file containing the typed transcription of that page.

Our CONTENTdm license allows for a maximum of 10,000 items on the hosted server, and we recently came very close to that limit.  We thought that we had simply gotten to the point where we would have to pay more for the next level of storage to be able to add more items.  However, I looked at the number of items in each of our digital collections and noticed that three of the collections had a much higher item count than the number of items actually in the collection. These three were the collections with handwritten items that had been transcribed.  This is because CONTENTdm counts items differently depending on how they are uploaded to the server.

[Image: Civil War collection screenshot]

Our Civil War collection currently has 10 items: 8 diaries and 2 letters.  The official count for this collection was 795 items, because CONTENTdm counted every jpg of every page of those 10 items.  I realized that if we put those items up as PDFs, they would be counted as only 10 items no matter how many pages there are in each PDF.  However, this method requires a lot more work.  When each item is uploaded as a single PDF instead of multiple jpgs, there is no automated way to have the files of transcribed text match up with each page.  Instead, I have to copy and paste the transcribed text of each page into the searchable metadata field corresponding to that page.

The payoff is that, once I had made these changes, the item count for our Civil War collection decreased from 795 items to 10 items, and disk usage for that collection decreased from 5.112 GB to 1.0 GB.  By using this same process, I can reduce another similar collection from 1,635 items down to only 39 items, and yet another collection from 346 items down to 16 items.  When I’m done changing all of these collections, I will have given us the ability to add 2,711 more items to our collections before we have to spend thousands of dollars to upgrade to the next level of storage.  So, by choosing a different format and investing my time in a few days of copying and pasting, we can save a few thousand dollars.
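
For anyone who wants to check the arithmetic behind that 2,711 figure, here is a small Python sketch.  The before and after counts are the ones given in this post; the labels for the second and third collections are placeholders, since they are not named here, and the `items_freed` helper is just something made up for this illustration.

```python
# Item counts before and after re-uploading each collection as PDFs.
# The numbers come from the post. The labels "Collection B" and
# "Collection C" are placeholders, not the real collection names.
collections = {
    "Civil War": (795, 10),
    "Collection B (placeholder)": (1635, 39),
    "Collection C (placeholder)": (346, 16),
}


def items_freed(counts):
    """Return the total number of item slots freed across all collections."""
    return sum(before - after for before, after in counts.values())


# Show the saving for each collection, then the combined total.
for name, (before, after) in collections.items():
    print(f"{name}: {before} -> {after} (frees {before - after})")

print("Total freed:", items_freed(collections))  # Total freed: 2711
```

The three savings are 785, 1,596, and 330 items, which add up to the 2,711 items of headroom under the 10,000-item license limit.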
