Colleen's experiments
April 28, 2008
Today was the first day of trying out the Contentdm harvesting tool. As always, this first attempt to use a new software application was full of frustrations but by the end of the day, I was able to move around quite readily and was able to harvest the 5 ADECA annual reports. Here are the problems I foresee using this software:
1. It is clunky! One harvests and "ingests" in OCLC Connexion. When the ingest is complete, a URL is created and placed in the bib record. Clicking on that link allows the cataloger to go out on the Web and see the new object in Contentdm but only after waiting up to 30 minutes for OCLC to reindex.
2. Each issue of the serial has to be harvested and ingested separately in order to have issue level metadata attached to it. This proved to be very time-consuming because the files are large and take a long time to open.
3. Contentdm does not allow one to harvest a website and then exclude unwanted files, as the old software did. This means that harvesting is all or nothing.
4. One must log into the admin. module of Contentdm in order to edit metadata. The ultimate result is that the cataloger will have several pages open at a time:
    1. The local catalog (perhaps)
    2. OCLC client
    3. Contentdm Admin module
    4. Contentdm Collection (viewing) module
    5. The government site being harvested.
5. The sort, both in the public view and in the Admin module, is quite quirky. There may be some workarounds.
May 6, 2008
I discovered Monday that the Web Harvester is broken. I called OCLC Technical Support to report this, but that turned out to be the wrong place to take such issues because they are out of the loop. I didn't realize that until they called this a.m. to tell me that the harvester has never worked and will not be a working tool for the foreseeable future.
The problem is now in the hands of those who need to know and can see to fixing the problem.
The bigger problem may be the ARC viewer's inability to deal with some subset of the PDFs we need to capture. There is not yet a predicted date by which OCLC expects to have that fixed. So our OCLC/Contentdm liaison sent out a memo to tell all of us (Contentdm users) to use Connexion's "Digital Import" function, until it is. Unfortunately, this is an added service which must be paid for. I have asked OCLC to send us a quote since they will not give pricing information over the phone.
May 9, 2008
The harvester has been fixed. It was not the harvester that was broken but my interaction with it. The problem turned out to be an ampersand in a collection name. Apparently, no one knew that symbols could or would affect performance. In any case, I renamed the file and we have been up and running for a couple of days. I have managed to harvest more documents. However, I still have the frustration of never knowing whether or not I will be able to harvest the document that I am cataloging. I have to have a bib record first and then I can try to harvest the content it is describing.
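Since the failure traced back to a single ampersand, a quick check of names before harvesting might catch similar cases. What follows is only a minimal sketch in Python, not anything supplied by OCLC or Contentdm. The function name, the sample collection names, and the assumption that only letters, digits, spaces, hyphens and underscores are safe are all mine; I do not know which characters the harvester actually tolerates.

```python
import re

# Assumption: only letters, digits, spaces, hyphens, and underscores are "safe".
# The real rules the harvester applies are unknown; adjust this set as needed.
SAFE_CHAR = re.compile(r"[A-Za-z0-9 _-]")


def risky_characters(name):
    """Return the characters in `name` outside the assumed safe set.

    Characters are listed in order of first appearance, without duplicates.
    """
    found = []
    for ch in name:
        if not SAFE_CHAR.fullmatch(ch) and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    # Hypothetical collection names, used only for illustration.
    for name in ["Health & Vital Statistics", "Health and Vital Statistics"]:
        problems = risky_characters(name)
        if problems:
            print(f"{name!r}: check these characters: {problems}")
        else:
            print(f"{name!r}: no risky characters found")
```

A name that passes this check could still fail for other reasons; the sketch only flags the kind of symbol that caused trouble here.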
On Wed. I talked to a sales rep who called about Connexion's "Digital Import" function. I told her why I was interested in it and asked her to see if the DI group would consider making it available to us Contentdm users for free, until the OCLC harvester is able to handle all PDFs. I have not yet heard anything further.
May 12, 2008
We have access to Connexion's Digital Import function, and without extra cost. It is simply amazing that there is so little understanding at OCLC about just what Contentdm is and how OCLC is involved with it. That makes it very hard for the people who answer the phones to help users like me with the questions that come up. Our liaison told me that all OCLC members who have full cataloging authorizations and a hosted Contentdm server can use this function without extra charge. However, it adds many more steps to the harvesting process. Still, it is the workaround until the ARC viewer is fully functional.
The bad news about that is that the viewer is just not working well with the existing architecture and the developers do not know when it will be. (I am paraphrasing our OCLC/Contentdm liaison). She, in turn, will send out more information about that when she has some.
The question of what I should be cataloging is becoming very pressing. I should be able to harvest the rest of the criminal justice statistics and the ADPH vital statistics this week. I have been cataloging various annual reports (Dept. of Personnel, Public Finance, ADEM, etc.), and I have cataloged some of their newsletters and a couple of monographic publications as well, e.g. Healthy Alabama 2010.
With nearly a year to go, I can get many, many more titles cataloged and archived. What titles should they be?
May 13, 2008
Using the "attach digital content" function, I succeeded in archiving the County Health Profiles that are available (2004-2006), a monograph, "Healthy Alabama 2010", and a few ADPH annual reports (2000-2002). The bad news is that it has taken me nearly 6 hours to do it. Some of this time was spent learning how to use this new tool; it has proved very tricky to edit the files, as they must be, in Contentdm. The bigger problem has been the nearly endless wait as each file uploads (10-17 minutes each), on top of the time it takes to load the original PDF and save it to my computer. Of course, this is a workaround until the harvester works but, as I learned yesterday, that is not going to happen quickly.
May 14, 2008
The new timeline for converting from the OCLC Digital Archive to Contentdm (copied from message sent to the users' list):
May 19th
We will begin the data conversion from current Digital Archive to new Digital Archive. We’ll also be working on fixing the display problems.
Mid-June
We expect to have the display problems fixed. You will have another ‘beta testing’ period to use the new Harvester in Connexion Client. The current web harvesting tools in Connexion Browser and the current Digital Archive will still be available to you. We will begin the conversion of files from the new Digital Archive to CONTENTdm.
Late July
Current web harvesting tools in Connexion Browser and the current Digital Archive are ‘turned off’. Any remaining data in the current Digital Archive is converted to the new Archive and to CONTENTdm
May 21, 2008
Unfortunately, the "attach digital content" function (hereafter referred to as ADC) cannot solve all problems. I tried to harvest the annual reports of the Division of Risk Management and, after waiting about 10 minutes for the file to upload, I received the error message "Could not access content". At this point, I have no way to determine if the problem is with the "attach digital content" function or with the file. I am inclined to think it is the file. I chose to try to harvest it because it is one of a number of documents I have seen at various agencies which have been broken into separate files, usually (so far as I have seen) of approximately 10 pages each. (See an example here: http://www.riskmgt.alabama.gov/Downloads.aspx#Annual_Reports and click on Annual Reports.) I wanted to see if Contentdm could handle this. In theory it should present no problems.
Another problem I have run into may be specific to an optional ADC feature, and it may also stem from a problem on the agency's end. If the feature is enabled, ADC breaks every document into separate pages (I enabled it on the recommendation of OCLC/Contentdm). This makes it easy to print just the page that one wants and to add just that page to one's "favorites". I dislike it very much, as it prevents scrolling through the document. To see an example, compare any of the Ala. Dept. of Environmental Management annual reports (except for 2004) with any annual report of the Ala. Public Service Commission. I would like to know what the task force thinks. I can easily disable this feature if there is general agreement that it is not helpful.
With one newsletter, I found for the first time that there were only 3 pages, each of which contained multiple pages. Moreover, they were out of order: when I clicked on p. 1, I got page 32 and everything following it to the end; p. 2 gave me pp. 1-16 of the newsletter; and p. 3 gave me pp. 17-31. This was not one of the titles broken into separate files that I mentioned above. It was one file that looked fine when viewed at the agency website. I am baffled and have sent a message to our liaison about this.
May 28, 2008
The OCLC/Contentdm group had the first of three online meetings in which current Contentdm users talk about their workflows and other issues related to archiving state documents. Today, Nick Robinson spoke on behalf of four small libraries working collaboratively in California. They are in year two of the work. They have defined the group of documents that they are archiving very precisely; they are interested in city and county budgets, grand jury reports, local planning documents and something else that I did not catch. They have determined that there are certain kinds of documents that they cannot harvest, e.g. pdfs with internal links to other documents. They have gotten around some problems by uploading documents to their own server and then harvesting them.
I learned that they track every document that they catalog and harvest on a spreadsheet which is used by students to check the websites of the various agencies for new documents on a regular basis (they check approximately twice a year). I was surprised because I have been under the impression all along that there was some sort of functionality built into the tools that allowed this to be automated. That is not the case.
Judy Cobb, our OCLC/Contentdm liaison, updated us on various timelines. Of particular importance is the availability of the new viewer, which will be rolled out on June 16. So far, she reports, it has performed flawlessly in testing. From June 16 to July 27 the new viewer will be in beta test. As of July 27 the current browser-based system will no longer be available.
Next week, Stephen Slovasky (Connecticut State Library) will be discussing workflow and cataloging issues in that library.
I have continued to catalog documents that I think will be desirable. Today I finished a record for "Selected maternal and child health statistics, Alabama, 2006" and tried to harvest it. The attempt failed and the error message returned appears to relate to security measures protecting the document. I have referred the problem to Judy Cobb.
June 3, 2008
I have started adding dates to titles to indicate when they were archived.
June 11, 2008
I have made an added entry for our project on all the titles I have archived. You will see this on my records: 710 2# NAAL State Documents Project
The 710 is an (author/creator/contributor) added entry and putting it on the records for our project titles makes retrieving a list of all of them from OCLC very easy for anyone who wants to have one.
June 13, 2008
Today I discovered, by accident, that new content has been added at the Alabama Criminal Justice Information Center; 2 new annual reports for 2007 have been added for Crime in Alabama and Domestic Violence in Alabama (additionally, 2006 has been scanned and added). This had to happen sooner or later, but it has only been a month in one case, and less than a month in the other, since those titles were added to Contentdm. I simply do not have the time now to recheck everything, and there is absolutely no automated way to monitor the websites of the various agencies for changes; it must be done manually.
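To make the manual recheck a little less painful, something like the following could compare the PDF links on a saved copy of an agency page against the list of files already archived. This is a rough sketch in Python under my own assumptions, not a feature of the harvester or of Contentdm: the function names and the sample links are hypothetical, it does not fetch anything from the Web, and it would not notice a file that was replaced under the same name (as when an older year is rescanned).

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collect the href values of <a> tags that point at PDF files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value and value.lower().endswith(".pdf"):
                self.links.append(value)


def pdf_links(html_text):
    """Return every PDF link found in the page text, in page order."""
    collector = PdfLinkCollector()
    collector.feed(html_text)
    return collector.links


def new_documents(already_archived, html_text):
    """Return PDF links on the page that are not in the archived list."""
    known = set(already_archived)
    fresh = []
    for link in pdf_links(html_text):
        if link not in known and link not in fresh:
            fresh.append(link)
    return fresh


if __name__ == "__main__":
    # Hypothetical page content and file names, for illustration only.
    page = (
        '<a href="reports/crime2006.pdf">2006</a>'
        '<a href="reports/crime2007.pdf">2007</a>'
        '<a href="about.html">About</a>'
    )
    archived = ["reports/crime2006.pdf"]
    for link in new_documents(archived, page):
        print("Not yet archived:", link)
```

The list of archived links would have to be kept up by hand, much like the spreadsheet the California group uses, so this only shortens the checking step rather than replacing it.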
June 20, 2008
I am very pleased to report that the redesigned PDF viewer appears to be working flawlessly. I have been able to go back and capture files that previously eluded capture.
I also wanted to mention that I have had a number of very positive, helpful interactions with people at various agencies whom I have had to contact for one reason or another. On Wed. I emailed ADEM to ask if it might be possible for the 2006 annual report to be redone, as its size (213.79 MB!) made it virtually unloadable. Now, I supposed that this *might* happen sometime in the future, as time permitted. So, it was quite a surprise to receive an answer less than 24 hours later that the report had been turned into a mere 2.2 MB file. It has been captured and now we have a complete (as of today) run of this agency's annual reports.
Aug. 13, 2008
I have cataloged ACHE's "Statistical Abstract: Higher Education in Alabama", which actually required two records, as the Abstract became "Statistical Reports" in 2003/2004. The reports from 2005/2006 onward cannot be archived at this time, as they consist of multiple unlinked PDF files. The introduction to each states that there is (or will be) a link to the whole report, but that link has not been activated for the later reports.
All of the files are large and very slow to load. They are also proving very slow to harvest. The harvest has still not finished, a full hour after it was started. The harvester will try for 24 hours (or more) before giving up. This brings up a question that should ultimately be settled by the group: Do we want to archive files that are so slow to load that our patrons are not likely to wait for them?
Sept. 18, 2008
I added "Selected maternal and child health statistics, Alabama, 2005" today. (I had previously cataloged "Selected maternal and child health statistics, Alabama, 2006" but neglected to put it in the archived title list.) This publication caused me some anxiety because a reasonable argument could be made that this is a serial and not a series of monographic publications. I finally decided that I would treat them as monographs, in order to preserve distinctive information about each, but if another one is published, they will have to be redone as a serial.
archived titles