ArchivesCanada: June 2015 discussion on updates

From AtoM wiki
Revision as of 17:46, 9 September 2015 by Dan


This sub-page captures and makes publicly available a conversation about the ArchivesCanada project, and the subsequent updates from provincial and territorial contributors, that took place on the Arcan-L Canadian archival mailing list in June 2015.

The initial message, posted by Lara Wilson (Chair, CCA) about enhancements to AtoM that will be included in the ArchivesCanada launch, is available on the main ArchivesCanada project communications page.

Below are copies of the discussions that followed.

June 3 2015

Thanks for the update, Lara. I'm really interested to hear if the work on the new ArchivesCanada catalogue also includes a plan for regular updates or synchronization between AtoM catalogues. It's wonderful to hear that the work to upload from the regional networks has begun, but I am concerned about updates, duplication of data, etc. Here at Dalhousie, we just had to compromise and accept that we could only manage to load "top-level" descriptions in the new MemoryNS.ca catalogue maintained by the Council of Nova Scotia Archives. There are already discrepancies between the top-level descriptions in our institutional catalogue and MemoryNS. Trying to manually maintain 150,000+ descriptions in two catalogues is not sustainable.

We should all be very concerned about the possibility of institutions having three different published versions of their archival descriptions (local, regional, national). Is CCA planning on addressing this issue? What are other institutions doing to address duplication of their descriptions?

Thanks,

Creighton

Creighton Barrett

Digital Archivist

Dalhousie University Archives

June 8 2015

Dear Creighton and Listserv colleagues –

Thank you for your message, Creighton. We want to advise the community that work on the new ARCHIVESCANADA.ca catalogue will include a plan for regular updates; these updates will vary by region. Details have yet to be confirmed; however, CCA is currently planning a meeting of provincial and territorial councils to discuss new projects and related policy work, including ARCHIVESCANADA.ca.

Creighton, you have identified an important challenge in the synchronization of bulk exports from AtoM to AtoM installations (and other databases to AtoM installations). CCA and Artefactual are investigating options for development of this feature, and the ideal solutions will require additional funding and a not inconsiderable amount of programming. In the meantime, the process for updating record loads across installations will, admittedly, be cumbersome. Artefactual staff will follow up with a more detailed message regarding the options for development.

If institutions are interested in contributing to synchronization development work, please let us know. This is the type of collaboration that supports the open source environment, for everyone’s benefit.

Sincerely,

Lara

Lara Wilson

Chair / présidente

Canadian Council of Archives / Conseil canadien des archives

June 9 2015

Dear Lara,

Thanks for the update on this. Much appreciated! I'm glad to hear that this meeting is being organized and will be looking forward to hearing about development options from Artefactual. I strongly encourage the community to "dream big" and work to fund and develop what we all feel would be the ideal situation for synchronized updates between the many catalogues that now use AtoM. The long-term solution for updates from regional catalogues should also work for individual institutions who need to send updates to their regional catalogues, with or without digital objects.

Very best,

Creighton

June 15 2015

Hi Creighton,

There are definite improvements we could make to AtoM's current EAD, DC, and CSV import and export mechanisms for archival descriptions. We believe that any manual export and import workflow is going to be cumbersome over the long term for synchronizing data between multiple AtoM databases, as in the Dalhousie use case. We've had quite a bit of internal discussion at Artefactual and we believe the ideal long-term solution for AtoM data synchronization is an automated synchronization framework such as OAI-PMH <https://www.openarchives.org/pmh/> or ResourceSync <http://www.openarchives.org/rs/toc>.

OAI-PMH

AtoM has included limited OAI-PMH repository <https://www.accesstomemory.org/en/docs/2.1/user-manual/administer/settings/#oai-repository> functionality since version 1.0.5-beta <https://wiki.accesstomemory.org/Releases/Release_announcements/Release_1.0.5-beta>, released in March 2009, and this has been vastly improved in the 2.2 release. However, the current OAI-PMH functionality does not support OAI-PMH harvesting, which is the most obvious barrier to using OAI-PMH to syndicate data from one AtoM instance to another. We’ve found there are several other barriers with the current OAI-PMH implementation that prevent its practical use for AtoM synchronization.

A possible OAI-PMH development roadmap:

1. Add the ability to update an existing description via EAD import. Currently each EAD document imported creates a new archival description. Existing descriptions could be uniquely matched (universally) by using the URL of the description in the source system. Our recent experience with updating descriptions via CSV import has taught us that different update behaviours may be desired by different institutions, especially with regards to empty values in the import file. We would likely need to add some options on import regarding how updates are handled.

2. Create an EAD XML document when a description is updated (at write time) rather than creating it when the resource is requested (at read time). Because EAD documents often encompass an entire Fonds or Collection, it can take several minutes to create a complete EAD XML document. Since the EAD document may be read many times and the source data tends to change slowly or remain static, generating the document at read-time is inefficient. Because of the amount of processing time required to write large EAD documents, writing the EAD document would need to be run in the background using the AtoM background job scheduling functionality being introduced in AtoM 2.2.

3. Expose EAD XML metadata via the OAI-PMH repository module. Currently oai_dc <https://www.openarchives.org/OAI/2.0/oai_dc.xsd> is used for the payload of the OAI-PMH resource, which excludes a great deal of the descriptive metadata present in the ISAD(G), RAD, and DACS descriptive standards.

4. Add OAI-PMH harvesting functionality. An initial set of requirements could include the actual data import/update mechanism (which could leverage the existing import functionality of AtoM), as well as the ability to add a list of OAI-PMH repositories for harvesting, a way to schedule or trigger harvesting, and support for repository pagination.

5. Add selective harvesting by datestamp to the OAI-PMH repository and harvester modules. This would allow harvesting only the records that have been updated since the last harvest.
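
A minimal sketch of how items 4 and 5 might fit together on the harvester side, using only the Python standard library. It is not AtoM code: the function names and the example base URL are hypothetical. It relies on the standard OAI-PMH 2.0 request arguments (verb, metadataPrefix, from, resumptionToken) and shows how a harvester could build a ListRecords request limited by datestamp and then follow pagination through the resumption token.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it is sent with the
        # verb only, without metadataPrefix or from.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed on or after this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    headers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        headers.append((
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        ))
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An empty or missing resumptionToken means the list is complete.
    next_token = token.strip() if token and token.strip() else None
    return headers, next_token
```

A harvester would request the first URL, parse the response, and repeat with the returned token until it is None, storing the time of the harvest to use as from_date on the next run.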

ResourceSync

ResourceSync 1.0 <http://www.openarchives.org/rs/1.0/resourcesync> is a newer (April 2014) synchronization framework developed by the Open Archives Initiative. Cottage Lab's Meeting the OAI-PMH use case with ResourceSync <http://cottagelabs.com/news/meeting-the-oaipmh-use-case-with-resourcesync> provides a good summary of the differences between OAI-PMH and ResourceSync, but for me the defining difference is that OAI-PMH is a metadata synchronization format, whereas ResourceSync is a resource synchronization framework. ResourceSync is appealing because (a) it separates the discovery layer from the content layer and (b) it allows exposing different types of resources. The separation of discovery from content makes it easier to switch from using, for instance, DC XML to EAD XML as a metadata payload. Being resource type agnostic opens the potential to use ResourceSync for synchronizing other types of AtoM data, such as EAC-CPF XML for authority records, SKOS for controlled vocabularies, and digital objects.

If we went with ResourceSync as a synchronization framework, the development process would be identical to OAI-PMH for items #1 and #2. After that, the ResourceSync roadmap would diverge as listed below. Note that the ResourceSync terms “Source” and “Destination” are the equivalent of OAI-PMH’s “Repository” and “Harvester” respectively.

A possible ResourceSync development roadmap:

1. As OAI-PMH above

2. As OAI-PMH above

3. As ResourceSync doesn't include metadata in its response, it's not necessary to embed EAD in the response document as it is with OAI-PMH. The Source simply provides a list of EAD (Collection/Fonds) URLs available with some additional metadata about when the resource was last updated, whether it’s a copy from another system, etc. We would need to do work on the Destination side to loop through the resource list and retrieve and import the actual EAD XML documents.

4. ResourceSync has the additional requirement of writing a ResourceSync Source module for AtoM. One advantage to sticking with OAI-PMH is that the Repository module has already been built.

5. Add a ResourceSync Destination module. The initial set of requirements would be similar to the OAI-PMH Harvester module: Configure a set of ResourceSync Sources to track, add mechanisms to schedule or manually trigger synchronization, and support for pagination.

6. Like OAI-PMH item #5, add selective updates by datestamp to the ResourceSync Source and Destination modules to allow fetching only resources that have been updated since the last synchronization.
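
The Destination-side loop described in items 3 and 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not AtoM code: the function names are hypothetical, and every lastmod value is assumed to be a full UTC timestamp such as 2015-06-12T00:00:00Z. ResourceSync resource lists are built on the Sitemap XML format, so the sketch reads the loc and lastmod elements of each url entry and returns only the URLs that changed after the last synchronization.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# ResourceSync documents reuse the Sitemap namespace for url/loc/lastmod.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_utc(value):
    """Parse a timestamp like 2015-06-12T00:00:00Z (assumed format)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def resources_changed_since(resource_list_xml, last_sync):
    """Return the resource URLs modified after last_sync (hypothetical helper)."""
    cutoff = parse_utc(last_sync)
    root = ET.fromstring(resource_list_xml)
    changed = []
    for url in root.iterfind("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Without a lastmod we cannot tell, so fetch the resource to be safe.
        if lastmod is None or parse_utc(lastmod) > cutoff:
            changed.append(loc)
    return changed
```

The Destination would then retrieve each returned URL and pass the EAD XML document to the import mechanism; the discovery document itself never carries the metadata payload.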

We think either OAI-PMH or ResourceSync could meet the Dalhousie -> CNSA -> ArchivesCanada synchronization use case, but we see more potential in the ResourceSync implementation. We especially like the potential for synchronizing additional resources beyond archival description metadata. ResourceSync is also a more modern standard and incorporates more contemporary design and philosophy, such as a RESTful architecture and linked data principles.

For either the OAI-PMH or ResourceSync solution, there is some additional functionality that is worth considering, but isn’t necessary for a basic synchronization workflow.

The following optional enhancements could be applied to either OAI-PMH or ResourceSync:

- Support for marking deleted records in the Source system and propagating deletes downstream.

- Support for synchronization by set. This could be useful for limiting synchronization to descriptions from a specific archival institution within an AtoM instance.

- Pre-generation or caching of the synchronization metadata (e.g. Repository/Source identification, last update, number of resources, resource list) to prevent re-generating these XML documents on each request.
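
For the first enhancement, OAI-PMH 2.0 already defines a way to signal deletions: a record header may carry the attribute status="deleted". The sketch below assumes a repository that reports deletions this way and uses a hypothetical function name; it separates the identifiers a harvester should update from those it should remove downstream.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def split_updates_and_deletes(xml_text):
    """Return (updated_ids, deleted_ids) from an OAI-PMH list response."""
    root = ET.fromstring(xml_text)
    updated, deleted = [], []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        # A header marked status="deleted" means the record was withdrawn
        # at the source and should be removed (or flagged) downstream.
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            updated.append(identifier)
    return updated, deleted
```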

Next steps:

Either synchronization solution would require a significant amount of development, which in turn will require funding. Many of the enhancements could be tackled individually and would provide immediate benefits without requiring the full list to be undertaken at once. Both of the suggested roadmaps have been ordered so that earlier tasks stand on their own, and later tasks build on the earlier ones. The ordering could be adjusted based on community need though.

It would be useful to know which of the approaches and features the AtoM community might be interested in funding. If there is strong interest in particular features then we could work with the individual institutions and/or the CCA to flesh out requirements and start costing the work.

If neither of these synchronization options is possible with current funding streams, we can discuss other options. One possibility is to add the ability to update existing descriptions via EAD import (Item #1 on both roadmaps) and add a web GUI for bulk import and export of EAD descriptions via AtoM. This may ease the current synchronization pains experienced by Dalhousie and other institutions on a smaller budget.
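
To make the update behaviour in Item #1 concrete, here is a small sketch of matching on the source URL with a configurable policy for empty values. It works on plain dictionaries rather than EAD documents, and the field names, function names, and the "ignore"/"overwrite" option values are all hypothetical; it is not how AtoM's importer is implemented.

```python
def merge_description(existing, incoming, empty_values="ignore"):
    """Merge an incoming record into an existing one.

    empty_values="ignore"    keeps the existing value when the import is empty.
    empty_values="overwrite" lets an empty import value blank the field.
    """
    if empty_values not in ("ignore", "overwrite"):
        raise ValueError("empty_values must be 'ignore' or 'overwrite'")
    merged = dict(existing)
    for field, value in incoming.items():
        if value in ("", None) and empty_values == "ignore":
            continue
        merged[field] = value
    return merged


def import_descriptions(catalogue, records, empty_values="ignore"):
    """Update or create descriptions, matched by their source URL."""
    for record in records:
        key = record["source_url"]
        # An unknown URL creates a new description; a known one is updated.
        catalogue[key] = merge_description(
            catalogue.get(key, {}), record, empty_values
        )
    return catalogue
```

Importing the same file twice then updates the matched descriptions instead of creating duplicates, and the empty-value option lets each institution choose whether a blank field in the import file clears the stored value.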

We look forward to working with the AtoM user community to develop a long-term plan that will meet a broad set of use cases and ensure consistency among descriptions created and maintained by the Canadian archival community.

Regards,

David Juhasz

Director, AtoM Technical Services

Artefactual Systems Inc.