ArchivesCanada: June 2015 discussion on updates

From AtoM wiki

This sub-page captures and makes publicly available a conversation about the ArchivesCanada project, and subsequent updates from provincial and territorial contributors, that took place on the Arcan-L Canadian archival mailing list in June 2015.

The initial message, posted by Lara Wilson (Chair, CCA) about enhancements to AtoM that will be included in the ArchivesCanada launch, is available on the main ArchivesCanada project communications page.

Below are copies of the discussions that followed.

June 3 2015

Thanks for the update, Lara. I'm really interested to hear if the work on the new ArchivesCanada catalogue also includes a plan for regular updates or synchronization between AtoM catalogues. It's wonderful to hear that the work to upload from the regional networks has begun, but I am concerned about updates, duplication of data, etc. Here at Dalhousie, we just had to compromise and accept that we could only manage to load "top-level" descriptions in the new MemoryNS.ca catalogue maintained by the Council of Nova Scotia Archives. There are already discrepancies between the top-level descriptions in our institutional catalogue and MemoryNS. Trying to manually maintain 150,000+ descriptions in two catalogues is not sustainable.

We should all be very concerned about the possibility of institutions having three different published versions of their archival descriptions (local, regional, national). Is CCA planning on addressing this issue? What are other institutions doing to address duplication of their descriptions?

Thanks,

Creighton Barrett

Digital Archivist

Dalhousie University Archives

June 8 2015

Dear Creighton and Listserv colleagues –

Thank you for your message, Creighton. We want to advise the community that work on the new ARCHIVESCANADA.ca catalogue will include a plan for regular updates; these updates will vary by region. Details have yet to be confirmed; however, CCA is currently planning a meeting of provincial and territorial councils to discuss new projects and related policy work, including ARCHIVESCANADA.ca.

Creighton, you have identified an important challenge in the synchronization of bulk exports from AtoM to AtoM installations (and other databases to AtoM installations). CCA and Artefactual are investigating options for development of this feature, and the most ideal solutions will require additional funding and a not inconsiderable amount of programming. In the meantime, the process for updating record loads across installations will, admittedly, be cumbersome. Artefactual staff will follow up with a more detailed message regarding the options for development.

If institutions are interested in contributing to synchronization development work, please let us know. This is the type of collaboration that supports the open source environment, for everyone’s benefit.

Sincerely,

Lara Wilson

Chair / présidente

Canadian Council of Archives / Conseil canadien des archives

June 9 2015

Dear Lara,

Thanks for the update on this. Much appreciated! I'm glad to hear that this meeting is being organized and will be looking forward to hearing about development options from Artefactual. I strongly encourage the community to "dream big" and work to fund and develop what we all feel would be the ideal situation for synchronized updates between the many catalogues that now use AtoM. The long-term solution for updates from regional catalogues should also work for individual institutions who need to send updates to their regional catalogues, with or without digital objects.

Very best,

Creighton

June 15 2015

Hi Creighton,

There are definite improvements we could make to AtoM to improve the current EAD, DC, and CSV import and export mechanisms for archival descriptions. We believe that any manual export and import workflow is going to be cumbersome over the long term for synchronizing data between multiple AtoM databases, as in the Dalhousie use case. We've had quite a bit of internal discussion at Artefactual and we believe the ideal long-term solution for AtoM data synchronization is an automated synchronization framework such as OAI-PMH <https://www.openarchives.org/pmh/> or ResourceSync <http://www.openarchives.org/rs/toc>.

OAI-PMH

AtoM has included limited OAI-PMH repository <https://www.accesstomemory.org/en/docs/2.1/user-manual/administer/settings/#oai-repository> functionality since version 1.0.5-beta <https://wiki.accesstomemory.org/Releases/Release_announcements/Release_1.0.5-beta>, released in March 2009, and this has been vastly improved in the 2.2 release. However, the current OAI-PMH functionality does not support OAI-PMH harvesting, which is the most obvious barrier to using OAI-PMH to syndicate data from one AtoM instance to another. We’ve found there are several other barriers with the current OAI-PMH implementation that prevent its practical use for AtoM synchronization.

A possible OAI-PMH development roadmap:

1. Add the ability to update an existing description via EAD import. Currently each EAD document imported creates a new archival description. Existing descriptions could be uniquely matched (universally) by using the URL of the description in the source system. Our recent experience with updating descriptions via CSV import has taught us that different update behaviours may be desired by different institutions, especially with regards to empty values in the import file. We would likely need to add some options on import regarding how updates are handled.

2. Create an EAD XML document when a description is updated (at write time) rather than creating it when the resource is requested (at read time). Because EAD documents often encompass an entire Fonds or Collection, it can take several minutes to create a complete EAD XML document. Since the EAD document may be read many times and the source data tends to change slowly or remain static, generating the document at read-time is inefficient. Because of the amount of processing time required to write large EAD documents, writing the EAD document would need to be run in the background using the AtoM background job scheduling functionality being introduced in AtoM 2.2.

3. Expose EAD XML metadata via the OAI-PMH repository module. Currently oai_dc <https://www.openarchives.org/OAI/2.0/oai_dc.xsd> is used for the payload of the OAI-PMH resource, which excludes a great deal of the descriptive metadata present in the ISAD(G), RAD, and DACS descriptive standards.

4. Add OAI-PMH harvesting functionality. An initial set of requirements could include the actual data import/update mechanism (which could leverage the existing import functionality of AtoM), as well as the ability to add a list of OAI-PMH repositories for harvesting, a way to schedule or trigger harvesting, and support for repository pagination.

5. Add selective harvesting by datestamp to the OAI-PMH repository and harvester modules. This would allow harvesting only the records that have been updated since last harvest.
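Items 4 and 5 of the roadmap above can be illustrated with a minimal harvester-side sketch. It uses only request arguments and response elements defined by the OAI-PMH 2.0 protocol (`verb`, `metadataPrefix`, `from`, `set`, `resumptionToken`, and the `status="deleted"` header attribute). The base URL, the identifiers, and the sample response are made up for illustration, and nothing here reflects how AtoM itself would implement harvesting.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None,
                           resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # The protocol makes resumptionToken an exclusive argument:
        # it may only be combined with the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
        if set_spec:
            params["set"] = set_spec    # selective harvesting by set
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record headers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "deleted": header.get("status") == "deleted",
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return records, (token.strip() if token and token.strip() else None)


# Hypothetical response page, used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:atom.example.org:fonds-1</identifier>
        <datestamp>2015-06-12</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:atom.example.org:fonds-2</identifier>
        <datestamp>2015-06-14</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would request the first page with the datestamp of its last successful harvest as `from_date`, then keep requesting pages with the returned token until none is returned.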

ResourceSync

ResourceSync 1.0 <http://www.openarchives.org/rs/1.0/resourcesync> is a newer (April 2014) synchronization framework developed by the Open Archives Initiative. Cottage Lab's Meeting the OAI-PMH use case with ResourceSync <http://cottagelabs.com/news/meeting-the-oaipmh-use-case-with-resourcesync> provides a good summary of the differences between OAI-PMH and ResourceSync, but for me the defining difference is that OAI-PMH is a metadata synchronization format, whereas ResourceSync is a resource synchronization framework. ResourceSync is appealing because (a) it separates the discovery layer from the content layer and (b) it allows exposing different types of resources. The separation of discovery from content makes it easier to switch from using, for instance, DC XML to EAD XML as a metadata payload. Being resource type agnostic opens the potential to use ResourceSync for synchronizing other types of AtoM data, such as EAC-CPF XML for authority records, SKOS for controlled vocabularies, and digital objects.

If we went with ResourceSync as a synchronization framework, the development process would be identical to OAI-PMH for items #1 and #2. After that, the ResourceSync roadmap would diverge as listed below. Note that the ResourceSync terms “Source” and “Destination” are the equivalent of OAI-PMH’s “Repository” and “Harvester” respectively.

A possible ResourceSync development roadmap:

1. As OAI-PMH above

2. As OAI-PMH above

3. As ResourceSync doesn't include metadata in its response, it's not necessary to embed EAD in the response document as it is with OAI-PMH. The Source simply provides a list of EAD (Collection/Fonds) URLs available with some additional metadata about when the resource was last updated, whether it’s a copy from another system, etc. We would need to do work on the Destination side to loop through the resource list and retrieve and import the actual EAD XML documents.

4. ResourceSync has the additional requirement of writing a ResourceSync Source module for AtoM. One advantage to sticking with OAI-PMH is that the Repository module has already been built.

5. Add a ResourceSync Destination module. The initial set of requirements would be similar to the OAI-PMH Harvester module: Configure a set of ResourceSync Sources to track, add mechanisms to schedule or manually trigger synchronization, and support for pagination.

6. Like OAI-PMH item no. 5, add selective updates by datestamp to the ResourceSync Source and Destination modules to allow fetching only resources that have been updated since the last synchronization.
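The Destination-side loop described in items 3 and 6 above might look roughly like the following sketch. ResourceSync Resource Lists are based on the Sitemap XML format, so the sketch reads `<url>`, `<loc>`, and `<lastmod>` elements and keeps only the resources modified since the last synchronization. The sample document and its URLs are invented for illustration, and retrieving and importing each EAD document is left out.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_w3c_datetime(value):
    """Parse a W3C datetime; values without a timezone are assumed to be UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def resources_changed_since(resource_list_xml, last_sync):
    """Return the resource URLs whose lastmod is later than last_sync."""
    root = ET.fromstring(resource_list_xml)
    changed = []
    for url in root.iterfind("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # A resource without lastmod is fetched to be safe.
        if lastmod is None or parse_w3c_datetime(lastmod) > last_sync:
            changed.append(loc)
    return changed


# Hypothetical Resource List published by a Source; the URLs are placeholders.
SAMPLE_RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2015-06-16T00:00:00Z"/>
  <url>
    <loc>https://atom.example.org/fonds-1.ead.xml</loc>
    <lastmod>2015-06-01T09:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://atom.example.org/fonds-2.ead.xml</loc>
    <lastmod>2015-06-15T09:00:00Z</lastmod>
  </url>
</urlset>"""
```

Because the list only points at resources, the same loop would work unchanged if the URLs referred to EAC-CPF or SKOS documents instead of EAD.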

We think either OAI-PMH or ResourceSync could meet the Dalhousie -> CNSA -> ArchivesCanada synchronization use case, but we see more potential in the ResourceSync implementation. We especially like the potential for synchronizing additional resources beyond archival description metadata. ResourceSync is also a more modern standard and incorporates more contemporary design and philosophy, such as a RESTful architecture and linked data principles.

For either the OAI-PMH or ResourceSync solution, there is some additional functionality that is worth considering, but isn’t necessary for a basic synchronization workflow. The following optional enhancements could be applied to either OAI-PMH or ResourceSync:

  • Support for marking deleted records in the Source system and propagating deletes downstream.
  • Support for synchronization by set. This could be useful for limiting synchronization to descriptions from a specific archival institution within an AtoM instance.
  • Pre-generation or caching of the synchronization metadata (e.g. Repository/Source identification, last update, number of resources, resource list) to prevent re-generating these XML documents on each request.

Next steps:

Either synchronization solution would require a significant amount of development, which in turn will require funding. Many of the enhancements could be tackled individually and would provide immediate benefits without requiring the full list to be undertaken at once. Both of the suggested roadmaps have been ordered so that earlier tasks stand on their own, and later tasks build on the earlier ones. The ordering could be adjusted based on community need though.

It would be useful to know which of the approaches and features the AtoM community might be interested in funding. If there is strong interest in particular features then we could work with the individual institutions and/or the CCA to flesh out requirements and start costing the work.

If neither of these synchronization options is possible with current funding streams, we can discuss other options. One possibility is to add the ability to update existing descriptions via EAD import (Item 1 on both roadmaps) and add a web GUI for bulk import and export of EAD descriptions via AtoM. This may ease the current synchronization pains experienced by Dalhousie and other institutions on a smaller budget.
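The update-on-import behaviour in Item 1 can be sketched as follows: descriptions are matched on the URL of the description in the source system, and an option controls what happens when the import file contains an empty value. This is an illustrative model using plain dictionaries, not AtoM code; the field names, the `source_url` key, and the `empty_policy` option are all hypothetical.

```python
def merge_description(existing, incoming, empty_policy="keep"):
    """Merge an incoming description into an existing one.

    empty_policy="keep":  empty incoming values leave existing data untouched.
    empty_policy="clear": empty incoming values overwrite existing data.
    """
    if empty_policy not in ("keep", "clear"):
        raise ValueError("empty_policy must be 'keep' or 'clear'")
    merged = dict(existing)
    for field, value in incoming.items():
        is_empty = value is None or str(value).strip() == ""
        if is_empty and empty_policy == "keep":
            continue
        merged[field] = value
    return merged


def import_descriptions(catalogue, incoming_records, empty_policy="keep"):
    """Update or create descriptions, matching on the source system URL.

    `catalogue` is a dict keyed by source URL (a stand-in for the database).
    """
    for record in incoming_records:
        key = record["source_url"]
        if key in catalogue:
            catalogue[key] = merge_description(catalogue[key], record, empty_policy)
        else:
            catalogue[key] = dict(record)
    return catalogue
```

With the "keep" policy, a contributor can send a partial record without wiping fields that were filled in downstream; with "clear", the source system is treated as authoritative for every field it sends.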

We look forward to working with the AtoM user community to develop a long-term plan that will meet a broad set of use cases and ensure consistency among descriptions created and maintained by the Canadian archival community.

Regards,

David Juhasz

Director, AtoM Technical Services

Artefactual Systems Inc.

June 19 2015

Hi David,

Thanks so much for this detailed explanation of the development options. I really appreciate how you have broken out the roadmap for each approach. I don't know much about ResourceSync, but it does sound like an attractive option given the potential for synchronizing other types of data. Either way, it's wonderful to see a list of development tasks that "stand alone" but also relate to each other.

One idea that emerged briefly at TAATU was whether the national catalogue could be a federated search engine, a discovery tool that indexes data from multiple sources using something like Blacklight or the emerging ArcLight application. Another option would be to consider something like XTF, which would allow institutions to send bulk EAD files on a periodic basis and dump them into a central repository. Neither option would lead to the kind of improvements in AtoM that you have outlined below, nor would they lead to a national catalogue based on AtoM, but they speak to a philosophical question of how we want the national catalogue to function and what we want it to do. It's hard to say what the best options are without knowing what resources are available from CCA, where this fits in terms of institutional and regional council priorities, what the timeframe is, etc.

Anyway, this is a very helpful message and I appreciate the thought you put into laying out some of our options. I'm curious to hear what others think about these roadmaps.

Cheers,

Creighton

June 25 2015

Hi Creighton,

Thanks for continuing this thread, and for sharing some alternative possibilities. We’d love to hear more about exactly what you’re envisioning with a Blacklight/ArcLight solution. Are you looking at something like DPLA [1] or Europeana [2] as models - where items are available for searching, with links out to the source systems?

Of course we at Artefactual have a bias towards AtoM as a solution for ArchivesCanada, but it’s always worthwhile to consider other options. As you may know, Artefactual has been involved in advocacy and development work to encourage Blacklight - and subsequently ArcLight - to be able to work with both Elasticsearch and Solr. We invest a lot of energy in encouraging the open source community working in the cultural heritage sector to consider how generalizing solutions allows for greater integration, reuse, and collaboration among existing projects. Elasticsearch integration with Blacklight would not necessarily be required for the project you’re envisioning, but it’s one way we’re already thinking about long-term integration potentials. Our interest in Blacklight up to this point has mostly been with an eye to integrating it with Archivematica to improve Archivematica’s search functionality.

We would recommend a modern, vital project like Blacklight as opposed to something like XTF, which hasn’t seen a release since 2012 [3]. The ArcLight design and process looks very promising, but it is still in the planning stages and development is not scheduled to start until 2016. It’s difficult to speculate right now about how suitable ArcLight will be for ArchivesCanada and what development might be required when it is released. Getting involved in the ArcLight discussion list and planning process may be a good way to encourage developing the features that would be useful in a federated portal context.

Blacklight seems like the best candidate of the options you’ve suggested, and we don’t have any strong alternatives to recommend. Blacklight can index EAD documents via a Solr parsing library developed by a Blacklight community member [4]. There are some nice elements already built into Blacklight’s user interface, and the ability to create virtual exhibits via the Spotlight plugin is appealing [5]. Updating a static collection of EAD documents (e.g. delete and replace with a newer version when necessary) may make updates more straightforward than trying to synchronize records in AtoM. This proposed solution also avoids having many modules in ArchivesCanada that are present in AtoM but would not be needed, such as taxonomy management, multiple repositories, etc.

In that case, a possible development path might look like:

1. Ensure that Blacklight can index EAD 2002 XML documents created by AtoM.

2. Update the Blacklight parser and Solr index schema to support archives-specific metadata and facets.

3. Develop a RAD-based display template for Blacklight.

4. Develop an ArchivesCanada theme for Blacklight.

5. [Optional] Create an AtoM web interface for doing a bulk export of EAD documents. This would require integration with the job scheduler to do this export in the background and prevent browser timeouts.

However, even this alternative has some potential issues worth mentioning.

First, the CCA is also seeking to replace its current Directory of Archives [6], and instead of maintaining a separate database, the CCA chose to invest some advanced repository search development into AtoM 2.3, so that the new ArchivesCanada site can double as the Directory of Archives. With a Blacklight based approach, the CCA would then have to create and maintain a new alternative system for this information.

One issue I have with something like a Blacklight implementation is the flat view it gives public users - it removes all context until the user follows the link back to the system of record. Sites like Europeana and DPLA have taken this approach largely because they have mixed holdings, from libraries, museums, galleries, and archives - while ArchivesCanada would be archives specific. Additionally, Blacklight does not seem to have browse capabilities built in - and I find that alienating to users, asking “what do you want?” instead of welcoming them in. So that might be an additional development task - a way for users to browse holdings. ArcLight may solve some of these issues, but it’s too early to tell.

Another question we’d need to consider would be how multilingual content would be handled in a Blacklight interface. I’m not sure if there are plugins that could manage this, or if further development would be required, but for a national archival portal, users should be able to view the interface in either French or English, and content in either language should be supported.

A document-based search index like Blacklight doesn’t get around the need for institutions to manually export their archival data and send it to both provincial and national portals. Maybe this is not as much of an issue as it is with AtoM, as there is no need to update existing records. Institutions can export their whole collection each time and the resulting batch EAD documents can be dropped into Blacklight as a whole. This does raise the issue of performance on both the AtoM export and ArchivesCanada side, so performance optimization would be helpful here.

Finally, it’s worth pointing out that the development required to implement this alternative is not insignificant, so at some point we need to consider which path will most benefit the Canadian archival community in the long run. However, it’s exactly this kind of dialogue that we value so much - seeing the Canadian archival community strategize together, dream big, and offer solutions that we might never have considered in isolation. Thanks for that.

Regards,

Dan Gillean, MAS, MLIS

AtoM Product Manager / Systems Analyst,

Artefactual Systems, Inc

[1] http://dp.la/
[2] http://www.europeana.eu/portal/
[3] http://xtf.cdlib.org/documentation/changelog/#3.1
[4] https://github.com/awead/solr_ead
[5] https://github.com/sul-dlss/spotlight
[6] http://www.cdncouncilarchives.ca/directory_adv.html

July 3 2015

Hi Dan,

Thanks so much for your thorough explanation of the pros and cons here. To answer your first question, yes, I was thinking something like DPLA or Europeana where items are available for searching and search results link out to other systems.

I think there are probably ways around the flat display you describe. The Rock and Roll Hall of Fame, for example, has an implementation of Blacklight where archival items are shown with the collection name and even the hierarchy. Search "Hendrix" and filter by "Archival item": http://catalog.rockhall.com/catalog?f[format_sim][]=Archival+Item&q=hendrix&search_field=all_fields

Is that what you mean? It is one of the best implementations of Blacklight I've seen, a great example for anyone who wants to see what that system is capable of.

You do raise some important questions about multilingual capacity and the development required for such a catalogue. It would be a huge effort, for sure. And maybe we're too far into this to look at our options, in which case it's a moot point. I suppose the problem I have, speaking for myself here, is that the community hasn't actually had an open dialogue about the best long-term solution for the national catalogue or where the resources are going. I think AtoM could make a great national catalogue, but it's not usually a good idea to start with the tool and work backwards. We are now about to create what could become a metadata boondoggle because we don't have update and migration plans in place. I also find it troubling to hear that long-term strategies for the catalogue now cannot be addressed without also looking at the directory of archives. Without being privy to the details of the project, it seems that it would have been much better for CCA to invest in one of the AtoM development road maps you mentioned previously (OAI or API).

Nevertheless, I am encouraged to hear that things now seem to be moving again and applaud all the work that has gone into updating the national catalogue. I look forward to hearing more from the CCA as the project progresses!

Best regards,

Creighton