AABC Toolkit: UNBC data migration

From AtoM wiki

Main Page > Development > Development/Projects > Development/Projects/AABC Data Migration Toolkit > UNBC

Note

This is historical development documentation relating to the AABC Data Migration Toolkit project (2010-2011). It has been migrated here from a now-defunct wiki for reference. The content in this page was first created in December of 2010, and last updated March 23, 2011. This content was migrated to the AtoM wiki in July of 2015. Links have been updated where possible, and some presentation elements may have been to account for style differences between the two wikis. Otherwise, the content is the same. For more information, see the project landing page: AABC Data Migration Toolkit.

Overview[edit]

Screenshot of final NBCA -> ICA-AtoM PDI Spoon migration job

UNBC Archives is the first participant to be migrated as part of the AABC Data Migration Toolkit project. The original project specification included migrating data from MS SQL Server, MS Access and ICA-AtoM (Release 1.0.9) MySQL databases as well as linking over 5,000 digital asset files (image and video).

Data migration from the MS SQL Server data was performed via the Pentaho Data Integration (PDI) Spoon (pictured) extract, transform, load tool - see Technical Tool Options for other options considered, and reasons for selecting PDI.

Due to the pilot nature of the project, there were a number of unforeseen technical challenges and the project ran over schedule.

Original sources for migration[edit]

One (1) MS SQL Server database backup file

  1. NBCA_backup_201004080030.bak

Four (4) MS Access databases

  1. Fawcett.mdb
  2. Ray Williston Collection.mdb
  3. Cassiar_Box1_on.mdb
  4. Cassiar-Box751onwards.mdb

One (1) archive (ZIP) file containing digital objects (image and video files)

  1. Uploads.zip

Data from one (1) ICA-AtoM Release 1.0.9 instance (set up in late 2010?), including description and digital assets

  1. http://www.ica-atom.org/unbc

Data migrated[edit]

The final migrated data includes

  • The NBCA MS SQL Server database data as detailed in the #NBCA database mappings section below
  • The digital files from the Uploads.zip archive
  • The ICA-AtoM Release 1.0.9 UNBC website (hosted by Artefactual) including digital assets

Data not migrated[edit]

MS Access databases[edit]

None of the four MS Access databases provided were migrated, for the following reasons

  • the Cassiar fonds files (Cassiar_Box1_on.mdb, Cassiar-Box751onwards.mdb) were never scheduled for migration
  • in the course of the project it was determined that the "Ray Williston Collection" data was present in the NBCA (MS SQL Server) database, so migration of the Ray Williston Collection.mdb database was not necessary
  • The Fawcett.mdb database will be migrated after development of a CSV import methodology

Accession data[edit]

ICA-AtoM Release 1.1 has no provision for storing accession data. The future ICA-AtoM Release 1.2 will include an accession module, and this data will be migrated at a future date.

Accession data is stored in several tables in the NBCA database, and these were not migrated:

  1. Accession - accession and appraisal data
  2. Individual - most of the data in this table is specific to appraisers and donors
  3. AppraiserSpecialty - links appraisers (Individual) to their area of specialty
  4. Receipt - accession receipts?

Other MS SQL Server tables not migrated[edit]

  1. SpecifiedLocation - locations are specified at the Fonds level, and are very general (e.g. "Photograph room", "Mobile shelving", "On display in library"). UNBC agreed that migrating this data was not necessary.
  2. Lookup - contained controlled vocabulary maintained in the Lookups table was not applicable in the ICA-AtoM application.

Project deliverables[edit]

  1. Migration of all source data to a single ICA-AtoM Release 1.1 website (hosted by Artefactual)
  2. Integration of linked digital assets
  3. This process documentation

Methodology[edit]

  1. Migrate ICA-AtoM Release 1.0.9 data (with attached digital assets) to ICA-AtoM Release 1.1 data schema.
  2. Install MS SQL Server 2005 Express and load the NBCA database from backup file provided
  3. Use the Pentaho Data Integration (PDI) Spoon tool to create a data migration script to migrate the NBCA database to the ICA-AtoM data model
  4. Attach digital assets from Uploads.zip archive to NBCA descriptions
  5. Refine the data migration script through several iterations
  6. Set up an install of ICA-AtoM for UNBC to audit the migrated data - including Release 1.0.9 data, NBCA database data, and all digital assets from both
  7. UNBC audits migrated data on the auditing site
  8. Correct any problems with the migration script found in the data audit
  9. Do the final migration of the data to the live UNBC site (hosted by Artefactual)

Extract[edit]

Restore UNBC MSSQL Backup[edit]

Restore database dialog
  • Download and install SQL Server 2005 Express Edition with Advanced Services SP3
  • Installion options
    • Components: Management Studio Express, Full-text search, Connectivity Components, Replication?
    • Mixed Authentication Mode (Windows and SQL Server)
    • Add user to the SQL admin role
  • Copy NBCA_backup_201004080030.bak to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\NBCA_backup_201004080030.bak
  • Run SQL Server Management Studio express
  • Right-click "Databases" > Restore
  • To database "NBCA" (or whatever).
  • From device (browse to NBCA_backup_201004080030.bak)
  • Click checkbox next to backup filename
  • OK

Setup MS SQL Server for remote connections[edit]

Create new SQL Server login[edit]

Create SQL Server login
  • In left pane of "Microsoft SQL Server Management Studio Express" click to expand "Security > Logins"
  • Right-click either "Login" folder or in Login pane and select "New Login..."
  • Enter username
  • Select "SQL Server authentication"
  • Enter password, and confirm
  • Uncheck "Enforce password expiration"
  • Default database "NBCA"
  • OK

Give read access to new login[edit]

Give read access to new login
  • Right-click newly created login, and select "Properties"
  • Click "User-Mappings" in left pane
  • Click checkbox next to "NBCA" database in top-right pane
  • Click "db_datareader" permission in lower-right pane
  • Click OK

Turn on TCP/IP access[edit]

Turn on TCP/IP access
  • "Start > Microsoft SQL Server 2005 > Configuration Tools > SQL Server Configuration Manager"
  • In left pane, click "SQL Server 2005 Network Configuration > Protocols for SQLEXPRESS"
  • Right-click "TCP/IP" in right pane, and click "Properties"
  • click "IP Addresses" tab
  • For desired TCP/IP connection select "Enabled: Yes", "TCP Dynamic Ports: <blank>", "TCP Port: 1433". OK (You will get a warning that server must be restarted for changes to effect) (See screenshot)
  • Right-click "TCP/IP" and click "Enable" (Warning again)
  • Click "SQL Server 2005 Services" in left pane
  • In right pane, right-click "SQL Server (SQLEXPRESS)" and click "Restart"

Troubleshooting TCP/IP config[edit]

In several instances the static port settings above did not work properly with SQL Server 2005. Fixing this issue required modification the following Windows registry keys to set the Port value to 1433.

  • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceName>\MSSQLServer\SuperSocketNetLib\Tcp
  • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<MSSQL.x>\MSSQLServer\SuperSocketNetLib\Tcp\IPAll

Connecting to MSSQL with PDI Spoon via JDBC[edit]

PDI Spoon JDBC connection dialog

PDI Spoon is a Java GUI for creating PDI Transformation scripts. We will use it to connect to MSSQL with the JDBC connector. Please read the Spoon documentation for general instructions on using the tool.

  • Add new "Input > Table input" step
  • Double-click the Table input step
  • Click the "New" connection button,
  • Select the MS SQL Server connector, and fill in the form like the screenshot below
  • Click "Test"
  • Click OK

Note

NOTE: The "MS SQL Server (Native)" connector did not work in our tests, so the vanilla "MS SQL Server" connector was used

Transform[edit]

NBCA database mappings[edit]

Table: Fond[edit]

Source column Dest. column Notes
Id keymap.source_id keymap table maps source keys to target keys
Accession note_i18n.content Store as RAD "Alpha-numeric designations" note
Title information_object_i18n.title
DerivedNameOne none calculated value based on Title. Not migrated
DerivedNameTwo none calculated value based on Title. Not migrated
BeginCirca event.start_date Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
EndCirca event.end_date Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.end_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
Restrictions information_object_i18n.access_conditions RAD label "Restrictions on access"
Copyright none All rows are blank in source table
SourceOfTitle note_i18n.content RAD note type: "Source of title proper"
TitleAndAccessNotes none All rows are blank in source table
PhysicalDescription information_object_i18n.extent_and_medium
Conservation none All rows are blank in source table
ScopeAndContent information_object_i18n.scope_and_content
Series none Series list will be derived from parent/child relationship
PhysicalDescriptionNotes none All rows are blank in source table
CustodialHistoryPrivate none ICA-AtoM doesn't currently have a column for storing "private" data. Migrate with accession data after Release 1.2
CustodialHistoryPublic information_object_i18n.archival_history The "archival_history" column is labeled "Custodial history" in the RAD template.
CreatorHistory none Source column contains a mix of: creator name, dates of existence and immediate source of acquisition. Normalized creator name is imported from UNBC edited csv file "creator.csv"
ProvenanceNotes none All rows are blank in source table
KeyWords term_i18n.name, actor_i18n.authorized_form_of_name Exported to csv file. UNBC manually normalized CSV data to name, place or subject "access points" using Google Refine. The modified csv file was then imported using PDI Spoon
BioAdminHistory actor_i18n.history
BioAdminHistNotes none All rows are blank in source table
LangaugeNotes none Only three rows in Fonds table have non-English languages, migrate manually
Record none One row contains "FOR-17800", migrate manually
Arrangement information_object_i18n.arrangement
FindAids information_object_i18n.finding_aids
GeneralNotes notes_i18n.content General note type
ArchivistId none Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb status.status_id false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy none Machine generated data that is not relevant to ICA-AtoM
CreatedOn object.created_at
ModifiedBy none Machine generated data that is not relevant to ICA-AtoM
ModifiedOn object.updated_at
Deleted none Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeries[edit]

Source column Dest. column Notes
Id keymap.source_id The keymap table maps source keys to target keys
Accession note_i18n.content Store as RAD "Alpha-numeric designations" note
Title information_object_i18n.title Trim trailing periods from title in transformation
DerivedNameOne none Calculated value based on Title. Not migrated
DerivedNameTwo none Calculated value based on Title. Not migrated
FondId information_object.parent_id Foreign key self-relation. Use keymap table to lookup parent (Fond) target key
BeginCirca event.start_date Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
EndCirca event.end_date Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
SourceOfTitle note_i18n.content RAD note type: "Source of title proper"
TitleAndAccessNotes none All rows are blank in source table
PhysicalDescription information_object_i18n.extent_and_medium
ScopeAndContent information_object_i18n.scope_and_content
PhysicalDescriptionNotes none All rows are blank in source table
BioAdminHistory none Field data is often inconsistent with intended use. Can't determine authoritative name for person or organization
KeyWords none All rows are blank in source table
CustodialHistoryPrivate none All rows are blank in source table
CustodialHistoryPublic information_object_i18n.archival_history The "archival_history" column is labeled "Custodial history" in the RAD template.
CreatorHistory none All rows are blank in source table
ProvenanceNotes none All rows are blank in source table
Record none 16 rows in database, all part of "Aleza Lake Research Forest Archival Records fonds". Migrate manually
Arrangement information_object_i18n.arrangement
GeneralNotes notes_i18n.content General note type
ArchivistId none Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb status.status_id false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy none Machine generated data that is not relevant to ICA-AtoM
CreatedOn object.created_at
ModifiedBy none Machine generated data that is not relevant to ICA-AtoM
ModifiedOn object.updated_at
Deleted none Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeriesFile[edit]

Source column Dest. column Notes
Id keymap.source_id The keymap table maps source keys to target keys
Accession note_i18n.content Store as RAD "Alpha-numeric designations" note
Title information_object_i18n.title Trim trailing periods from title in transformation
DerivedNameOne none Calculated value based on Title. Not migrated
DerivedNameTwo none Calculated value based on Title. Not migrated
FondSeriesId information_object.parent_id Foreign key self-relation. Use keymap table to lookup parent (FondSeries) target key
CreationCirca event.date Start date may be copied to event.start_date, if the format is recognized
SourceOfTitle note_i18n.content RAD note type: "Source of title proper"
Copyright none All rows are blank in source table
AccessRestrictions information_object_i18n.access_conditions RAD label "Restrictions on access"
TitleAndAccessNotes note_i18n.content Contains a mix of "Physical condition", "Cartographic scale", Physical location, previous Accession #'s and general notes. Migrated as "General Note"
GeneralMaterialDesignation none All rows are blank in source table
PhysicalDescription information_object_i18n.extent_and_medium
LocationId none All rows are blank in source table
ScopeAndContent information_object_i18n.scope_and_content
Conservation note_i18n.content "Conservation" note type
PhysicalDescriptionNotes none All rows are blank in source table
BioAdminHistory none Field data is mostly inconsistent with intended use. Can't algorithmically determine authoritative name for person or organization
KeyWords none All rows are blank in source table
CustodialHistoryPrivate none ICA-AtoM doesn't currently have a column for storing private data. Migrate with accession data after Release 1.2?
CustodialHistoryPublic none All rows are blank in source table
ProvenanceNotes none All rows are blank in source table
PhotographerOrArtist note_i18n.content Migrated as "General note"
Record none Appropriate RAD designation is unclear. Ran out of time to migrate (146 rows contain data)
Arrangement none All rows are blank in source table
GeneralNotes notes_i18n.content General note type
ArchivistId none Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb status.status_id false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy none Machine generated data that is not relevant to ICA-AtoM
CreatedOn object.created_at
ModifiedBy none Machine generated data that is not relevant to ICA-AtoM
ModifiedOn object.updated_at
Deleted none Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeriesItem[edit]

Source column Dest. column Notes
Id keymap.source_id The keymap table maps source keys to target keys
Accession note_i18n.content Store as RAD "Alpha-numeric designations" note
Title information_object_i18n.title Trim trailing periods from title in transformation
DerivedNameOne none Calculated value based on Title. Not migrated
DerivedNameTwo none Calculated value based on Title. Not migrated
FondSeriesFileId none All rows blank in source table
FondSeriesId information_object.parent_id Foreign key self-relation. Use keymap table to lookup parent (FondSeries) target key
CreationCirca event.date In addition, start date may be copied to event.start_date, if the format is recognized
SourceOfTitle note_i18n.content RAD note type: "Source of title proper"
Copyright none One value for "Bob Bossin’s Home Remedy for Nucelar War" Show. Migrate manually
AccessRestrictions information_object_i18n.access_conditions RAD label "Restrictions on access"
TitleAndAccessNotes note_i18n.content Contains a mix of "Source of title proper", "Terms governing use, reproduction and publication", "Attributions and conjectures" (?) and general notes. Migrated to "General notes"
GeneralMaterialDesignation object_term_relation.term_id Match closest available term in "General material designation" taxonomy
PhysicalDescription information_object_i18n.extent_and_medium
ScopeAndContent information_object_i18n.scope_and_content
Conservation note_i18n.content "Conservation" note type
PhysicalDescriptionNotes none All rows are blank in source table
BioAdminHistory none Field data is mostly inconsistent with intended use (mostly "Part of" designation). Can't algorithmically determine authoritative name for person or organization
KeyWords term_i18n.name, actor_i18n.authorized_form_of_name Exported to csv file. UNBC manually normalized CSV data to name, place or subject "access points" using Google Refine. The modified csv file was then imported using PDI Spoon
CustodialHistoryPrivate none ICA-AtoM doesn't currently have a column for storing private data. Migrate with accession data after Release 1.2?
CustodialHistoryPublic none All rows are blank in source table
ProvenanceNotes none All rows are blank in source table
PhotographerOrArtist note_i18n.content Migrated as "General note"
Record none Appropriate RAD designation is unclear. Ran out of time to migrate (3561 rows contain data)
Arrangement none All rows are blank in source table
GeneralNotes notes_i18n.content General note type
ArchivistId none Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb status.status_id false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy none Machine generated data that is not relevant to ICA-AtoM
CreatedOn object.created_at
ModifiedBy none Machine generated data that is not relevant to ICA-AtoM
ModifiedOn object.updated_at
Deleted none Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: Attachment[edit]

Source column Dest. column Notes
SourcePath digitalobject.path See #Digital asset import below for details
Remaining columns in this table were not migrated

Table: FondSeriesItemAttachement[edit]

Source column Dest. column Notes
AttachmentId digitalobject.information_object_id See #Digital asset import below for details
FileSeriesItemId information_object.id See #Digital asset import below for details
Remaining columns in this table were not migrated

ICA-AtoM Release 1.0.9 data[edit]

Data from the existing http://ica-atom.org/unbc database was migrated to ICA-AtoM Release 1.1 data schema via the standard ICA-AtoM upgrade process.

Other transform sources[edit]

AccessPointEntity.ktr PDI Spoon transform script creating actor (name) and term (subject, place) rows from access points
AccessPointRelation.ktr PDI Spoon transform script creating description->actor (name) and description->term (subject, place) relations from access points

Creator name[edit]

The NBCA.Fond table had no standard field for recording the creator of a Fonds, so the CreatorHistory and Keywords columns were used in an ad-hoc manner. Because neither the name format nor the column used was consistent, there was no automated way to extract the name of the Fonds' creator. The solution was to create a CSV export file from the NBCA Fond table and for the UNBC archivists to manually enter a creator name, where the creator was known.

The creator.csv file was created by export of the NBCA Fond table (columns: Id, Title, CreatorHistory and BioAdminHistory) and was then manually edited by the UNBC Archives staff. The edited creator.csv file is loaded into ICA-AtoM by the PDI Spoon Transformation "Creators.ktr".

Keywords[edit]

In the NBCA Fond and FondSeries source tables, the Keywords column contains a semi-colon separated list of strings. This data is defined via "Access Points" in the ICA-AtoM model. Because ICA-AtoM distinguishes between name, subject and place access points, and there is no automated way to categorize all of the original strings, this required a multi-part process:

  1. Export a comma-separated file containing all Keywords column data from the NBCA database (only present in Fond and FondSeries tables)
  2. Run automated normalization and data-cleaning using Google Refine (Artefactual)
  3. Manual data cleaning via Google Refine (UNBC)
  4. Import of cleaned and normalized CSV file via PDI Spoon

The complex data structure of ICA-AtoM further complicated the import process (step #4) as can been seen in the AccessPointEntity.ktr and AccessPointRelation.ktr screenshots.

Digital asset import[edit]

Digital assets locations were exported from the NBCA MS SQL Server database to a CSV file via PDI Spoon transformation script then the actual digital assets were imported and linked to a description via the digitalobject:load task. The transformation script also mapped the source id (FondSeriesItemAttachement.FondSeriesItemId) to the target key (information_object.id) via the keymap table, and did a string search & replace on the Attachment.SourcePath column to substitute the local path for each digital asset.


Load[edit]

Load existing ICA-AtoM install data[edit]

Data from the http://ica-atom.org/unbc site can be loaded via mysqldump backup file. This step must be performed before any other migrations because it destroys any existing data in the target database.

Load NBCA data[edit]

The PDI Spoon NBCA_Transform.kjb job writes data directly to the target ICA-AtoM MySQL database.

Load digital objects[edit]

Digital objects are loaded using the digitalobject:load task using the nbca_digitalobjects.csv file to link assets to archival descriptions.

Post-processing[edit]

Several final tasks must be performed on the final ICA-AtoM instance data which require direct access to the ICA-AtoM/symfony framework

  1. Build nested set - See: Build nested set task
  2. Generate permalink slugs - See: Generate slugs task
  3. Build search index