Revision as of 19:10, 30 July 2015

Main Page > Development > Development/Projects > Development/Projects/AABC Data Migration Toolkit > UNBC

Overview

Screenshot of final NBCA -> ICA-AtoM PDI Spoon migration job

UNBC Archives is the first participant to be migrated as part of the AABC Data Migration Toolkit project. The original project specification included migrating data from MS SQL Server, MS Access and ICA-AtoM (Release 1.0.9) MySQL databases as well as linking over 5,000 digital asset files (image and video).

Data migration from the MS SQL Server data was performed via the Pentaho Data Integration (PDI) Spoon (pictured) extract, transform, load tool - see Technical Tool Options for other options considered, and reasons for selecting PDI.

Due to the pilot nature of the project, there were a number of unforeseen technical challenges and the project ran over schedule.

Original sources for migration

One (1) MS SQL Server database backup file

NBCA_backup_201004080030.bak

Four (4) MS Access databases

Fawcett.mdb
Ray Williston Collection.mdb
Cassiar_Box1_on.mdb
Cassiar-Box751onwards.mdb

One (1) archive (ZIP) file containing digital objects (image and video files)

Uploads.zip

Data from one (1) ICA-AtoM Release 1.0.9 instance (set up in late 2010?), including description and digital assets

http://www.ica-atom.org/unbc

Data migrated

The final migrated data includes

The NBCA MS SQL Server database data as detailed in the #NBCA database mappings section below
The digital files from the Uploads.zip archive
The ICA-AtoM Release 1.0.9 UNBC website (hosted by Artefactual) including digital assets

Data not migrated

MS Access databases

None of the four MS Access databases provided were migrated, for the following reasons

the Cassiar fonds files (Cassiar_Box1_on.mdb, Cassiar-Box751onwards.mdb) were never scheduled for migration
in the course of the project it was determined that the "Ray Williston Collection" data was present in the NBCA (MS SQL Server) database, so migration of the Ray Williston Collection.mdb database was not necessary
The Fawcett.mdb database will be migrated after development of a CSV import methodology

Accession data

ICA-AtoM Release 1.1 has no provision for storing accession data. The future ICA-AtoM Release 1.2 will include an accession module, and this data will be migrated at a future date.

Accession data is stored in several tables in the NBCA database, and these were not migrated:

Accession - accession and appraisal data
Individual - most of the data in this table is specific to appraisers and donors
AppraiserSpecialty - links appraisers (Individual) to their area of specialty
Receipt - accession receipts?

Other MS SQL Server tables not migrated

SpecifiedLocation - locations are specified at the Fonds level, and are very general (e.g. "Photograph room", "Mobile shelving", "On display in library"). UNBC agreed that migrating this data was not necessary.
Lookup - contained controlled vocabulary maintained in the Lookups table was not applicable in the ICA-AtoM application.

Project deliverables

Migration of all source data to a single ICA-AtoM Release 1.1 website (hosted by Artefactual)
Integration of linked digital assets
This process documentation

Methodology

Migrate ICA-AtoM Release 1.0.9 data (with attached digital assets) to ICA-AtoM Release 1.1 data schema.
Install MS SQL Server 2005 Express and load the NBCA database from backup file provided
Use the Pentaho Data Integration (PDI) Spoon tool to create a data migration script to migrate the NBCA database to the ICA-AtoM data model
Attach digital assets from Uploads.zip archive to NBCA descriptions
Refine the data migration script through several iterations
Set up an install of ICA-AtoM for UNBC to audit the migrated data - including Release 1.0.9 data, NBCA database data, and all digital assets from both
UNBC audits migrated data on the auditing site
Correct any problems with the migration script found in the data audit
Do the final migration of the data to the live UNBC site (hosted by Artefactual)

Extract

Restore UNBC MSSQL Backup

Restore database dialog

Download and install SQL Server 2005 Express Edition with Advanced Services SP3
Installion options
- Components: Management Studio Express, Full-text search, Connectivity Components, Replication?
- Mixed Authentication Mode (Windows and SQL Server)
- Add user to the SQL admin role
Copy NBCA_backup_201004080030.bak to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Backup\NBCA_backup_201004080030.bak
Run SQL Server Management Studio express
Right-click "Databases" > Restore
To database "NBCA" (or whatever).
From device (browse to NBCA_backup_201004080030.bak)
Click checkbox next to backup filename
OK

Setup MS SQL Server for remote connections

Create new SQL Server login

Create SQL Server login

In left pane of "Microsoft SQL Server Management Studio Express" click to expand "Security > Logins"
Right-click either "Login" folder or in Login pane and select "New Login..."
Enter username
Select "SQL Server authentication"
Enter password, and confirm
Uncheck "Enforce password expiration"
Default database "NBCA"
OK

Give read access to new login

Right-click newly created login, and select "Properties"
Click "User-Mappings" in left pane
Click checkbox next to "NBCA" database in top-right pane
Click "db_datareader" permission in lower-right pane
Click OK

Turn on TCP/IP access

"Start > Microsoft SQL Server 2005 > Configuration Tools > SQL Server Configuration Manager"
In left pane, click "SQL Server 2005 Network Configuration > Protocols for SQLEXPRESS"
Right-click "TCP/IP" in right pane, and click "Properties"
click "IP Addresses" tab
For desired TCP/IP connection select "Enabled: Yes", "TCP Dynamic Ports: <blank>", "TCP Port: 1433". OK (You will get a warning that server must be restarted for changes to effect) (See screenshot)
Right-click "TCP/IP" and click "Enable" (Warning again)
Click "SQL Server 2005 Services" in left pane
In right pane, right-click "SQL Server (SQLEXPRESS)" and click "Restart"

Troubleshooting TCP/IP config

In several instances the static port settings above did not work properly with SQL Server 2005. Fixing this issue required modification the following Windows registry keys to set the Port value to 1433.

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceName>\MSSQLServer\SuperSocketNetLib\Tcp
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<MSSQL.x>\MSSQLServer\SuperSocketNetLib\Tcp\IPAll

Connecting to MSSQL with PDI Spoon via JDBC

PDI Spoon JDBC connection dialog

PDI Spoon is a Java GUI for creating PDI Transformation scripts. We will use it to connect to MSSQL with the JDBC connector. Please read the Spoon documentation for general instructions on using the tool.

Add new "Input > Table input" step
Double-click the Table input step
Click the "New" connection button,
Select the MS SQL Server connector, and fill in the form like the screenshot below
Click "Test"
Click OK

Note

NOTE: The "MS SQL Server (Native)" connector did not work in our tests, so the vanilla "MS SQL Server" connector was used

Transform

NBCA database mappings

Table: Fond

Source column	Dest. column	Notes
Id	keymap.source_id	keymap table maps source keys to target keys
Accession	note_i18n.content	Store as RAD "Alpha-numeric designations" note
Title	information_object_i18n.title
DerivedNameOne	none	calculated value based on Title. Not migrated
DerivedNameTwo	none	calculated value based on Title. Not migrated
BeginCirca	event.start_date	Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
EndCirca	event.end_date	Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.end_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
Restrictions	information_object_i18n.access_conditions	RAD label "Restrictions on access"
Copyright	none	All rows are blank in source table
SourceOfTitle	note_i18n.content	RAD note type: "Source of title proper"
TitleAndAccessNotes	none	All rows are blank in source table
PhysicalDescription	information_object_i18n.extent_and_medium
Conservation	none	All rows are blank in source table
ScopeAndContent	information_object_i18n.scope_and_content
Series	none	Series list will be derived from parent/child relationship
PhysicalDescriptionNotes	none	All rows are blank in source table
CustodialHistoryPrivate	none	ICA-AtoM doesn't currently have a column for storing "private" data. Migrate with accession data after Release 1.2
CustodialHistoryPublic	information_object_i18n.archival_history	The "archival_history" column is labeled "Custodial history" in the RAD template.
CreatorHistory	none	Source column contains a mix of: creator name, dates of existence and immediate source of acquisition. Normalized creator name is imported from UNBC edited csv file "creator.csv"
ProvenanceNotes	none	All rows are blank in source table
KeyWords	term_i18n.name, actor_i18n.authorized_form_of_name	Exported to csv file. UNBC manually normalized CSV data to name, place or subject "access points" using Google Refine. The modified csv file was then imported using PDI Spoon
BioAdminHistory	actor_i18n.history
BioAdminHistNotes	none	All rows are blank in source table
LangaugeNotes	none	Only three rows in Fonds table have non-English languages, migrate manually
Record	none	One row contains "FOR-17800", migrate manually
Arrangement	information_object_i18n.arrangement
FindAids	information_object_i18n.finding_aids
GeneralNotes	notes_i18n.content	General note type
ArchivistId	none	Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb	status.status_id	false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy	none	Machine generated data that is not relevant to ICA-AtoM
CreatedOn	object.created_at
ModifiedBy	none	Machine generated data that is not relevant to ICA-AtoM
ModifiedOn	object.updated_at
Deleted	none	Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeries

Source column	Dest. column	Notes
Id	keymap.source_id	The keymap table maps source keys to target keys
Accession	note_i18n.content	Store as RAD "Alpha-numeric designations" note
Title	information_object_i18n.title	Trim trailing periods from title in transformation
DerivedNameOne	none	Calculated value based on Title. Not migrated
DerivedNameTwo	none	Calculated value based on Title. Not migrated
FondId	information_object.parent_id	Foreign key self-relation. Use keymap table to lookup parent (Fond) target key
BeginCirca	event.start_date	Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
EndCirca	event.end_date	Parse regular dates (e.g. "22 February 1960", "Jul. 1983", "1922") to event.start_date (date type). Copy all (regular and irregular) dates to event_i18n.date (string type)
SourceOfTitle	note_i18n.content	RAD note type: "Source of title proper"
TitleAndAccessNotes	none	All rows are blank in source table
PhysicalDescription	information_object_i18n.extent_and_medium
ScopeAndContent	information_object_i18n.scope_and_content
PhysicalDescriptionNotes	none	All rows are blank in source table
BioAdminHistory	none	Field data is often inconsistent with intended use. Can't determine authoritative name for person or organization
KeyWords	none	All rows are blank in source table
CustodialHistoryPrivate	none	All rows are blank in source table
CustodialHistoryPublic	information_object_i18n.archival_history	The "archival_history" column is labeled "Custodial history" in the RAD template.
CreatorHistory	none	All rows are blank in source table
ProvenanceNotes	none	All rows are blank in source table
Record	none	16 rows in database, all part of "Aleza Lake Research Forest Archival Records fonds". Migrate manually
Arrangement	information_object_i18n.arrangement
GeneralNotes	notes_i18n.content	General note type
ArchivistId	none	Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb	status.status_id	false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy	none	Machine generated data that is not relevant to ICA-AtoM
CreatedOn	object.created_at
ModifiedBy	none	Machine generated data that is not relevant to ICA-AtoM
ModifiedOn	object.updated_at
Deleted	none	Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeriesFile

Source column	Dest. column	Notes
Id	keymap.source_id	The keymap table maps source keys to target keys
Accession	note_i18n.content	Store as RAD "Alpha-numeric designations" note
Title	information_object_i18n.title	Trim trailing periods from title in transformation
DerivedNameOne	none	Calculated value based on Title. Not migrated
DerivedNameTwo	none	Calculated value based on Title. Not migrated
FondSeriesId	information_object.parent_id	Foreign key self-relation. Use keymap table to lookup parent (FondSeries) target key
CreationCirca	event.date	Start date may be copied to event.start_date, if the format is recognized
SourceOfTitle	note_i18n.content	RAD note type: "Source of title proper"
Copyright	none	All rows are blank in source table
AccessRestrictions	information_object_i18n.access_conditions	RAD label "Restrictions on access"
TitleAndAccessNotes	note_i18n.content	Contains a mix of "Physical condition", "Cartographic scale", Physical location, previous Accession #'s and general notes. Migrated as "General Note"
GeneralMaterialDesignation	none	All rows are blank in source table
PhysicalDescription	information_object_i18n.extent_and_medium
LocationId	none	All rows are blank in source table
ScopeAndContent	information_object_i18n.scope_and_content
Conservation	note_i18n.content	"Conservation" note type
PhysicalDescriptionNotes	none	All rows are blank in source table
BioAdminHistory	none	Field data is mostly inconsistent with intended use. Can't algorithmically determine authoritative name for person or organization
KeyWords	none	All rows are blank in source table
CustodialHistoryPrivate	none	ICA-AtoM doesn't currently have a column for storing private data. Migrate with accession data after Release 1.2?
CustodialHistoryPublic	none	All rows are blank in source table
ProvenanceNotes	none	All rows are blank in source table
PhotographerOrArtist	note_i18n.content	Migrated as "General note"
Record	none	Appropriate RAD designation is unclear. Ran out of time to migrate (146 rows contain data)
Arrangement	none	All rows are blank in source table
GeneralNotes	notes_i18n.content	General note type
ArchivistId	none	Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb	status.status_id	false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy	none	Machine generated data that is not relevant to ICA-AtoM
CreatedOn	object.created_at
ModifiedBy	none	Machine generated data that is not relevant to ICA-AtoM
ModifiedOn	object.updated_at
Deleted	none	Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: FondSeriesItem

Source column	Dest. column	Notes
Id	keymap.source_id	The keymap table maps source keys to target keys
Accession	note_i18n.content	Store as RAD "Alpha-numeric designations" note
Title	information_object_i18n.title	Trim trailing periods from title in transformation
DerivedNameOne	none	Calculated value based on Title. Not migrated
DerivedNameTwo	none	Calculated value based on Title. Not migrated
FondSeriesFileId	none	All rows blank in source table
FondSeriesId	information_object.parent_id	Foreign key self-relation. Use keymap table to lookup parent (FondSeries) target key
CreationCirca	event.date	In addition, start date may be copied to event.start_date, if the format is recognized
SourceOfTitle	note_i18n.content	RAD note type: "Source of title proper"
Copyright	none	One value for "Bob Bossin’s Home Remedy for Nucelar War" Show. Migrate manually
AccessRestrictions	information_object_i18n.access_conditions	RAD label "Restrictions on access"
TitleAndAccessNotes	note_i18n.content	Contains a mix of "Source of title proper", "Terms governing use, reproduction and publication", "Attributions and conjectures" (?) and general notes. Migrated to "General notes"
GeneralMaterialDesignation	object_term_relation.term_id	Match closest available term in "General material designation" taxonomy
PhysicalDescription	information_object_i18n.extent_and_medium
ScopeAndContent	information_object_i18n.scope_and_content
Conservation	note_i18n.content	"Conservation" note type
PhysicalDescriptionNotes	none	All rows are blank in source table
BioAdminHistory	none	Field data is mostly inconsistent with intended use (mostly "Part of" designation). Can't algorithmically determine authoritative name for person or organization
KeyWords	term_i18n.name, actor_i18n.authorized_form_of_name	Exported to csv file. UNBC manually normalized CSV data to name, place or subject "access points" using Google Refine. The modified csv file was then imported using PDI Spoon
CustodialHistoryPrivate	none	ICA-AtoM doesn't currently have a column for storing private data. Migrate with accession data after Release 1.2?
CustodialHistoryPublic	none	All rows are blank in source table
ProvenanceNotes	none	All rows are blank in source table
PhotographerOrArtist	note_i18n.content	Migrated as "General note"
Record	none	Appropriate RAD designation is unclear. Ran out of time to migrate (3561 rows contain data)
Arrangement	none	All rows are blank in source table
GeneralNotes	notes_i18n.content	General note type
ArchivistId	none	Ran out of time to migrate the archivist name to "revision history"
ShowOnWeb	status.status_id	false = "Draft" (status_id: 159), true = "published" (status_id: 160)
CreatedBy	none	Machine generated data that is not relevant to ICA-AtoM
CreatedOn	object.created_at
ModifiedBy	none	Machine generated data that is not relevant to ICA-AtoM
ModifiedOn	object.updated_at
Deleted	none	Manually remove 'deleted' rows from Qubit. Can't do this during migration because of data integrity issues, some deleted rows have non-deleted children

Table: Attachment

Source column	Dest. column	Notes
SourcePath	digitalobject.path	See #Digital asset import below for details
Remaining columns in this table were not migrated

Table: FondSeriesItemAttachement

Source column	Dest. column	Notes
AttachmentId	digitalobject.information_object_id	See #Digital asset import below for details
FileSeriesItemId	information_object.id	See #Digital asset import below for details
Remaining columns in this table were not migrated

ICA-AtoM Release 1.0.9 data

Data from the existing http://ica-atom.org/unbc database was migrated to ICA-AtoM Release 1.1 data schema via the standard ICA-AtoM upgrade process.

Other transform sources

AccessPointEntity.ktr PDI Spoon transform script creating actor (name) and term (subject, place) rows from access points

AccessPointRelation.ktr PDI Spoon transform script creating description->actor (name) and description->term (subject, place) relations from access points

Creator name

The NBCA.Fond table had no standard field for recording the creator of a Fonds, so the CreatorHistory and Keywords columns were used in an ad-hoc manner. Because neither the name format nor the column used was consistent, there was no automated way to extract the name of the Fonds' creator. The solution was to create a CSV export file from the NBCA Fond table and for the UNBC archivists to manually enter a creator name, where the creator was known.

The creator.csv file was created by export of the NBCA Fond table (columns: Id, Title, CreatorHistory and BioAdminHistory) and was then manually edited by the UNBC Archives staff. The edited creator.csv file is loaded into ICA-AtoM by the PDI Spoon Transformation "Creators.ktr".

Keywords

In the NBCA Fond and FondSeries source tables, the Keywords column contains a semi-colon separated list of strings. This data is defined via "Access Points" in the ICA-AtoM model. Because ICA-AtoM distinguishes between name, subject and place access points, and there is no automated way to categorize all of the original strings, this required a multi-part process:

Export a comma-separated file containing all Keywords column data from the NBCA database (only present in Fond and FondSeries tables)
Run automated normalization and data-cleaning using Google Refine (Artefactual)
Manual data cleaning via Google Refine (UNBC)
Import of cleaned and normalized CSV file via PDI Spoon

The complex data structure of ICA-AtoM further complicated the import process (step #4) as can been seen in the AccessPointEntity.ktr and AccessPointRelation.ktr screenshots.

Digital asset import

Digital assets locations were exported from the NBCA MS SQL Server database to a CSV file via PDI Spoon transformation script then the actual digital assets were imported and linked to a description via the digitalobject:load task. The transformation script also mapped the source id (FondSeriesItemAttachement.FondSeriesItemId) to the target key (information_object.id) via the keymap table, and did a string search & replace on the Attachment.SourcePath column to substitute the local path for each digital asset.

Load

Load existing ICA-AtoM install data

Data from the http://ica-atom.org/unbc site can be loaded via mysqldump backup file. This step must be performed before any other migrations because it destroys any existing data in the target database.

Load NBCA data

The PDI Spoon NBCA_Transform.kjb job writes data directly to the target ICA-AtoM MySQL database.

Load digital objects

Digital objects are loaded using the digitalobject:load task using the nbca_digitalobjects.csv file to link assets to archival descriptions.

Post-processing

Several final tasks must be performed on the final ICA-AtoM instance data which require direct access to the ICA-AtoM/symfony framework

Build nested set - See: Build nested set task
Generate permalink slugs - See: Generate slugs task
Build search index

@@ Line 547: / Line 547: @@
 == Load digital objects ==
-Digital objects are loaded using the [[AABC_Data_Migration_Toolkit#Digital_object_load_task|digitalobject:load]] task using the [[#Digital assets|nbca_digitalobjects.csv]] file to link assets to archival descriptions.
+Digital objects are loaded using the [[Development/Projects/AABC Data Migration Toolkit#Digital_object_load_task|digitalobject:load]] task using the [[#Digital assets|nbca_digitalobjects.csv]] file to link assets to archival descriptions.
 = Post-processing =

Difference between revisions of "Development/Projects/AABC Data Migration Toolkit/UNBC"

Revision as of 19:10, 30 July 2015

Contents

Overview

Original sources for migration

Data migrated