AABC Toolkit: City of Vancouver data migration

Main Page > Development > Development/Projects > Development/Projects/AABC Data Migration Toolkit > CVA

Data sources

Content server data

approx. 125,000 Private Records (from private donors)
approx. 87,000 Public Records (City created documents)
approx. 9,700 Maps and Pamphlets
approx. 3,350 Public documents (published materials issued by the City)
812 Accession records
224 Authority records (for Public Records only)
195 Authority records relationships
88 Ephemera records (private sources)

Other sources

??? ICA-AtoM 1.1 records
approx. 70,000 digital objects (TIFF)

Process

For Content Server data:

Exported data to CSV from Content Server,
Transformed data to match ICA-AtoM RAD CSV template via "Pentaho Data Integration" (PDI) tool,
imported into ICA-AtoM via CSV import command line (CLI) script.

For ICA-AtoM 1.1 data:

Migrate and import via standard ICA-AtoM upgrade procedure.

For digital objects:

Processed RAW TIFFs via Archivematica to produce DIPs (expand?),
Then imported using CLI DIP import script

Importing

Import of the private records initially took 1 second per record on a conventional workstation, but an import speed of 0.12 seconds per record, initially, was achieved using a ~15GB RAM Rackspace Cloud instance (~$0.90/hour?) with an Ubuntu 11.04 LAMP stack. Import would slow down as it went, however, so the CSV dump was split into 10000 row chunks. The import slows down as it goes, but we were able to import 125K rows in ~12 hours after using a patch to disable nested set (hierarchy) updating. Post-import search indexing is very slow still.

Following are the steps to do the import (using Gnu Screen to keep the session alive).

SSH to import server
Download CSV data to /root
Enter apt-get update
Enter apt-get install -y tasksel subversion screen
Enter tasksel install lamp-server
Enter /usr/sbin/apache2ctl restart
Enter cd /var/www
Check out qubit Subversion trunk to qubit
Enter chown -R www-data:www-data qubit
Create qubit MySQL database
Visit Qubit with browser to complete installation
Enter cd qubit
Enter screen
Enter ./symfony csv:import path/to/first/import/file --rows-until-update=5 --error-log=/root/errors.log --source-name=source_name
In the import command, "source_name" is used when recording mapping of legacy IDs to new Qubit IDs

Setup after initial import.

Update nested set data: ./symfony propel:build-nested-set (took 3.33 minutes for 125K imported objects)

Challenges

Private Records CSV export included 100+ columns, and over 1 million data points.
Google Refine could not handle this much data, and could not be used for data transformation or normalization

Data mapping

LEGEND
"B" - [mostly] blank (haven't checked all records)
"T" - human readable text
"R" - machine readable (i.e. populated from a value list)
Problem mapping
Requires pre-processing
Do not migrate

Private record data mapping

Original CVA column	RAD CSV template column	Qubit column	Notes
ACCESSION NUMBER (B)	accessionNumber	accession.identifier (via `relation`)	Add an accession record related via `relation` table (many-to-many)
ADMIN HIST OR BIO (T)	creatorsHistory	actor_i18n.history, event.type_id = QubitTerm::CREATION_ID	Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). The event.type_id column is required and is defined in lib/model/QubitTerm.php as a class constant (i.e. CREATION_ID = 111). The related `actor` row needs to be the same one used for Creators (actor_i18n.authorized_form_of_name) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date).
ARCH ADDRESS (B)	scopeAndContent	information_object_i18n.scope_and_content	This column will be merged into the "Scope and content" column before import (See "Migration Instructions - CS Records to ICA AtoM - Public Records Export Set.XLS")
ARCHIVIST (RT)	revision	information_object_i18n.revision	Prefix with "Created by "
ARTIST (B)	nameAccessPoints	actor_i18n.authorized_form_of_name (relation.type_id = QubitTerm::NAME_ACCESS_POINT_ID)	Name access point is related to information object via many-to-many "relation" table (relation.subject_id = information_object.id, relation.object_id = actor.id). Need to test if an actor already exists with the specified name, if actor name DOES exist, then just create a new relation between actor and information object, if actor DOES NOT exist, then create a new actor row with actor_i18n.authorized_form_of_name equal to column value, then link via relation.
ASSOCIATED MATERIAL (B)	relatedUnitsOfDescription	information_object_i18n.related_units_of_description
CONSERVATION NOTE (B)	radNoteConservation	note_i18n.content, note.type_id = ???	Each "Rad note" type is a separate row in the "note" table, related to information object by (one-to-many) fk note.information_object_id. Getting the note.type_id (required) is tricky, you'll have to search like `SELECT term.id FROM term, term_i18n i18n WHERE term.id = i18n.id and term.taxonomy_id = 51 and i18n.culture = 'en' and i18n.name = 'Conservation';` (N.B. QubitTaxonomy::RAD_NOTE_ID = 51)
COPYRIGHT EXPIRES (B)	copyrightExpires	rights.end_date	Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project.
COPYRIGHT STATUS (B)	copyrightHolder	rights_holder.authorized_form_of_name (via rights.rights_holder_id)	Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Requires creation of a new `rights_holder` row, or matching an existing rights holder name.
COPYRIGHT STATUS (B)	copyrightStatus	rights.copyright_status_id	Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Will need to match `term_i18n.name` to get foreign key for `rights.copyright_status_id`.
DATE NOTE (B)	datesOfCreationNote	event_i18n.note, event.type_id = QubitTerm::CREATION_ID?	I think this only applies to "Creation" date, but need to check with Glenn
DATE OF CREATION (T)	datesOfCreation	event_i18n.dates, event.type_id = QubitTerm::CREATION_ID	This event row needs to link to the actor defined by Creators (actor_i18n.authorized_form_of_name) and Creator history (actor_i18n.history) columns, and be the same event row used for START YEAR (event.start_date) and END YEAR (event.end_date) columns.
EDITION (B)	edition	information_object_i18n.edition
END YEAR (T)	datesOfCreationEnd	event.end_date, event.type_id = QubitTerm::CREATION_ID	This event row needs be the same event row used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns.
GENERAL NOTE (B)	radNoteGeneral	note_i18n.content, note.type_id = ???	Each "Rad note" type is a separate row in the "note" table. See note for Rad note – conservation.
ITEM NUMBER (B)	identifier	information_object.identifier	CVA wants to prefix with a delimiter (e.g. ": ") to differentiate from other identifiers. Needs to be merged to single "identifier" column
LANG NOTE (B)	language	property.name = 'language', property_i18n.value	Language is written to property_i18n.value as a serialized PHP array of the form Array('en', 'fr', 'de') where the values are a two letter ISO639-1 language code. The property.source_culture = 'en' and property_i18n.culture = 'en' in all cases (e.g. the culture is always forced to English) and translation of language name is handled by symfony's format_language() helper function.
LOCATION (R)	physicalStorageName	physical_object_i18n.name	Copy value to Physical storage name and location fields
LOCATION (R)	physicalStorageLocation	physical_object_i18n.location	Copy value to Physical storage name and location fields
TYPE (R)	n/a	None	Do not migrate
NAME OF CREATOR (R)	creators	actor_i18n.authorized_form_of_name, event.type_id = QubitTerm::CREATION_ID	Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). If multiple creators are indicated (we ask for a pipe "\|" delimiter for multi-value fields) then link, in order, to creator history (e.g. if 2 creators and 1 history, then link the first creator to the creator history). Use the same principle (e.g. match in order) for Creation dates. The related `actor` row needs to be the same one used for Creator history (actor_i18n.history) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date).
OLD PHOTO NO (B)	radNoteAlphaNumericDesignation	note_i18n.content, note.type_id = ???	Prefix with "Old Photo Number: " and map as "Rad note - alpha-numeric designation". Create a separate note entry or concatenate with other alpha-numeric notes?
OTHER CREATORS (R)	creators	actor_i18n.authorized_form_of_name	Creator names without history or creation dates. Concatenate to "Creators" column using pipe delimiter.
PART OF FILE (B)	n/a	None	Use to calculate "parent_id" value
PART OF FONDS (R)	n/a	None	Use to calculate "parent_id" value
PART OF SERIES (R)	n/a	None	Use to calculate "parent_id" value
PART OF SOUSFONDS (R)	n/a	None	Use to calculate "parent_id" value
PART OF SUBSERIES (R)	n/a	None	Use to calculate "parent_id" value
PHOTO CLIENT (B)	actorCommissionedBy	actor_i18n.authorized_form_of_name (via event)	For any "actor..." field use the string after "actor" (i.e. "Commissioned by") as the event label. Note in this case this will require creating a new term in the "Event type" taxonomy with term_i18n.name = 'Commissioned' a related scope note of 'Commissioned by' (See existing "event type" terms).
PHOTO GEOG (B)	placeAccessPoints	term_i18n.name (via object_term_relation table)	Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects.
PHOTO SUBJECTS (B)	subjectAccessPoints	term_i18n.name (via object_term_relation table)	Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects.
PHOTOGRAPHER (B)	actorPhotographedBy	note_i18n.content, note.type_id = ???	For any "actor..." field use the the text after "actor" (i.e. PhotographedBy) for the term name (Photographed) and scope note (Photographed by)
PHY COND NOTE (B)	physicalCharacteristics	information_object_i18n.physical_characteristics
PHYS CHAR NOTE (B)	physicalCharacteristics	information_object_i18n.physical_characteristics	Must be concatenated to physicalCharacteristics column
PRI REC NO (R)	identifier	information_object_i18n.identifier	Need to merge this with other identifier fields. Should only be applied to top-level descriptions (ignore at lower levels). If possible then automate creation of an auto-incremented integer number (within the current branch) as an identifier for lower-level descriptions.
PUB SERIES (B)	titleProperOfPublishersSeries	property_i18n.value	Create a new `property` row related to the information object (one-to-many) by property.object_id = information_object.id. Set property.name = 'titleProperOfPublishersSeries'.
RECORD_ID	sources AND UNIQUE_ID	information_object_i18n.sources	Prefix with "CS-Records ID: " in sources field. Use as "UNIQUE_ID" for mapping source to target rows.
RELATED RECORDS (R)	Related units of description	information_object_i18n.related_units_of_description
RRDB_* (B)	accessConditions	information_object_i18n.access_conditions	All of the RRDB_* (Rights) columns will be concatenated into a single 'access conditions' column according to rules laid out by CVA
SCALE (B)	statementOfScaleArchitectural	property_i18n.value	Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively.
SCALE (B)	statementOfScaleCartographic	property_i18n.value	Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively.
START YEAR (T)	datesOfCreationStart	event.start_date, event.type_id = QubitTerm::CREATION_ID	Use the same event row that is used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns.
THUMBNAIL (B)	n/a	None	Do not migrate
TIFF and JPEG Number (B)	n/a	None	Do not migrate
TIFFIMAGE (B)	n/a	None	Do not migrate
TITLE NOTE (B)	radTitleNoteContinuationOfTitle	note_i18n.content, note.type_id = ???	Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of `SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;`
TITLE NOTE (B)	radTitleNoteVariationsInTitle	note_i18n.content, note.type_id = ???	Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of `SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;`
TITLE SOURCE (R)	radTitleNoteSourceOfTitleProper	note_i18n.content, note.type_id = ???	New related note object with note.type_id = `SELECT id FROM term, term_i18n i18n WHERE term.id = i18n.id AND term.taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID AND i18n.culture='en' AND i18n.name = 'Source of title proper';`

AABC Toolkit: City of Vancouver data migration

Contents

Data sources

Content server data

Other sources

Process

Importing

Challenges

Data mapping

Private record data mapping

Account

Page

AABC Toolkit: City of Vancouver data migration

Contents

Data sources

Content server data

Other sources

Process

Importing

Challenges

Data mapping

Private record data mapping

Account

Page

Search