AABC Toolkit: City of Vancouver data migration
Main Page > Development > Development/Projects > Development/Projects/AABC Data Migration Toolkit > CVA
Contents
Data sources
Content server data
- approx. 125,000 Private Records (from private donors)
- approx. 87,000 Public Records (City created documents)
- approx. 9,700 Maps and Pamphlets
- approx. 3,350 Public documents (published materials issued by the City)
- 812 Accession records
- 224 Authority records (for Public Records only)
- 195 Authority records relationships
- 88 Ephemera records (private sources)
Other sources
- ??? ICA-AtoM 1.1 records
- approx. 70,000 digital objects (TIFF)
Process
For Content Server data:
- Exported data to CSV from Content Server,
- Transformed data to match ICA-AtoM RAD CSV template via "Pentaho Data Integration" (PDI) tool,
- imported into ICA-AtoM via CSV import command line (CLI) script.
For ICA-AtoM 1.1 data:
- Migrate and import via standard ICA-AtoM upgrade procedure.
For digital objects:
- Processed RAW TIFFs via Archivematica to produce DIPs (expand?),
- Then imported using CLI DIP import script
Importing
Import of the private records initially took 1 second per record on a conventional workstation, but an import speed of 0.12 seconds per record, initially, was achieved using a ~15GB RAM Rackspace Cloud instance (~$0.90/hour?) with an Ubuntu 11.04 LAMP stack. Import would slow down as it went, however, so the CSV dump was split into 10000 row chunks. The import slows down as it goes, but we were able to import 125K rows in ~12 hours after using a patch to disable nested set (hierarchy) updating. Post-import search indexing is very slow still.
Following are the steps to do the import (using Gnu Screen to keep the session alive).
- SSH to import server
- Download CSV data to /root
- Enter apt-get update
- Enter apt-get install -y tasksel subversion screen
- Enter tasksel install lamp-server
- Enter /usr/sbin/apache2ctl restart
- Enter cd /var/www
- Check out qubit Subversion trunk to qubit
- Enter chown -R www-data:www-data qubit
- Create qubit MySQL database
- Visit Qubit with browser to complete installation
- Enter cd qubit
- Enter screen
- Enter ./symfony csv:import path/to/first/import/file --rows-until-update=5 --error-log=/root/errors.log --source-name=source_name
- In the import command, "source_name" is used when recording mapping of legacy IDs to new Qubit IDs
Setup after initial import.
- Update nested set data: ./symfony propel:build-nested-set (took 3.33 minutes for 125K imported objects)
Challenges
- Private Records CSV export included 100+ columns, and over 1 million data points.
- Google Refine could not handle this much data, and could not be used for data transformation or normalization
Data mapping
LEGEND |
---|
"B" - [mostly] blank (haven't checked all records) |
"T" - human readable text |
"R" - machine readable (i.e. populated from a value list) |
Problem mapping |
Requires pre-processing |
Do not migrate |
Private record data mapping
Original CVA column | RAD CSV template column | Qubit column | Notes | |
---|---|---|---|---|
ACCESSION NUMBER (B) | accessionNumber | accession.identifier (via `relation`) | Add an accession record related via `relation` table (many-to-many) | |
ADMIN HIST OR BIO (T) | creatorsHistory | actor_i18n.history, event.type_id = QubitTerm::CREATION_ID | Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). The event.type_id column is required and is defined in lib/model/QubitTerm.php as a class constant (i.e. CREATION_ID = 111). The related `actor` row needs to be the same one used for Creators (actor_i18n.authorized_form_of_name) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date). | |
ARCH ADDRESS (B) | scopeAndContent | information_object_i18n.scope_and_content | This column will be merged into the "Scope and content" column before import (See "Migration Instructions - CS Records to ICA AtoM - Public Records Export Set.XLS") | |
ARCHIVIST (RT) | revision | information_object_i18n.revision | Prefix with "Created by " | |
ARTIST (B) | nameAccessPoints | actor_i18n.authorized_form_of_name (relation.type_id = QubitTerm::NAME_ACCESS_POINT_ID) | Name access point is related to information object via many-to-many "relation" table (relation.subject_id = information_object.id, relation.object_id = actor.id). Need to test if an actor already exists with the specified name, if actor name DOES exist, then just create a new relation between actor and information object, if actor DOES NOT exist, then create a new actor row with actor_i18n.authorized_form_of_name equal to column value, then link via relation. | |
ASSOCIATED MATERIAL (B) | relatedUnitsOfDescription | information_object_i18n.related_units_of_description | ||
CONSERVATION NOTE (B) | radNoteConservation | note_i18n.content, note.type_id = ??? | Each "Rad note" type is a separate row in the "note" table, related to information object by (one-to-many) fk note.information_object_id. Getting the note.type_id (required) is tricky, you'll have to search like SELECT term.id FROM term, term_i18n i18n WHERE term.id = i18n.id and term.taxonomy_id = 51 and i18n.culture = 'en' and i18n.name = 'Conservation'; (N.B. QubitTaxonomy::RAD_NOTE_ID = 51)
| |
COPYRIGHT EXPIRES (B) | copyrightExpires | rights.end_date | Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. | |
COPYRIGHT STATUS (B) | copyrightHolder | rights_holder.authorized_form_of_name (via rights.rights_holder_id) | Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Requires creation of a new `rights_holder` row, or matching an existing rights holder name. | |
COPYRIGHT STATUS (B) | copyrightStatus | rights.copyright_status_id | Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Will need to match `term_i18n.name` to get foreign key for `rights.copyright_status_id`. | |
DATE NOTE (B) | datesOfCreationNote | event_i18n.note, event.type_id = QubitTerm::CREATION_ID? | I think this only applies to "Creation" date, but need to check with Glenn | |
DATE OF CREATION (T) | datesOfCreation | event_i18n.dates, event.type_id = QubitTerm::CREATION_ID | This event row needs to link to the actor defined by Creators (actor_i18n.authorized_form_of_name) and Creator history (actor_i18n.history) columns, and be the same event row used for START YEAR (event.start_date) and END YEAR (event.end_date) columns. | |
EDITION (B) | edition | information_object_i18n.edition | ||
END YEAR (T) | datesOfCreationEnd | event.end_date, event.type_id = QubitTerm::CREATION_ID | This event row needs be the same event row used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns. | |
GENERAL NOTE (B) | radNoteGeneral | note_i18n.content, note.type_id = ??? | Each "Rad note" type is a separate row in the "note" table. See note for Rad note – conservation. | |
ITEM NUMBER (B) | identifier | information_object.identifier | CVA wants to prefix with a delimiter (e.g. ": ") to differentiate from other identifiers. Needs to be merged to single "identifier" column | |
LANG NOTE (B) | language | property.name = 'language', property_i18n.value | Language is written to property_i18n.value as a serialized PHP array of the form Array('en', 'fr', 'de') where the values are a two letter ISO639-1 language code. The property.source_culture = 'en' and property_i18n.culture = 'en' in all cases (e.g. the culture is always forced to English) and translation of language name is handled by symfony's format_language() helper function. | |
LOCATION (R) | physicalStorageName | physical_object_i18n.name | Copy value to Physical storage name and location fields | |
LOCATION (R) | physicalStorageLocation | physical_object_i18n.location | Copy value to Physical storage name and location fields | |
TYPE (R) | n/a | None | Do not migrate | |
NAME OF CREATOR (R) | creators | actor_i18n.authorized_form_of_name, event.type_id = QubitTerm::CREATION_ID | Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). If multiple creators are indicated (we ask for a pipe "|" delimiter for multi-value fields) then link, in order, to creator history (e.g. if 2 creators and 1 history, then link the *first* creator to the creator history). Use the same principle (e.g. match in order) for Creation dates. The related `actor` row needs to be the same one used for Creator history (actor_i18n.history) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date). | |
OLD PHOTO NO (B) | radNoteAlphaNumericDesignation | note_i18n.content, note.type_id = ??? | Prefix with "Old Photo Number: " and map as "Rad note - alpha-numeric designation". Create a separate note entry or concatenate with other alpha-numeric notes? | |
OTHER CREATORS (R) | creators | actor_i18n.authorized_form_of_name | Creator names without history or creation dates. Concatenate to "Creators" column using pipe delimiter. | |
PART OF FILE (B) | n/a | None | Use to calculate "parent_id" value | |
PART OF FONDS (R) | n/a | None | Use to calculate "parent_id" value | |
PART OF SERIES (R) | n/a | None | Use to calculate "parent_id" value | |
PART OF SOUSFONDS (R) | n/a | None | Use to calculate "parent_id" value | |
PART OF SUBSERIES (R) | n/a | None | Use to calculate "parent_id" value | |
PHOTO CLIENT (B) | actorCommissionedBy | actor_i18n.authorized_form_of_name (via event) | For any "actor..." field use the string after "actor" (i.e. "Commissioned by") as the event label. Note in this case this will require creating a new term in the "Event type" taxonomy with term_i18n.name = 'Commissioned' a related scope note of 'Commissioned by' (See existing "event type" terms). | |
PHOTO GEOG (B) | placeAccessPoints | term_i18n.name (via object_term_relation table) | Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects. | |
PHOTO SUBJECTS (B) | subjectAccessPoints | term_i18n.name (via object_term_relation table) | Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects. | |
PHOTOGRAPHER (B) | actorPhotographedBy | note_i18n.content, note.type_id = ??? | For any "actor..." field use the the text after "actor" (i.e. PhotographedBy) for the term name (Photographed) and scope note (Photographed by) | |
PHY COND NOTE (B) | physicalCharacteristics | information_object_i18n.physical_characteristics | ||
PHYS CHAR NOTE (B) | physicalCharacteristics | information_object_i18n.physical_characteristics | Must be concatenated to physicalCharacteristics column | |
PRI REC NO (R) | identifier | information_object_i18n.identifier | Need to merge this with other identifier fields. Should only be applied to top-level descriptions (ignore at lower levels). If possible then automate creation of an auto-incremented integer number (within the current branch) as an identifier for lower-level descriptions. | |
PUB SERIES (B) | titleProperOfPublishersSeries | property_i18n.value | Create a new `property` row related to the information object (one-to-many) by property.object_id = information_object.id. Set property.name = 'titleProperOfPublishersSeries'. | |
RECORD_ID | sources AND UNIQUE_ID | information_object_i18n.sources | Prefix with "CS-Records ID: " in sources field. Use as "UNIQUE_ID" for mapping source to target rows. | |
RELATED RECORDS (R) | Related units of description | information_object_i18n.related_units_of_description | ||
RRDB_* (B) | accessConditions | information_object_i18n.access_conditions | All of the RRDB_* (Rights) columns will be concatenated into a single 'access conditions' column according to rules laid out by CVA | |
SCALE (B) | statementOfScaleArchitectural | property_i18n.value | Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively. | |
SCALE (B) | statementOfScaleCartographic | property_i18n.value | Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively. | |
START YEAR (T) | datesOfCreationStart | event.start_date, event.type_id = QubitTerm::CREATION_ID | Use the same event row that is used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns. | |
THUMBNAIL (B) | n/a | None | Do not migrate | |
TIFF and JPEG Number (B) | n/a | None | Do not migrate | |
TIFFIMAGE (B) | n/a | None | Do not migrate | |
TITLE NOTE (B) | radTitleNoteContinuationOfTitle | note_i18n.content, note.type_id = ??? | Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;
| |
TITLE NOTE (B) | radTitleNoteVariationsInTitle | note_i18n.content, note.type_id = ??? | Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;
| |
TITLE SOURCE (R) | radTitleNoteSourceOfTitleProper | note_i18n.content, note.type_id = ??? | New related note object with note.type_id = SELECT id FROM term, term_i18n i18n WHERE term.id = i18n.id AND term.taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID AND i18n.culture='en' AND i18n.name = 'Source of title proper';
|