AABC Toolkit: City of Vancouver data migration

From AtoM wiki
Revision as of 17:49, 30 July 2015 by Dan (talk | contribs) (Fix spacing, add footer navigation links)

Main Page > Development > Development/Projects > Development/Projects/AABC Data Migration Toolkit > CVA

Data sources

Content server data

  • approx. 125,000 Private Records (from private donors)
  • approx. 87,000 Public Records (City created documents)
  • approx. 9,700 Maps and Pamphlets
  • approx. 3,350 Public documents (published materials issued by the City)
  • 812 Accession records
  • 224 Authority records (for Public Records only)
  • 195 Authority records relationships
  • 88 Ephemera records (private sources)

Other sources

  •  ??? ICA-AtoM 1.1 records
  • approx. 70,000 digital objects (TIFF)

Process

For Content Server data:

  • Exported data to CSV from Content Server,
  • Transformed data to match ICA-AtoM RAD CSV template via "Pentaho Data Integration" (PDI) tool,
  • imported into ICA-AtoM via CSV import command line (CLI) script.

For ICA-AtoM 1.1 data:

  • Migrate and import via standard ICA-AtoM upgrade procedure.

For digital objects:

  • Processed RAW TIFFs via Archivematica to produce DIPs (expand?),
  • Then imported using CLI DIP import script

Importing

Import of the private records initially took 1 second per record on a conventional workstation, but an import speed of 0.12 seconds per record, initially, was achieved using a ~15GB RAM Rackspace Cloud instance (~$0.90/hour?) with an Ubuntu 11.04 LAMP stack. Import would slow down as it went, however, so the CSV dump was split into 10000 row chunks. The import slows down as it goes, but we were able to import 125K rows in ~12 hours after using a patch to disable nested set (hierarchy) updating. Post-import search indexing is very slow still.

Following are the steps to do the import (using Gnu Screen to keep the session alive).

  1. SSH to import server
  2. Download CSV data to /root
  3. Enter apt-get update
  4. Enter apt-get install -y tasksel subversion screen
  5. Enter tasksel install lamp-server
  6. Enter /usr/sbin/apache2ctl restart
  7. Enter cd /var/www
  8. Check out qubit Subversion trunk to qubit
  9. Enter chown -R www-data:www-data qubit
  10. Create qubit MySQL database
  11. Visit Qubit with browser to complete installation
  12. Enter cd qubit
  13. Enter screen
  14. Enter ./symfony csv:import path/to/first/import/file --rows-until-update=5 --error-log=/root/errors.log --source-name=source_name
  15. In the import command, "source_name" is used when recording mapping of legacy IDs to new Qubit IDs

Setup after initial import.

  1. Update nested set data: ./symfony propel:build-nested-set (took 3.33 minutes for 125K imported objects)

Challenges

  • Private Records CSV export included 100+ columns, and over 1 million data points.
  • Google Refine could not handle this much data, and could not be used for data transformation or normalization

Data mapping

LEGEND
"B" - [mostly] blank (haven't checked all records)
"T" - human readable text
"R" - machine readable (i.e. populated from a value list)
Problem mapping
Requires pre-processing
Do not migrate

Private record data mapping

Original CVA column RAD CSV template column Qubit column Notes
ACCESSION NUMBER (B) accessionNumber accession.identifier (via `relation`) Add an accession record related via `relation` table (many-to-many)
ADMIN HIST OR BIO (T) creatorsHistory actor_i18n.history, event.type_id = QubitTerm::CREATION_ID Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). The event.type_id column is required and is defined in lib/model/QubitTerm.php as a class constant (i.e. CREATION_ID = 111). The related `actor` row needs to be the same one used for Creators (actor_i18n.authorized_form_of_name) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date).
ARCH ADDRESS (B) scopeAndContent information_object_i18n.scope_and_content This column will be merged into the "Scope and content" column before import (See "Migration Instructions - CS Records to ICA AtoM - Public Records Export Set.XLS")
ARCHIVIST (RT) revision information_object_i18n.revision Prefix with "Created by "
ARTIST (B) nameAccessPoints actor_i18n.authorized_form_of_name (relation.type_id = QubitTerm::NAME_ACCESS_POINT_ID) Name access point is related to information object via many-to-many "relation" table (relation.subject_id = information_object.id, relation.object_id = actor.id). Need to test if an actor already exists with the specified name, if actor name DOES exist, then just create a new relation between actor and information object, if actor DOES NOT exist, then create a new actor row with actor_i18n.authorized_form_of_name equal to column value, then link via relation.
ASSOCIATED MATERIAL (B) relatedUnitsOfDescription information_object_i18n.related_units_of_description
CONSERVATION NOTE (B) radNoteConservation note_i18n.content, note.type_id = ??? Each "Rad note" type is a separate row in the "note" table, related to information object by (one-to-many) fk note.information_object_id. Getting the note.type_id (required) is tricky, you'll have to search like SELECT term.id FROM term, term_i18n i18n WHERE term.id = i18n.id and term.taxonomy_id = 51 and i18n.culture = 'en' and i18n.name = 'Conservation'; (N.B. QubitTaxonomy::RAD_NOTE_ID = 51)
COPYRIGHT EXPIRES (B) copyrightExpires rights.end_date Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project.
COPYRIGHT STATUS (B) copyrightHolder rights_holder.authorized_form_of_name (via rights.rights_holder_id) Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Requires creation of a new `rights_holder` row, or matching an existing rights holder name.
COPYRIGHT STATUS (B) copyrightStatus rights.copyright_status_id Not part of RAD. Would be preferable to include copyright information in a "Rights" import module, but out-of-scope for current project. Split rights holder name into separate "copyrightHolder" field. Will need to match `term_i18n.name` to get foreign key for `rights.copyright_status_id`.
DATE NOTE (B) datesOfCreationNote event_i18n.note, event.type_id = QubitTerm::CREATION_ID? I think this only applies to "Creation" date, but need to check with Glenn
DATE OF CREATION (T) datesOfCreation event_i18n.dates, event.type_id = QubitTerm::CREATION_ID This event row needs to link to the actor defined by Creators (actor_i18n.authorized_form_of_name) and Creator history (actor_i18n.history) columns, and be the same event row used for START YEAR (event.start_date) and END YEAR (event.end_date) columns.
EDITION (B) edition information_object_i18n.edition
END YEAR (T) datesOfCreationEnd event.end_date, event.type_id = QubitTerm::CREATION_ID This event row needs be the same event row used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns.
GENERAL NOTE (B) radNoteGeneral note_i18n.content, note.type_id = ??? Each "Rad note" type is a separate row in the "note" table. See note for Rad note – conservation.
ITEM NUMBER (B) identifier information_object.identifier CVA wants to prefix with a delimiter (e.g. ": ") to differentiate from other identifiers. Needs to be merged to single "identifier" column
LANG NOTE (B) language property.name = 'language', property_i18n.value Language is written to property_i18n.value as a serialized PHP array of the form Array('en', 'fr', 'de') where the values are a two letter ISO639-1 language code. The property.source_culture = 'en' and property_i18n.culture = 'en' in all cases (e.g. the culture is always forced to English) and translation of language name is handled by symfony's format_language() helper function.
LOCATION (R) physicalStorageName physical_object_i18n.name Copy value to Physical storage name and location fields
LOCATION (R) physicalStorageLocation physical_object_i18n.location Copy value to Physical storage name and location fields
TYPE (R) n/a None Do not migrate
NAME OF CREATOR (R) creators actor_i18n.authorized_form_of_name, event.type_id = QubitTerm::CREATION_ID Creators are an `actor` linked to information_object via the many-to-many `event` table (event.information_object_id <-> event.actor_id). If multiple creators are indicated (we ask for a pipe "|" delimiter for multi-value fields) then link, in order, to creator history (e.g. if 2 creators and 1 history, then link the *first* creator to the creator history). Use the same principle (e.g. match in order) for Creation dates. The related `actor` row needs to be the same one used for Creator history (actor_i18n.history) column. The `event` row needs to be the same one used for Creation date (event_i18n.dates), START YEAR (event.start_date) and END YEAR (event.end_date).
OLD PHOTO NO (B) radNoteAlphaNumericDesignation note_i18n.content, note.type_id = ??? Prefix with "Old Photo Number: " and map as "Rad note - alpha-numeric designation". Create a separate note entry or concatenate with other alpha-numeric notes?
OTHER CREATORS (R) creators actor_i18n.authorized_form_of_name Creator names without history or creation dates. Concatenate to "Creators" column using pipe delimiter.
PART OF FILE (B) n/a None Use to calculate "parent_id" value
PART OF FONDS (R) n/a None Use to calculate "parent_id" value
PART OF SERIES (R) n/a None Use to calculate "parent_id" value
PART OF SOUSFONDS (R) n/a None Use to calculate "parent_id" value
PART OF SUBSERIES (R) n/a None Use to calculate "parent_id" value
PHOTO CLIENT (B) actorCommissionedBy actor_i18n.authorized_form_of_name (via event) For any "actor..." field use the string after "actor" (i.e. "Commissioned by") as the event label. Note in this case this will require creating a new term in the "Event type" taxonomy with term_i18n.name = 'Commissioned' a related scope note of 'Commissioned by' (See existing "event type" terms).
PHOTO GEOG (B) placeAccessPoints term_i18n.name (via object_term_relation table) Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects.
PHOTO SUBJECTS (B) subjectAccessPoints term_i18n.name (via object_term_relation table) Place and subject access points live in the `term` (+ `term_i18n`) table, and have a many-to-many relationship to `information_object` via the `object_term_relation` table [object_term_relation.object_id (information_object.id), object_term_relation.term_id (term.id)]. The term.taxonomy_id field is required and must equal to QubitTaxonomy::PLACE_ID (42) for Places and QubitTaxonomy::SUBJECT_ID (35) for Subjects.
PHOTOGRAPHER (B) actorPhotographedBy note_i18n.content, note.type_id = ??? For any "actor..." field use the the text after "actor" (i.e. PhotographedBy) for the term name (Photographed) and scope note (Photographed by)
PHY COND NOTE (B) physicalCharacteristics information_object_i18n.physical_characteristics
PHYS CHAR NOTE (B) physicalCharacteristics information_object_i18n.physical_characteristics Must be concatenated to physicalCharacteristics column
PRI REC NO (R) identifier information_object_i18n.identifier Need to merge this with other identifier fields. Should only be applied to top-level descriptions (ignore at lower levels). If possible then automate creation of an auto-incremented integer number (within the current branch) as an identifier for lower-level descriptions.
PUB SERIES (B) titleProperOfPublishersSeries property_i18n.value Create a new `property` row related to the information object (one-to-many) by property.object_id = information_object.id. Set property.name = 'titleProperOfPublishersSeries'.
RECORD_ID sources AND UNIQUE_ID information_object_i18n.sources Prefix with "CS-Records ID: " in sources field. Use as "UNIQUE_ID" for mapping source to target rows.
RELATED RECORDS (R) Related units of description information_object_i18n.related_units_of_description
RRDB_* (B) accessConditions information_object_i18n.access_conditions All of the RRDB_* (Rights) columns will be concatenated into a single 'access conditions' column according to rules laid out by CVA
SCALE (B) statementOfScaleArchitectural property_i18n.value Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively.
SCALE (B) statementOfScaleCartographic property_i18n.value Write to "statement of scale (architectural)" or "statement of scale (cartographic)" column based on mediatype - property.name = 'statementOfScaleArchitectural' or property.name = 'statementOfScaleCartographic' respectively.
START YEAR (T) datesOfCreationStart event.start_date, event.type_id = QubitTerm::CREATION_ID Use the same event row that is used for DATE OF CREATION (event_i18n.date) and NAME OF CREATOR (event.actor_id) columns.
THUMBNAIL (B) n/a None Do not migrate
TIFF and JPEG Number (B) n/a None Do not migrate
TIFFIMAGE (B) n/a None Do not migrate
TITLE NOTE (B) radTitleNoteContinuationOfTitle note_i18n.content, note.type_id = ??? Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;
TITLE NOTE (B) radTitleNoteVariationsInTitle note_i18n.content, note.type_id = ??? Split into "RAD note - continuation of title" or "RAD note - variations in title" based on prefix within field. Note type (note.type_id) is one of SELECT id FROM term WHERE taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID;
TITLE SOURCE (R) radTitleNoteSourceOfTitleProper note_i18n.content, note.type_id = ??? New related note object with note.type_id = SELECT id FROM term, term_i18n i18n WHERE term.id = i18n.id AND term.taxonomy_id = QubitTaxonomy::RAD_TITLE_NOTE_ID AND i18n.culture='en' AND i18n.name = 'Source of title proper';