Difference between revisions of "Resources/CSV transformation"

From AtoM wiki
(Configuration)
 
(One intermediate revision by one other user not shown)
Line 5: Line 5:
  
 
<admonition type="seealso">
 
<admonition type="seealso">
* For more information on CSV import, see the CSV section of our [https://accesstomemory.org/docs User manual.]  
+
* For more information on CSV import, see the CSV section of our [https://accesstomemory.org/docs User manual.]
 
* For copies of the CSV import example templates, and change logs across major versions, see: [[Resources/CSV templates|CSV templates]]
 
* For copies of the CSV import example templates, and change logs across major versions, see: [[Resources/CSV templates|CSV templates]]
 
</admonition>
 
</admonition>
Line 49: Line 49:
 
MySQL is used as a working area for transformation data. This is done to avoid running out of RAM when working with large datasets.
 
MySQL is used as a working area for transformation data. This is done to avoid running out of RAM when working with large datasets.
  
Currently using the "root" MySQL user is required and the database used must be named "import", but this will be configurable at some point.
+
When running transformations you'll need to set a series of MySQL access-related environmental variables. In Linux/OS X:
 
 
Set the MYSQL_PASSWORD environmental variable to your password. In Linux/OS X:
 
  
 +
    export MYSQL_USER='youruser'
 
     export MYSQL_PASSWORD='yourpassword'
 
     export MYSQL_PASSWORD='yourpassword'
 
+
    export MYSQL_HOST='yourmysqlhost'
 +
    export MYSQL_DB='yourdbname'
  
 
===Writing transformation scripts===
 
===Writing transformation scripts===
Line 142: Line 142:
 
===How to create transformations===
 
===How to create transformations===
  
See example.transform.php for an incomplete example.
+
See <code>example.transform.php</code> for an incomplete example.
  
 
More documentation as time permits.
 
More documentation as time permits.
 
  
 
===Bugs/caveats===
 
===Bugs/caveats===

Latest revision as of 15:42, 1 May 2021

Main Page > Resources > Resources/CSV transformation

By creating a custom script, you can transform CSV files exported from other systems into CSV that is compatible with AtoM's CSV import tools.

Seealso

  • For more information on CSV import, see the CSV section of our User manual.
  • For copies of the CSV import example templates, and change logs across major versions, see: CSV templates

Transformation classes

AtoM comes with PHP classes that help in the creation of scripts to change CSV file data so it can be imported into AtoM using AtoM's CLI import tools.

These classes can be used to process a CSV file, doing the following:

  • Set child IDs to parent IDs (when importing hierarchical data)
  • Add columns
  • Rename columns
  • Change column data

For an example of what one of these scripts looks like, check out lib/task/import/example/transformation/example.transform.php.

You can run the example by entering:

   php symfony csv:custom-import \
     --import-definition=lib/task/import/example/transformation/example.transform.php \
     --output-file=/tmp/transformed.csv lib/task/import/example/transformation/example.csv

System requirements

  • AtoM
  • MySQL

Included PHP classes

The QubitCsvTransform class provides the base functionality for transformation.

The QubitCsvTransformFactory class provides a convenient way to create instances of QubitCsvTransform objects configured to meet common transformation needs.

Configuration

MySQL is used as a working area for transformation data. This is done to avoid running out of RAM when working with large datasets.

When running transformations you'll need to set a series of MySQL access-related environmental variables. In Linux/OS X:

   export MYSQL_USER='youruser'
   export MYSQL_PASSWORD='yourpassword'
   export MYSQL_HOST='yourmysqlhost'
   export MYSQL_DB='yourdbname'

Writing transformation scripts

When leveraging the QubitCsvTransformFactory class, data gets transformed in three steps:

  1. Any necessary manipulation of data is performed and, if importing hierarchical data, if a row is determined to be a parents its legacy ID is noted
  2. If importing hierarchical data, each row's parentId column is set and the row is stored in MySQL
  3. Data is retrieved from MySQL, sorted by the data by level of description, and written to CSV files (each CSV file contains up to 10,000 rows)

Parent/child association

If you're going to use parent/child association you'll need to either include "legacyId" and "parentId" columns in your input CSV or rename/add them using your transform logic (see the example transform which renames the input CSV's ID field and adds a parent ID column).

You'll need to create logic to determine a row's key, if it the row is a parent, and logic to lookup a parent's key, if the row is a child. See the example transform (until more documentation is added here).

Parents need to be added to CSV data before their children so during import the parents' Qubit IDs can be established and associated with subsequent children. Because of this, transformation logic needs to sort CSV data by level of description. The level of description for each CSV row should be specified in a column named, or renamed, levelOfDescription.

Sorting requires storing of CSV data and because we don't want to store data in RAM, we store it in MySQL.

Adding columns

The addColumns option can be used to add new columns to the CSV data.

For example:

   $setup = new QubitCsvTransformFactory(array(
     'cliOptions'              => $options,
     'machineName'             => 'accessions',
     'addColumns'              => array(
       'someColumn',
       'someOtherColumn'
     )
   ));

Transforming column data

The 'transformationLogic' option can be used to apply ad-hoc transformation logic to the CSV data.

   $transformLogic = function(&$self)
   {
     $self->columnValue('a column', $self->columnValue('another column'));
   };
   $setup = new QubitCsvTransformFactory(array(
     'cliOptions'              => $options,
     'machineName'             => 'accessions',
     'transformLogic'          => $transformLogic
   ));

Amalgamating columns

There's also a helper method for amalgamating multiple columns, allowing column data to be prefixed.

   $transformLogic = function(&$self)
   {
     $self->amalgamateColumns(array(
       'NOTES',
       'More:' => 'MORE NOTES'
     ), 'radNoteGeneral');
   };

Running a transformation

Run a transformation script from your Qubit root directory using:

    php symfony csv:custom-import \
     --import-definition=/path/to/transformation/script.php \
      --output-file=/tmp/transformed.csv \
      /path/to/transform_me.csv

Testing a transformation

One way to test transformations during development is to create a copy of the source data with all data rows removed except one. In this data row you put values that, when viewed in Qubit after an import of the transformed CSV file, will show you that data going through transformation is arriving in the intended Qubit fields.

How to create transformations

See example.transform.php for an incomplete example.

More documentation as time permits.

Bugs/caveats

CSV data runs through the fgetcsv function which automatically trims leading space from column values.