
3 Ingest - Reference Documentation

Authors: Lucien van Wouw

Version: 1.4

3 Ingest

The ingest is a sequence of tasks whereby the submission package is transformed into the archival package. The latter is then persisted into the archival storage. These tasks are performed by various services, each performing operations following their own logic.

3.1 Ingest instruction service

This service becomes available during the submission stage, once the Instruction validation service has confirmed that all files are in pristine condition. The service initiates each individually declared file in the instruction, thus kick-starting the file's lifecycle. For an addition, delete or update action, this service adheres to the following requirements:
  1. Select each stagingfile element in the instruction.
  2. Verify if the element has been validated.
  3. Set the first task to start the cycle of the file's ingest career.
    1. If all stagingfile elements are set, return success.
    2. If not, unset all tasks and throw an error.
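
The steps above can be sketched as follows. This is a minimal illustration, not the service's actual code: the StagingFile class, its validated and task fields, and the task name "IngestMaster" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StagingFile:
    """Hypothetical stand-in for a stagingfile element in the instruction."""
    location: str
    validated: bool
    task: Optional[str] = None


def start_ingest(stagingfiles: List[StagingFile], first_task: str = "IngestMaster") -> bool:
    """Set the first task on every validated stagingfile, or roll back and fail."""
    for stagingfile in stagingfiles:
        if not stagingfile.validated:
            # One unvalidated element fails the whole instruction:
            # unset every task set so far, then raise.
            for other in stagingfiles:
                other.task = None
            raise ValueError(f"stagingfile not validated: {stagingfile.location}")
        stagingfile.task = first_task
    return True
```

The rollback keeps the instruction in an all-or-nothing state, so a partially started ingest is never left behind.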

3.2 Workflow controller service

The workflow controller is the brain of the object repository.

Directly after the Ingest instruction procedure has succeeded, the controller appends a stack of logically ordered tasks to each stagingfile element within the instruction. For example: first ingest the master file, then bind a persistent identifier to the dissemination package's resolve URL, then create derivatives. The same workflow controller picks up these stacked tasks one by one. For each task it sends a message to a designated message queue.

Each Ingest service listens to its own designated message queue. It executes its own logic after a task is taken from the message queue, and reports the task status (success or failure) back to the instruction's task element. There the workflow controller can pick it up again and decide what to do next.

Once all the tasks are processed, the workflow ends for the staged file. When the tasks of all staged files have completed, the ingest is complete, with or without any remaining issues.
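
The controller loop described above might look roughly like the sketch below. It is a simplified, single-process illustration: the function name, the task names used in the test data and the rule that a failed task halts the file's workflow are assumptions, and the services are simulated as plain callables rather than separate queue listeners.

```python
import queue
from collections import deque
from typing import Callable, Dict, List, Tuple


def run_workflow(
    stagingfile: str,
    tasks: List[str],
    services: Dict[str, Callable[[str], bool]],
) -> List[Tuple[str, str]]:
    """Process a stagingfile's task stack one task at a time.

    Each task is sent to its own queue; the matching service (simulated
    synchronously here) takes the message, runs its logic and reports a status.
    """
    queues = {name: queue.Queue() for name in services}
    pending = deque(tasks)
    report: List[Tuple[str, str]] = []
    while pending:
        task = pending.popleft()
        queues[task].put(stagingfile)        # controller -> message queue
        message = queues[task].get()         # service takes the task
        succeeded = services[task](message)  # service runs its own logic
        report.append((task, "success" if succeeded else "failure"))
        if not succeeded:
            break  # assumption: a failed task halts this file's workflow
    return report
```

The returned report plays the role of the status written back to the instruction's task element.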

The administration panel shows the result of the ingest to the content manager.

3.3 Ingest master service

The Ingest master service is specifically intended to persist master files. It is the first task to run when an ingest starts, provided the plan is set to ingest new or updated master files.

For an addition or update action, this service adheres to the following requirements:

  1. Look for the file in the staging area.
    1. If found, add the file to the archival storage.
    2. If not found, update the preservation description information in the archival package.
      1. If no such information is known (no persistent identifier), throw an error.
    3. Compare the checksum between the submission package and the archival package.
      1. When these match, add the preservation description information to the archival package.
      2. If there is a mismatch, throw an error.
  2. Return success.
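
As an illustration of the checksum step, the sketch below covers only the branch in which the file is found in the staging area. A plain directory stands in for the archival storage, and the function names and the shape of the returned record are assumptions for the example, not the service's real interface.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_master(staged: Path, archive_dir: Path, expected_md5: str) -> dict:
    """Copy a staged file into the archive and verify its checksum.

    Returns a hypothetical preservation description record on success.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    stored = archive_dir / staged.name
    shutil.copyfile(staged, stored)
    actual = md5_of(stored)
    if actual != expected_md5:
        raise ValueError(f"checksum mismatch: expected {expected_md5}, got {actual}")
    return {"path": str(stored), "md5": actual, "length": stored.stat().st_size}
```

The checksum is computed from the stored copy rather than from the staged file, so the comparison also detects corruption introduced during the transfer.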

For a delete action:

  1. remove the content data objects ( the master file )
  2. check the removal
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  3. remove the preservation description information
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  4. return success
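
The remove-then-verify pattern of the delete action can be sketched as below. The two stores are modelled as plain mappings keyed by persistent identifier; that model and the function names are assumptions for the example.

```python
from collections.abc import MutableMapping


def remove_and_verify(store: MutableMapping, key: str, label: str) -> None:
    """Remove key from store, then confirm that it is really gone."""
    store.pop(key, None)
    if key in store:
        raise RuntimeError(f"removal of {label} failed for {key}")


def delete_master(pid: str, content: MutableMapping, pdi: MutableMapping) -> bool:
    """Delete the master file first, then its preservation description information."""
    remove_and_verify(content, pid, "content data object")
    remove_and_verify(pdi, pid, "preservation description information")
    return True
```

Removing the content before its description means that a failure halfway through leaves a description without content rather than content that is no longer described.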

3.4 PID services

The Persistent identifier services will bind the fixity information to the resolve URLs of what is the dissemination package. This will produce stable, resolvable URLs to the object repository's consumers.

The PID calling services are only available if the content producer has outsourced, or hosts, a PID webservice. The PID webservice offers SOAP methods to manage the Handle System® resolution technology. Because a PID webservice account is associated with a webservice key to operate it, the producer needs to add that key to their Profile.

Producers that have their own resolver technology should make such bindings themselves using the existing Dissemination conventions available.

3.4.1 PID bind file service

The PID file service binds persistent identifiers to the resolve URLs of stored files. It uses the Handle System's capacity to handle multiple resolve URLs per PID, and follows a convention to make such binds. For the PID:

http://hdl.handle.net/[persistent identifier]?locatt=view:[dissemination type]

and the bind to the resolve URL:

http://[object repository domain]/file/[dissemination type]/[persistent identifier]

The locatt is a qualifier that specifies the dissemination type. Because the dissemination package may offer several views of the data, this qualifier leads the consumer to them. For example, if this was your PID:

12345/1

then the persistent URLs of the master and derivatives become by convention:

view | persistent URL | resolve URL
metadata | http://hdl.handle.net/12345/1 | http://disseminate.objectrepository.org/metadata/12345/1
master | http://hdl.handle.net/12345/1?locatt=view:master | http://disseminate.objectrepository.org/file/master/12345/1
level 1 derivative | http://hdl.handle.net/12345/1?locatt=view:level1 | http://disseminate.objectrepository.org/file/level1/12345/1
level 2 derivative | http://hdl.handle.net/12345/1?locatt=view:level2 | http://disseminate.objectrepository.org/file/level2/12345/1
level 3 derivative | http://hdl.handle.net/12345/1?locatt=view:level3 | http://disseminate.objectrepository.org/file/level3/12345/1
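
The convention can be expressed as two small helper functions. This is only a sketch: the function names are hypothetical, and the special case for the metadata view (no qualifier, and a /metadata/ path) is inferred from the example URLs rather than from a stated rule.

```python
def persistent_url(pid: str, view: str) -> str:
    """Build the Handle URL for a view; the metadata view carries no qualifier."""
    base = f"http://hdl.handle.net/{pid}"
    return base if view == "metadata" else f"{base}?locatt=view:{view}"


def resolve_url(domain: str, pid: str, view: str) -> str:
    """Build the resolve URL that the persistent URL is bound to."""
    if view == "metadata":
        return f"http://{domain}/metadata/{pid}"
    return f"http://{domain}/file/{view}/{pid}"
```

For the PID 12345/1 and the view master, these return http://hdl.handle.net/12345/1?locatt=view:master and, with the domain disseminate.objectrepository.org, http://disseminate.objectrepository.org/file/master/12345/1.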

For an addition or update action, this service adheres to the following requirements:

  1. Construct a PID webservice SOAP request. This request contains the persistent identifier and the qualifier for each dissemination resolve URL.
  2. Call the PID webservice using the PID webservice key.
  3. Check the PID webservice response.
    1. If the response is invalid, the service is unavailable, or the response contains a failure message, throw an error.
  4. Return success.

For a delete action:

  1. remove the persistent identifier from the PID webservice
  2. Check the PID webservice response.
    1. If the response is invalid, the service is unavailable, or the response contains a failure message, throw an error.
  3. return success
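
Both actions share the same response check, which could be sketched as below. The actual SOAP call is deliberately left out: send is a hypothetical callable that performs it, and the dictionary shape of the response (a status field with the value "ok", plus an optional message) is an assumption for the example, not the PID webservice's real format.

```python
from typing import Callable, Optional


def check_pid_response(send: Callable[[], Optional[dict]]) -> bool:
    """Run a PID webservice call and apply the three failure checks."""
    try:
        response = send()
    except ConnectionError as error:
        # Service unavailability.
        raise RuntimeError("PID webservice unavailable") from error
    if not isinstance(response, dict) or "status" not in response:
        # Invalid response.
        raise RuntimeError("invalid PID webservice response")
    if response["status"] != "ok":
        # Failure message in the response.
        raise RuntimeError(f"PID webservice failure: {response.get('message', '')}")
    return True
```

Passing the call in as a callable keeps the check independent of whichever SOAP client is used.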

3.4.2 PID bind OBJID service

The PID OBJID service binds a compound object's persistent identifier (the OBJID) to the METS and PDF dissemination services. These are exposed via conventional resolve URLs that, when invoked, produce METS document and PDF file renderings. As an added feature it points to the derivative of the first file the compound object consists of, which is useful for offering a preview image of the compound object.

The persistent identifier convention for OBJIDs is:

http://hdl.handle.net/[persistent identifier]?locatt=view:[dissemination type]

And the resolve URL convention:

http://[object repository domain]/[mets or pdf]/[persistent identifier]

view | persistent URL | resolve URL
mets | http://hdl.handle.net/12345/my-object-id | http://disseminate.objectrepository.org/mets/12345/my-object-id
master | http://hdl.handle.net/12345/my-object-id?locatt=view:master | http://disseminate.objectrepository.org/mets/12345/my-object-id
pdf | http://hdl.handle.net/12345/my-object-id?locatt=view:pdf | http://disseminate.objectrepository.org/pdf/12345/my-object-id
level 1 derivative | http://hdl.handle.net/12345/my-object-id?locatt=view:level1 | http://disseminate.objectrepository.org/file/level1/12345/1.1
level 2 derivative | http://hdl.handle.net/12345/my-object-id?locatt=view:level2 | http://disseminate.objectrepository.org/file/level2/12345/1.1
level 3 derivative | http://hdl.handle.net/12345/my-object-id?locatt=view:level3 | http://disseminate.objectrepository.org/file/level3/12345/1.1
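
A sketch of the OBJID bindings, inferred from the example rows above: the mets and master views resolve to the METS service, the pdf view to the PDF service, and the derivative views to the first file of the compound object. The function names are hypothetical, and 12345/1.1 is simply the first file's PID taken from the example.

```python
def objid_persistent_url(objid: str, view: str) -> str:
    """Build the Handle URL for an OBJID view; the mets view carries no qualifier."""
    base = f"http://hdl.handle.net/{objid}"
    return base if view == "mets" else f"{base}?locatt=view:{view}"


def objid_resolve_url(domain: str, objid: str, view: str, first_file_pid: str) -> str:
    """Build the resolve URL bound to an OBJID view."""
    if view in ("mets", "master"):
        return f"http://{domain}/mets/{objid}"
    if view == "pdf":
        return f"http://{domain}/pdf/{objid}"
    # Derivative views point at the first file of the compound object.
    return f"http://{domain}/file/{view}/{first_file_pid}"
```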

3.5 Ingest derivative services

The object repository derivative services generate, on command and if possible, three types of preview material intended for display to the consumer. These range from "light weight" presentations to "heavy" normalizations of the original master data. Derivative production may involve a simple reduction, but also the introduction of a completely new content type.

The types are:

  • level 1 derivatives: normalisations, or near-enough reproductions of the master files.
  • level 2 derivatives: medium-sized, fit-to-screen content that still gives good insight into the details of the master files.
  • level 3 derivatives: small, quick-peek, thumbnail-like material.

Both master and derivatives become part of the archival package, although only the master has the intended durable status. Derivative production is only possible after a master has been persisted and is part of the archival package.

Supplying custom derivatives in the submission package is also possible. The precise interpretation of what a level 1, 2 or 3 derivative is, is therefore at the discretion of the content producer.

Derivative services will produce a derivative when:

  • no derivative of that level exists; or
  • the processing instruction explicitly states to replace existing derivatives.

Otherwise, the action is skipped.
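
The decision rule is small enough to state as a single function. The name and parameters are hypothetical; replace_existing stands for whatever flag in the processing instruction requests replacement.

```python
from typing import Set


def should_produce(existing_levels: Set[str], level: str, replace_existing: bool) -> bool:
    """Produce a derivative if none exists at this level, or if replacement is requested."""
    return replace_existing or level not in existing_levels
```

For example, with existing_levels == {"level1"}, a level 2 derivative is produced, while a level 1 derivative is produced only when replace_existing is true.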

3.5.1 Custom derivative service

At any time the content producer can offer their own custom derivatives. These can be of any content type or file size. It is advisable to remain consistent with regard to the level 1, 2 and 3 derivative types.

For an addition or update action, this service adheres to the following requirements:

  1. Look for a custom placed derivative in the submission package
  2. if not found, return success
  3. if found, determine the derivative level
  4. calculate an MD5 checksum
  5. add the file to the archival storage
    1. compare the checksum between the submission package and the archival package
      1. when these match, add the preservation description information to the archival package.
      2. if there is a mismatch, throw an error
  6. return success

For a delete action:

  1. remove the content data objects ( the derivative file )
  2. check the removal
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  3. remove the preservation description information
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  4. return success

3.5.2 Image derivative service

The image derivative service attempts to create three levels for images and for the first page of PDF documents:
  • level 1: high print quality; standardization to pdf
  • level 2: medium screen quality; length and width reduction
  • level 3: small, thumbnail quality; length and width and resolution reduction

For an addition or update action, this service adheres to the following requirements:

  1. Obtain the master file or, if available, a suitable level 1 derivative from the archival storage
  2. Produce a derivative using ImageMagick
  3. calculate an MD5 checksum
  4. add the file to the archival storage
    1. compare the checksum between the submission package and the archival package
      1. when these match, add the preservation description information to the archival package.
      2. if there is a mismatch, throw an error
  5. return success
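
As a rough illustration of the ImageMagick step, the sketch below builds command lines for the level 2 and level 3 derivatives without running them. The pixel sizes, the 72 dpi density and the function name are assumptions; the command is spelled convert as in ImageMagick 6 (version 7 installs it as magick). Level 1 is left out because its normalisation settings are not specified here.

```python
from typing import List

# Hypothetical target sizes; ">" tells ImageMagick to shrink only, never enlarge.
LEVEL_SIZE = {"level2": "1024x1024>", "level3": "200x200>"}


def imagemagick_command(source: str, target: str, level: str) -> List[str]:
    """Build an ImageMagick command line for a level 2 or level 3 derivative."""
    if source.lower().endswith(".pdf"):
        source = f"{source}[0]"  # first page of a PDF document only
    if level == "level2":
        # Reduce length and width.
        return ["convert", source, "-resize", LEVEL_SIZE[level], target]
    if level == "level3":
        # Reduce length and width, and set a lower resolution.
        return ["convert", source, "-thumbnail", LEVEL_SIZE[level], "-density", "72", target]
    raise ValueError(f"unsupported derivative level: {level}")
```

The resulting list can be passed to subprocess.run(command, check=True), so that a non-zero exit status of ImageMagick surfaces as an error.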

For a delete action:

  1. remove the content data objects ( the derivative file )
  2. check the removal
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  3. remove the preservation description information
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  4. return success

3.5.3 Audio-video derivative service

The Audio and Video derivative service will turn the master video or audio file into a derivative.

Audio additions

For master audio, the service produces a level 1 type, high quality mp3. This service adheres to the following requirements:
  1. retrieve the master audio file
  2. use ffmpeg software to create the audio file in the desired mp3 content type
  3. add the derivative to the archival package
  4. when issues arise, throw an error
  5. otherwise return success
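
The audio conversion might be invoked along the following lines. This is a sketch under assumptions: the encoder (libmp3lame) and the quality setting (-q:a 2) are illustrative choices rather than the service's documented settings, and the run parameter exists only so that the example can be exercised without ffmpeg installed.

```python
import subprocess
from typing import Callable, List


def mp3_command(source: str, target: str) -> List[str]:
    """Build an ffmpeg command that converts audio to a high quality mp3."""
    return ["ffmpeg", "-i", source, "-vn", "-codec:a", "libmp3lame", "-q:a", "2", target]


def make_audio_derivative(source: str, target: str, run: Callable = subprocess.run) -> bool:
    """Run ffmpeg; raise an error when issues arise, otherwise return success."""
    try:
        run(mp3_command(source, target), check=True)
    except (OSError, subprocess.CalledProcessError) as error:
        raise RuntimeError(f"audio derivative failed for {source}") from error
    return True
```

With check=True, subprocess.run raises CalledProcessError on a non-zero exit status, which maps onto the "throw an error" step.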

Video additions

For video, three derivative levels are produced:

Level 1

A high quality h264 AAC (mp4) movie. For an addition or update action, this service adheres to the following requirements:
  1. retrieve the master video file
  2. use ffmpeg software to create the video file in the desired mp4 content type. Resolution, frame rate, height and width are not altered.
  3. calculate an MD5 checksum
  4. add the file to the archival storage
    1. compare the checksum between the submission package and the archival package
      1. when these match, add the preservation description information to the archival package.
      2. if there is a mismatch, throw an error
  5. return success

Level 2

A montage of 16 stills taken from across the entire length of the movie. For an addition or update action, this service adheres to the following requirements:
  1. retrieve the level 1 derivative video file or, if not available, the master.
  2. use ffmpeg software to create 16 video stills; the height is set to about 800px each.
  3. use ImageMagick to collate these images into one image/png file
  4. calculate an MD5 checksum
  5. add the file to the archival storage
    1. compare the checksum between the submission package and the archival package
      1. when these match, add the preservation description information to the archival package.
      2. if there is a mismatch, throw an error
  6. return success
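
The still extraction and montage steps could be sketched as below. How the 16 stills are spread over the movie is an assumption here (the midpoint of each of 16 equal segments), as are the function names and the 4x4 tile layout. The commands are built but not executed.

```python
from typing import List


def still_timestamps(duration: float, count: int = 16) -> List[float]:
    """Return the midpoint, in seconds, of each of `count` equal segments."""
    step = duration / count
    return [round(step * (index + 0.5), 3) for index in range(count)]


def still_command(source: str, seconds: float, target: str) -> List[str]:
    """Build an ffmpeg command that grabs one frame, scaled to 800px high."""
    return ["ffmpeg", "-ss", f"{seconds:.3f}", "-i", source,
            "-frames:v", "1", "-vf", "scale=-1:800", target]


def montage_command(stills: List[str], target: str) -> List[str]:
    """Build an ImageMagick montage command that tiles the stills 4x4."""
    return ["montage", *stills, "-tile", "4x4", "-geometry", "+0+0", target]
```

Sampling segment midpoints avoids the very first and last frames, which are often black or carry titles.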

Level 3

A small, thumbnail-quality still taken from the middle of the movie. For an addition or update action, this service adheres to the following requirements:
  1. retrieve the level 1 derivative video file or, if not available, the master.
  2. use ffmpeg software to create a single video still from an estimated "middle" of the movie.
  3. use ImageMagick to scale the image into one image/png file
  4. calculate an MD5 checksum
  5. add the file to the archival storage
    1. compare the checksum between the submission package and the archival package
      1. when these match, add the preservation description information to the archival package.
      2. if there is a mismatch, throw an error
  6. return success
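
One way to estimate the "middle" is to ask ffprobe (shipped with ffmpeg) for the duration and halve it, as in the sketch below. The function names and the 200x200 thumbnail size are assumptions, and the commands are built but not executed.

```python
from typing import List


def duration_command(source: str) -> List[str]:
    """Build an ffprobe command that prints the duration in seconds."""
    return ["ffprobe", "-v", "error", "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1", source]


def middle_of(ffprobe_output: str) -> float:
    """Halve the duration printed by ffprobe to estimate the middle."""
    return float(ffprobe_output.strip()) / 2


def middle_still_command(source: str, seconds: float, target: str) -> List[str]:
    """Build an ffmpeg command that grabs a single frame at `seconds`."""
    return ["ffmpeg", "-ss", f"{seconds:.3f}", "-i", source, "-frames:v", "1", target]


def thumbnail_command(still: str, target: str) -> List[str]:
    """Build an ImageMagick command that scales the still to thumbnail size."""
    return ["convert", still, "-thumbnail", "200x200", target]
```

The middle is only an estimate because the duration reported by the container may differ slightly from the length of the actual stream.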

Removal

For a removal action, this service adheres to the following requirements:
  1. remove the content data objects ( the derivative file )
  2. check the removal
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  3. remove the preservation description information
    1. verify the removal action. if it succeeded, continue
    2. if removal failed, throw an error
  4. return success