LB purge and export to JP

From EgeeWiki

Purge

The primary purpose of the LB purge operation (invoked by the glite-lb-purge command) is to remove aged data from the LB database. This is necessary in production to keep the database from growing without bound and to sustain reasonable server performance. The purge should therefore be invoked periodically.

The purge operation has an additional important "side effect": it dumps the purged data into a plain text file. These dumps can be archived as is or uploaded to Job Provenance.

glite-lb-purge

The purge command (glite-lb-purge) is a client, possibly remote, which communicates with the server via SSL; the administrator's credentials are therefore needed. The utility has to be provided with the following information:

  • remote bkserver
  • timeouts (how old jobs should be purged) for each job state: aborted, cleared, cancelled, other
  • optionally, a set of jobids to purge or dump

See also LBServerPurge.

LB server

The dump/purge itself is then done locally on the LB server:

  • the directories are configurable via the server option -S (--purge-prefix); it should not be confused with -D (--dump-prefix), which serves a different purpose (regular backup dumps)

In general, it is reasonable to apply different purge timeouts to different job states:

  1. Cleared: the normal terminal state; these jobs are the most obvious candidates and can be purged quite soon
  2. Aborted, Cancelled: abnormal terminal states; it makes sense to keep LB data longer for post-mortem analysis
  3. Other: jobs left hanging in other, non-terminal states should be considered too; the purge delay should be long enough, since some jobs may go on after a drop-out
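
As a rough illustration of the per-state timeouts, the sketch below assembles the timeout options from four values and prints the resulting command instead of running it. The helper name purge_args and the server address localhost:9000 are assumptions made for the example; the option names and the d/w suffixes are the ones used elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: build the per-state timeout options for glite-lb-purge.
# Arguments, in order: cleared, aborted, cancelled, other (e.g. 2d 2w 2w 60d).
purge_args() {
    printf '%s\n' "--cleared $1 --aborted $2 --cancelled $3 --other $4"
}

# Short timeout for the normal terminal state, longer ones for abnormal
# terminal states, the longest for jobs stuck in non-terminal states.
ARGS=$(purge_args 2d 2w 2w 60d)

# Only print the command; localhost:9000 is a placeholder server address.
echo "glite-lb-purge --server localhost:9000 $ARGS"
```

The values shown match the default of GLITE_LB_EXPORT_PURGE_ARGS listed under Environment configuration.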

Purging script

The glite-lb-export.sh script is a wrapper intended to be run regularly as a cron job. It calls the glite-lb-purge command, providing it with the appropriate options and environment.

Purge certification scenario

Standard configuration, only tuned with shorter purge timeouts.

Installed RPMS

  • glite-lb-server
  • glite-lb-client
  • and dependencies:
    • glite-lb-common
    • glite-security-gsoap-plugin
    • glite-lb-ws-interface
    • glite-lb-server-bones
    • glite-security-voms-api-cpp
    • ...

Certification steps

  1. standard installation (with cron purging switched on, without export to JP PS)
  2. set shorter purge timeouts in the environment, for example
    export DELAY=60; export GLITE_LB_EXPORT_PURGE_ARGS="--cleared $((DELAY))s --aborted $((2*DELAY))s --cancelled $((3*DELAY))s --other $((4*DELAY))s"
  3. log a batch of jobs of all types (aborted, cleared, cancelled, other, e.g. waiting) ==> 4 jobid lists:
    • jobids1[aborted]
    • jobids1[cleared]
    • jobids1[cancelled]
    • jobids1[other]
  4. check that the jobs vanish after DELAY, 2*DELAY, 3*DELAY and 4*DELAY respectively
  5. check that the temporary directories are empty (by default purge/ and dump/ in $GLITE_LOCATION_VAR; jpreg/, jpdump/ and lbexport/ aren't used here)
  6. repeat several times from step 2
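
The directory check in step 5 could be scripted roughly as follows. This is only a sketch: the helper name nonempty_dirs is made up, and a scratch directory stands in for $GLITE_LOCATION_VAR so that the example can be run anywhere.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the temporary directories under "$1"
# that still contain entries. purge/ and dump/ are the directories from step 5.
nonempty_dirs() {
    base=$1
    for d in purge dump; do
        # ls -A lists everything except . and ..; empty output means empty directory
        if [ -d "$base/$d" ] && [ -n "$(ls -A "$base/$d")" ]; then
            echo "$d"
        fi
    done
}

# Demonstration on a scratch directory standing in for $GLITE_LOCATION_VAR.
scratch=$(mktemp -d)
mkdir "$scratch/purge" "$scratch/dump"
touch "$scratch/purge/leftover"      # simulate a dump file that was not processed
LEFT=$(nonempty_dirs "$scratch")
echo "non-empty: $LEFT"
rm -rf "$scratch"
```

On a real server the call would be nonempty_dirs "$GLITE_LOCATION_VAR"; empty output means the check passed.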

Thorough purge test

Described here for completeness. The test is destructive and needs a dedicated database. It is implemented in org.glite.lb.client/examples/purge_test.

  1. cron purging must be switched off
  2. purge the LB database
  3. log a batch of jobs of all types (aborted, cleared, cancelled, other, e.g. waiting) ==> 4 jobid lists:
    • jobids1[aborted]
    • jobids1[cleared]
    • jobids1[cancelled]
    • jobids1[other]
  4. sleep DELAY
  5. log another batch of jobs of all types ==> 4 jobid lists:
    • jobids2[aborted]
    • jobids2[cleared]
    • jobids2[cancelled]
    • jobids2[other]
  6. run a dry purge with a DELAY/2 timeout separately for each job type, something like:
    glite-lb-purge --server $server --dry-run --return-list --${type}=${half}s | grep '^https://'
    • returned lists must be jobids1[*]
    • glite-lb-user_jobs returns all jobs
  7. run a dry purge with timeout 0 separately for each job type, something like:
    glite-lb-purge --server $server --dry-run --return-list --${type}=0s | grep '^https://'
    • returned lists must be the union of jobids1[*] and jobids2[*]
  8. without timeout arguments nothing is purged:
    glite-lb-purge --server $server --dry-run --return-list | grep '^https://'
    • returns nothing
    • glite-lb-user_jobs returns all jobs
  9. purge the first batch with a DELAY/2 timeout:
    glite-lb-purge --server $server --server-dump --aborted=${half}s --cleared=${half}s --cancelled=${half}s --other=${half}s | grep '^Server dump:'
    • returns jobids1[*]
    • glite-lb-user_jobs returns only jobs in jobids2[*]
  10. purge the rest:
    glite-lb-purge --server $server --server-dump --aborted=0 --cleared=0 --cancelled=0 --other=0 | grep '^Server dump:'
    • returns jobids2[*]
    • glite-lb-user_jobs returns nothing
  11. nothing should be left:
    glite-lb-purge --server $server --return-list --dry-run --aborted=0 --cleared=0 --cancelled=0 --other=0 | grep '^https://'
    • returns no jobs
  12. repeat several times from step 3
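
Several of the steps above compare a returned jobid list with an expected one. A minimal sketch of such a comparison is shown below; the helper same_jobids and the jobids themselves are made up, and the real purge_test may do this differently.

```shell
#!/bin/sh
# Hypothetical helper: succeed when two files hold the same set of jobids,
# regardless of order or duplicates.
same_jobids() {
    a=$(sort -u "$1")
    b=$(sort -u "$2")
    [ "$a" = "$b" ]
}

# Demonstration with made-up jobids. In the real test the "returned" file
# would hold the output of glite-lb-purge ... | grep '^https://'.
tmp=$(mktemp -d)
printf '%s\n' 'https://lb.example.org:9000/aaa' 'https://lb.example.org:9000/bbb' > "$tmp/expected"
printf '%s\n' 'https://lb.example.org:9000/bbb' 'https://lb.example.org:9000/aaa' > "$tmp/returned"
printf '%s\n' 'https://lb.example.org:9000/aaa' > "$tmp/partial"

if same_jobids "$tmp/expected" "$tmp/returned"; then MATCH=yes; else MATCH=no; fi
if same_jobids "$tmp/expected" "$tmp/partial"; then PARTIAL=yes; else PARTIAL=no; fi
echo "full list matches: $MATCH, partial list matches: $PARTIAL"
rm -rf "$tmp"
```

Sorting both sides first keeps the check independent of the order in which the server returns the jobs.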

Export to the Job Provenance

TODO: needs a bit of polishing; dependency on jp-client (switch glite-jp-importer on/off via YAIM config too...)

Transfer of information

LB sends information to JP when:

  • a job is registered -- the job is also registered with JP
  • a job is purged -- the complete LB data is uploaded (as the dump file)

This information is stored in two instances of persistent 'mail directories', each containing the following subdirectories:

  • tmp/: physical storage of all messages; the other subdirectories contain only links pointing here
  • new/: new messages
  • work/: currently handled messages (running "transaction")
  • post/: postponed messages (after recoverable failure)
  • undeliverable/: data (after unrecoverable failure)
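
To make the layout more concrete, the sketch below creates such a directory and stores one message in it. It is illustrative only and not the LB/JP implementation: the helper names are made up, and the use of symbolic links is an assumption (the description above only says "links").

```shell
#!/bin/sh
# Illustrative only: lay out a maildir-like directory with the
# subdirectories listed above.
maildir_init() {
    for sub in tmp new work post undeliverable; do
        mkdir -p "$1/$sub"
    done
}

# Store the message body in tmp/ and expose it as "new" through a link.
# Symbolic links are an assumption made for this sketch.
maildir_put() {
    dir=$1
    name=$2
    body=$3
    printf '%s\n' "$body" > "$dir/tmp/$name"
    ln -s "../tmp/$name" "$dir/new/$name"
}

# Demonstration in a scratch location; "jpreg" mirrors the default
# directory name of the job registration maildir.
md=$(mktemp -d)/jpreg
maildir_init "$md"
maildir_put "$md" msg1 "example registration"
NEW_COUNT=$(ls "$md/new" | wc -l | tr -d ' ')
BODY=$(cat "$md/new/msg1")
echo "new messages: $NEW_COUNT"
rm -rf "$(dirname "$md")"
```

Moving a message between states then only means moving the link between new/, work/, post/ and undeliverable/, while the data stays in tmp/.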

Involved components

LB server

The LB server passes on incoming job registrations and responds to purge requests.

The required configuration information is:

  • maildir for the new job registrations
  • purge directory (see Purge above)

glite-lb-export.sh script

When export to JP is enabled, the 'glite-lb-export.sh' script does the following (besides calling the purge operation of the server):

  • breaks up the server-generated raw dumps into one file per job
  • stores these files in the "jobs spool" directory
  • records information about the broken-up dumps in the "dump maildir"

Required configuration information:

  • the directory with raw dumps (= purge directory of the LB server), the jobs spool directory, and the dump maildir directory

JP importer daemon

The next step is to read all this local data on the server and import it to the Job Provenance Primary Storage (JP PS). This is done by the JP importer daemon:

  • it is started by a separate startup script 'glite-jp-importer' (from the glite-jp-client RPM)
  • it forks several worker processes:
    • reading job registrations from the maildir and importing them to JP PS
    • reading job dumps from the maildir and the directory with exported jobs ("jobsdir") and importing them to JP PS
    • reading and handling sandboxes (not used currently)
  • information needed by the JP importer daemon:
    • "job registration" maildir filled by bkserver
    • "dump" maildir filled by 'glite-lb-dump-exporter'
    • helper "sandbox" maildir (not used currently)
    • JP PS server
    • (the "jobsdir" with dumps filled by lb-dump-exporter is used as well; its location is known from the information in the "dump" maildir)
    • (the "sandbox" maildir is prepared for input and output sandboxes; import to JP works, but getting the sandboxes is not currently implemented)
  • YAIM configuration should be handled together with the LB server and the cron purger, preferably in the LB YAIM module
  • the daemon needs to run when GLITE_LB_EXPORT_ENABLED is enabled

JP export certification scenario

Additional configuration:

export GLITE_LB_EXPORT_JPPS=https://pelargir.ics.muni.cz:8901
export GLITE_LB_EXPORT_ENABLED=true

Additional needed service:

$GLITE_LOCATION/etc/init.d/glite-jp-importer start

Log a batch of jobs.

Almost immediately (after the JP importer maildir poll period) the following should work:

$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cleared http://egee.cesnet.cz/en/Schema/JP/System:regtime
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_aborted http://egee.cesnet.cz/en/Schema/JP/System:regtime
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cancelled http://egee.cesnet.cz/en/Schema/JP/System:regtime
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_waiting http://egee.cesnet.cz/en/Schema/JP/System:regtime

Wait for the purge; then the following should work:

$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cleared http://egee.cesnet.cz/en/Schema/LB/Attributes:user
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_aborted http://egee.cesnet.cz/en/Schema/LB/Attributes:user
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cancelled http://egee.cesnet.cz/en/Schema/LB/Attributes:user
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_waiting http://egee.cesnet.cz/en/Schema/LB/Attributes:user

$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cleared http://egee.cesnet.cz/en/Schema/LB/Attributes:jobId
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_aborted http://egee.cesnet.cz/en/Schema/LB/Attributes:jobId
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_cancelled http://egee.cesnet.cz/en/Schema/LB/Attributes:jobId
$GLITE_LOCATION/examples/glite-jp-primary-test -s $GLITE_LB_EXPORT_JPPS GetJobAttr $jobid_waiting http://egee.cesnet.cz/en/Schema/LB/Attributes:jobId

When successfully finished, all temporary directories should be empty.
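
The repetitive queries above can also be generated in a loop. The sketch below only prints the commands; the jobids, the /opt/glite prefix and the https://localhost:8901 address are placeholders used when the corresponding variables are not set, and the helper name jp_queries is made up.

```shell
#!/bin/sh
# Placeholders: /opt/glite and https://localhost:8901 are assumptions, used
# only when GLITE_LOCATION or GLITE_LB_EXPORT_JPPS are not set.
JP_TEST="${GLITE_LOCATION:-/opt/glite}/examples/glite-jp-primary-test"
JPPS="${GLITE_LB_EXPORT_JPPS:-https://localhost:8901}"
SCHEMA="http://egee.cesnet.cz/en/Schema"

# Made-up jobids standing in for $jobid_cleared, $jobid_aborted, ...
JOBIDS="https://lb.example.org:9000/cleared1 https://lb.example.org:9000/aborted1 https://lb.example.org:9000/cancelled1 https://lb.example.org:9000/waiting1"

# Print (not run) one GetJobAttr query per jobid for the given attribute.
jp_queries() {
    attr=$1
    for jobid in $JOBIDS; do
        echo "$JP_TEST -s $JPPS GetJobAttr $jobid $attr"
    done
}

# Expected to work shortly after registration.
REG_QUERIES=$(jp_queries "$SCHEMA/JP/System:regtime")
# Expected to work only after the purge and the import of the dump.
LB_QUERIES=$(jp_queries "$SCHEMA/LB/Attributes:user"; jp_queries "$SCHEMA/LB/Attributes:jobId")

printf '%s\n' "$REG_QUERIES" "$LB_QUERIES"
```

Piping the printed lines to sh would execute them; printing first makes it easy to review what will be run.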

Environment configuration

| Variable | Default value | Description |
|---|---|---|
| GLITE_LOCATION | script location | glite prefix |
| GLITE_HOST_CERT | /etc/grid-security/hostcert.pem | Host certificate. |
| GLITE_HOST_KEY | /etc/grid-security/hostkey.pem | Host key. |
| GLITE_LB_PURGE_ENABLED | true | When false, all purging and further processing is disabled. |
| GLITE_LB_EXPORT_ENABLED | false | When false, further processing to JP is disabled. |
| GLITE_LB_EXPORT_PURGE_ARGS | --cleared 2d --aborted 2w --cancelled 2w --other 60d | Arguments for the purge utility. The main goal is to configure the timeouts for jobs in the given states. |
| GLITE_LB_EXPORT_PURGEDIR | $GLITE_LOCATION_VAR/purge | Spool directory into which LB job records are exported by the glite-lb-purge utility. glite-lb-exporter.sh reads this spool directory regularly and performs the further processing of the LB dumps. |
| GLITE_LB_EXPORT_PURGEDIR_KEEP | - | If specified, handled dumps are kept in $GLITE_LB_EXPORT_PURGEDIR_KEEP (may be used for backup purposes). |
| GLITE_LB_EXPORT_JPDUMP_MAILDIR, GLITE_LB_EXPORT_JOBSDIR | $GLITE_LOCATION_VAR/jpdump, $GLITE_LOCATION_VAR/lbexport | LB-exporter processes the LB dumps in $GLITE_LB_EXPORT_PURGEDIR (they are in per-job form) and passes them on to the JP-importer using the spool directory $GLITE_LB_EXPORT_JPDUMP_MAILDIR and the temporary storage $GLITE_LB_EXPORT_JOBSDIR. It can keep the job files for further use. |
| GLITE_LB_EXPORT_JPREG_MAILDIR | $GLITE_LOCATION_VAR/jpreg | When a new job arrives at the LB server, the server stores its registration in this spool directory. The JP-importer process is responsible for handling these registrations. |
| GLITE_LB_EXPORT_BKSERVER | localhost | BKserver host to purge. |
| GLITE_LB_SERVER_PORT | 9000 | BKserver port to purge. |
| GLITE_LB_EXPORT_JPPS | empty (will use localhost:8901) | Target JP PS. |
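
A wrapper script could apply these defaults only where the administrator has not set a value. The fragment below is a sketch of that idea and not a copy of glite-lb-export.sh; the /var/glite value for GLITE_LOCATION_VAR is a placeholder, since no default for it is listed here.

```shell
#!/bin/sh
# Apply the documented defaults only where a variable is unset or empty.
# /var/glite is a placeholder; no default for GLITE_LOCATION_VAR is given above.
: "${GLITE_LOCATION_VAR:=/var/glite}"
: "${GLITE_HOST_CERT:=/etc/grid-security/hostcert.pem}"
: "${GLITE_HOST_KEY:=/etc/grid-security/hostkey.pem}"
: "${GLITE_LB_PURGE_ENABLED:=true}"
: "${GLITE_LB_EXPORT_ENABLED:=false}"
: "${GLITE_LB_EXPORT_PURGE_ARGS:=--cleared 2d --aborted 2w --cancelled 2w --other 60d}"
: "${GLITE_LB_EXPORT_PURGEDIR:=$GLITE_LOCATION_VAR/purge}"
: "${GLITE_LB_EXPORT_JPDUMP_MAILDIR:=$GLITE_LOCATION_VAR/jpdump}"
: "${GLITE_LB_EXPORT_JOBSDIR:=$GLITE_LOCATION_VAR/lbexport}"
: "${GLITE_LB_EXPORT_JPREG_MAILDIR:=$GLITE_LOCATION_VAR/jpreg}"
: "${GLITE_LB_EXPORT_BKSERVER:=localhost}"
: "${GLITE_LB_SERVER_PORT:=9000}"

# Show the effective settings that matter for the purge step.
echo "purge enabled: $GLITE_LB_PURGE_ENABLED"
echo "export enabled: $GLITE_LB_EXPORT_ENABLED"
echo "server: $GLITE_LB_EXPORT_BKSERVER:$GLITE_LB_SERVER_PORT"
echo "purge dir: $GLITE_LB_EXPORT_PURGEDIR"
```

The ${VAR:=default} form leaves any value already exported in the environment untouched, so settings made by the administrator take precedence.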

Recommended YAIM options

The administrator should be able to configure:

  • GLITE_LB_PURGE_ENABLED
  • GLITE_LB_EXPORT_PURGE_ARGS

In addition, the following should be settable when JP export is considered:

  • GLITE_LB_EXPORT_ENABLED
  • GLITE_LB_EXPORT_JPPS
  • optionally directory locations or GLITE_LOCATION_VAR variable used in LB=>JP machinery
  • optionally backup directories for purges (but better way is probably using export to JP PS)

The rest should always be OK with the default values.