Audit Database
From BeSTGRID
Data provided by enabling globus audit is already used in Grid Operations Center.
But having this data locally gives us the ability to perform various queries to analyse performance of the cluster, create usage reports and identify potential problems. Of course, it would be even better to have this functionality in status.arcs directly...
Contents |
[edit] Collecting data from globus
original auditquery script deletes data form gram_audit_table after sending it to status.arcs Just retaining this data is not good because the size of data sent is only going to increase. Therefore we create backup_gram_audit_table and modify auditquery to move data into it before deleting:
CREATE TABLE `backup_gram_audit_table` ( `job_grid_id` text NOT NULL, `local_job_id` text, `subject_name` text NOT NULL, `username` varchar(16) NOT NULL default , `idempotence_id` varchar(128) default NULL, `creation_time` varchar(40) NOT NULL default , `queued_time` varchar(40) default NULL, `stage_in_grid_id` text, `stage_out_grid_id` text, `clean_up_grid_id` text, `globus_toolkit_version` varchar(16) NOT NULL default , `resource_manager_type` varchar(16) NOT NULL default , `job_description` text NOT NULL, `success_flag` varchar(5) NOT NULL default , `finished_flag` varchar(5) NOT NULL default , PRIMARY KEY (`job_grid_id`(256)), KEY `idxLocalJobId` (`local_job_id`(50)) )
example auditquery (because mysql audit database is also password-protected, I removed read permission from this script).
#!/bin/sh
# auditquery Queries the VDT Globus-WS Audit database and emails
# Job-Id/DN records to a monitoring address. Records
# are also logged via syslog, then dropped from the
# database. Intended for invocation at hourly intervals.
# Graham Jenkins <grahjenk@vpac.org> Last changed: 20061220
#
# Specify PATH, initiallise scratch-file
PATH=/usr/bin:/bin:/usr/sbin:/usr/sbin:/sbin:$PATH
. /etc/profile >/dev/null 2>&1
File=/tmp/`basename $0`.$$
trap 'rm -f $File; exit 0' 0 1 2 3 14 15
echo "use auditDatabase;" >$File
#
# Read and log records, create delete statements and send email
mysql -u grid-mysql --port 3306 -h mysql-bg.ceres.auckland.ac.nz -p***** <<EOF 2>/dev/null | sed -n '2,$p' |
use ng2_auditDatabase;
select concat_ws(' ',local_job_id, subject_name)
from gram_audit_table;
EOF
while read Line; do
logger -t Job-DN "$Line"
echo "Job-DN: $Line"
JobId="`echo $Line | awk '{print $1}'`"
echo "insert into backup_gram_audit_table select * from gram_audit_table where local_job_id = '$JobId';" >> $File
echo "delete from gram_audit_table " >>$File
echo "where local_job_id = '$JobId';">>$File
#done | /bin/mail -s "`hostname` JobID `date +%Y%m%d`" a.kharuk@auckland.ac.nz >/dev/null 2>&1
done | /bin/mail -c a.kharuk@auckland.ac.nz -s "`hostname` JobID `date +%Y%m%d`" grid_pulse@lists.arcs.org.au >/dev/null 2>&1
#
# Perform deletions
( cat $File
echo "delete from gram_audit_table "
echo "where local_job_id is NULL;" ) | mysql -u grid-mysql --port 3306 -h mysql-bg.ceres.auckland.ac.nz -p***** ng2_auditDatabase >/dev/null 2>&1
exit 0
[edit] Collecting data from torque
This section is specific to torque resource manager. Resources with other RMS will need different scripts.
We need some extra tables:
CREATE TABLE `pbs` ( `local_job_id` text NOT NULL, `queued_time` varchar(40) NOT NULL default , `start_time` varchar(40) NOT NULL default , `end_time` varchar(40) NOT NULL default , `cpu_count` int(11) default NULL, `queue` text, PRIMARY KEY (`local_job_id`(256)), KEY `idxLocalJobId` (`local_job_id`(50)) );
CREATE TABLE `exit_status` ( `local_job_id` varchar(256) NOT NULL, `exit_status` int(11) NOT NULL, PRIMARY KEY (`local_job_id`) );
This is sample python script to convert torque logs into SQL. It can be used as
cat /opt/torque/server_priv/accounting/*|python torque2mysql.py |mysql -h mysql-bg.ceres.auckland.ac.nz --port=3306 -u grid-mysql -p**** ng2_auditDatabase
to populate our tables. Duplicate entries are ignored so it is ok to run with old data. A bit slow with all logs, so may be worthwhile to run only with latest logs.
[edit] Example Queries
[edit] Usage for individual users over the last N days (example for 7 days)
select subject_name,
count(*) as count,
sum((unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y"))-
unix_timestamp(str_to_date(pbs.start_time,"%a %b %d %H:%i:%s %Y")))* pbs.cpu_count) / 60.0 / 60.0 as dif
from backup_gram_audit_table,pbs
where pbs.local_job_id = backup_gram_audit_table.local_job_id and
backup_gram_audit_table.subject_name LIKE '%' and
str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y") > subdate(now(), interval 7 day)
group by subject_name order by dif;
| subject_name | count| dif | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Yuriy Halytskyy | 10 | 0.00000000 | | /C=AU/O=APACGrid/OU=VPAC/CN=Markus Binsteiner | 54 | 0.00027778 | | /DC=au/DC=org/DC=arcs/DC=slcs/O=ARCS IdP/CN=Antoine Fouquet cpTxacQy-chYzfVDjjuAEWLyriM | 1 | 65.70916667 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Peter Shao-Wei Tsai | 10 | 114.40944444 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Jack Flanagan | 11 | 152.33694444 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Peter Wills | 2 | 161.89111111 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Sidney Markowitz | 331 | 205.25861111 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Steven Wu | 115 | 287.16194444 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Asad Ali | 69 | 451.54666667 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Brian Browning | 8 | 497.19055556 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Joseph Heled | 1600 | 534.40166667 | | /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Sharon Browning | 537 | 1473.76027778 |
[edit] Log last jobs from individual user
Example over the last 7 days
select substr(subject_name, instr(subject_name,'CN=') + 3) as name,
pbs.local_job_id,
idempotence_id,
pbs.queued_time as queued,
(unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y")) -
unix_timestamp(str_to_date(pbs.queued_time,"%a %b %d %H:%i:%s %Y"))) / 60.0 / 60.0 as total,
(unix_timestamp(str_to_date(pbs.end_time,"%a %b %d %H:%i:%s %Y")) -
unix_timestamp(str_to_date(pbs.start_time,"%a %b %d %H:%i:%s %Y"))) / 60.0 / 60.0 as run,
exit_status
from backup_gram_audit_table,pbs,exit_status
where subject_name LIKE '%Brian Browning%' and
str_to_date(pbs.queued_time,"%a %b %d %H:%i:%s %Y") >
subdate(now(), interval 7 day) and
backup_gram_audit_table.local_job_id LIKE '%.hpc-bestgrid.auckland.ac.nz%' and
pbs.local_job_id = backup_gram_audit_table.local_job_id and
pbs.local_job_id = exit_status.local_job_id
order by str_to_date(backup_gram_audit_table.queued_time,"%Y-%m-%d %H:%i:%s %Y");
| name | local_job_id | idempotence_id | queued | total | run | exit_status | | Brian Browning | 101526.hpc-bestgrid.auckland.ac.nz | a4d427fa-7b04-11de-90b4-e2e160fc8bd9 | Tue Jul 28 11:24:37 2009 | 24.01000000 | 18.49138889 | 271 | | Brian Browning | 101527.hpc-bestgrid.auckland.ac.nz | a8b212ba-7b04-11de-a4f3-e2e160fc8bd9 | Tue Jul 28 11:24:43 2009 | 24.01027778 | 18.46694444 | 271 | | Brian Browning | 102814.hpc-bestgrid.auckland.ac.nz | d928de24-7cad-11de-bc0f-e2e160fc8bd9 | Thu Jul 30 14:08:21 2009 | 24.00166667 | 24.00138889 | 271 | | Brian Browning | 102815.hpc-bestgrid.auckland.ac.nz | dd9200da-7cad-11de-a111-e2e160fc8bd9 | Thu Jul 30 14:08:29 2009 | 24.01888889 | 24.01861111 | 271 | | Brian Browning | 102816.hpc-bestgrid.auckland.ac.nz | df649f58-7cad-11de-9848-e2e160fc8bd9 | Thu Jul 30 14:08:31 2009 | 24.02000000 | 24.01972222 | 271 | | Brian Browning | 103973.hpc-bestgrid.auckland.ac.nz | eb43ec7e-7e39-11de-8548-e2e160fc8bd9 | Sat Aug 1 13:23:32 2009 | 24.07500000 | 0.00138889 | 271 |
