Date: 2014 November 10 1-2PM.
Attendees: G. Oakham, D. Rogers, P. Kalyniak, R. Thomson, T. Gregorie,
W. Hong, M. Hu, S. Wang
1. Theory Group Compute Server Purchase
We reviewed the quotes from Dell; the best option was the Dell 1U R630 server, configured with the
following hardware: CPU: dual Intel Xeon E5-2680, 12 cores per CPU (24 cores in total); RAM: 32 GB;
HD: 250 GB. Per core, this server has about five times the floating-point performance of the Intel
E5450, the fastest processor in the nodes of the current cluster, and it represents the best value in
terms of performance per dollar per core.
Rowan inquired about the possibility of matching the current cluster specification of 2 GB of RAM per
core; for 24 cores, the configuration would need to be increased to 48 GB of RAM. Wade estimated that
the additional 16 GB of RAM would cost in the neighbourhood of $400 based on the list price. The cost
of the server as quoted is estimated at $6400 with taxes.
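As a quick check of the arithmetic above (a sketch only; the dollar amounts are the estimates quoted
in the meeting, not vendor pricing):

    # Matching the cluster's 2 GB of RAM per core on the 24-core R630
    cores = 24
    ram_per_core_gb = 2
    ram_needed_gb = cores * ram_per_core_gb      # 24 * 2 = 48 GB total
    ram_upgrade_gb = ram_needed_gb - 32          # 16 GB beyond the quoted 32 GB
    est_upgrade_cost = 400                       # Wade's list-price estimate for the extra 16 GB
    print(f"{ram_needed_gb} GB needed, +{ram_upgrade_gb} GB, ~${est_upgrade_cost}")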
Pat, on behalf of the Theory group, indicated that there are sufficient cumulative funds for the
initially quoted purchase. There was some discussion about increasing the memory to 48 GB; Dave
volunteered to cover the additional cost.
Wade is still waiting for a competitive quote from another vendor (Cisco), which was expected this
week; any purchase over $5000 requires a second competitive quote. The final cost may end up lower
than the original quote from Dell.
2. IT Issues Discussion
Stephen's move to a smaller office afforded the opportunity to clean out some antiquated hardware
and software. Among the stored material were two large boxes of DLT tapes that were used for
backups and to transport data for the SNO experiment. Current backups are disk-to-disk, and the
tapes have not been used for more than five years. Michael asked the Computer Committee for
permission to dispose of these tapes. It was agreed that they should be discarded, provided that the
tapes are either destroyed or demagnetized.
Gerald was satisfied with the removal of the pattern filter from his pine settings, as this sped up
the launching of pine. Dave inquired about training the mail server's spam filter to detect repetitive
spam messages. Wade indicated that there is a solution for this: previously, spam messages were
forwarded to Bill, who would add them to a spam folder used to train the spam filter. Wade's own
spam folder is also used to train SpamAssassin. Michael inquired about how this works and will
assume this responsibility.
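For reference, a minimal sketch of how such training could be run, assuming SpamAssassin's sa-learn
tool; the folder path and scheduling are hypothetical and would need to match the mail server's
actual setup:

    # Feed messages that users filed as spam into SpamAssassin's Bayes database.
    # The Maildir path below is hypothetical; run periodically (e.g. from cron).
    import subprocess

    SPAM_TRAINING_DIR = "/var/mail/spamtraining/cur"   # hypothetical shared spam folder

    # sa-learn --spam marks every message in the directory as spam for the Bayes filter
    subprocess.run(["sa-learn", "--spam", SPAM_TRAINING_DIR], check=True)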
Regarding potential sources to augment the research compute cluster, Dave inquired with his
collaborators at the Ottawa Hospital. There was general interest, but not at this time.
Over the past weekend there was a power disruption (it appeared to be a loss of one phase of the line
power). Most of the servers went down and did not come back up in an orderly fashion, so many
systems had to be rebooted in sequence to ensure that the services required to bring each system up
were available. The power disruption also resulted in the loss of a Cisco switch connecting the
management nodes to the compute cluster, a Sun server chassis (ods), and some hard drives. The ods
chassis was swapped with another node not currently in use, and the failed Cisco switch was replaced
with another old Cisco switch. All services were restored by early Sunday afternoon.
3. CMO
Gerald asked for evergreening proposals for the CMO funds for the current academic cycle. There was
no opportunity to discuss this before the meeting, so it will be an action item for the next meeting.
4. State of Current Research Computing Infrastructure
Rowan noted that in the previous meeting there was a discussion about an evergreening proposal for
the research computing cluster. After some discussion, it was proposed that the purchase of the
Theory compute server could serve as a model for evergreening the compute cluster. With 24 cores,
this server represents the equivalent of about 20% of the compute capacity of the current cluster, so
in principle, over five years at roughly $6K per server, the current compute capacity could be
completely replaced. Given current computing demands, however, this is likely not sufficient.
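A rough version of that arithmetic (a sketch only, using the 20% and ~$6400 figures above and
assuming one such purchase per year):

    # Evergreening arithmetic for replacing the cluster with Theory-style servers
    fraction_per_server = 0.20        # one 24-core server ~ 20% of current capacity
    cost_per_server = 6400            # estimated cost with taxes

    servers_needed = 1 / fraction_per_server          # 5 servers to match current capacity
    years_to_replace = servers_needed                 # at one purchase per year
    total_cost = servers_needed * cost_per_server     # ~$32,000 over ~5 years
    print(f"{servers_needed:.0f} servers over {years_to_replace:.0f} years, ~${total_cost:,.0f}")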
Another point raised was that the research computing infrastructure includes storage, which is also
ageing. Gerald inquired about the frequency of disk failures; Stephen indicated that there have been
two disk failures since this May and that we have two spare disks for hot replacement. Wade related
another recent disk failure at HPCVL. It was agreed that, in the long run, we should start to look at
solutions to replace or upgrade our disk storage.
The storage server capacity was reviewed: cisk and tusker each have 48 disks of 2 TB, and ngoma has
48 disks of 1 TB, with current disk space consumption at about 60%.
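A quick tally of the figures just quoted (a sketch; raw disk capacity only, and the 60% is applied to
the raw total here for illustration, without accounting for RAID or filesystem overhead):

    # Raw storage capacity of the three servers reviewed
    servers_tb = {
        "cisk":   48 * 2,   # 48 disks x 2 TB = 96 TB
        "tusker": 48 * 2,   # 48 disks x 2 TB = 96 TB
        "ngoma":  48 * 1,   # 48 disks x 1 TB = 48 TB
    }
    total_raw_tb = sum(servers_tb.values())        # 240 TB raw
    used_tb = 0.60 * total_raw_tb                  # ~144 TB at ~60% consumption
    print(f"{total_raw_tb} TB raw, ~{used_tb:.0f} TB in use")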
Gerald proposed evergreening the storage by replacing the current disks with larger hard disks, which
is the current practice. Wade, however, indicated that the disk storage chassis are ageing as well and
will need to be replaced in time. As a possible chassis replacement, Wade proposed looking at open
hardware, such as Backblaze's open specifications for their disk servers, as a lower-cost alternative.
Gerald indicated that this should also be addressed in the evergreening plan for research computing.
This will be discussed further at the next meeting.
5. Queuing Discipline
In the last meeting, Wade proposed returning to the resource-owner-based priority queuing that was
implemented previously. However, this does not address an issue raised by Dave regarding the ability
to accommodate quick, short jobs from users who are not resource owners. The discussion continued
regarding the priority the Theory group would have on their proposed server once it is integrated into
the general compute cluster. Rowan brought up the previous practice within a research group of
alerting other users when a major compute request was about to be made on the available resources,
and asked how this could be extended to all of the research groups. There was some discussion about
the best forum for this; Thomas suggested that a mailing list would perhaps be simplest. Another
issue raised was who would arbitrate contending requests; the notion of a RAC (resource allocation
committee) was discussed briefly.
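As a purely conceptual illustration of the two policies being weighed (no particular scheduler is
assumed, and the thresholds and weights are invented for the sketch):

    # Conceptual priority rule: owners keep priority on the resources they funded,
    # while short jobs from non-owners are still admitted quickly.
    def job_priority(is_resource_owner: bool, requested_hours: float) -> int:
        """Return a relative priority; larger values run sooner (illustrative only)."""
        if is_resource_owner:
            return 100          # owner-based priority queuing, as implemented before
        if requested_hours <= 1.0:
            return 50           # quick short jobs from non-owners get a boost
        return 10               # all other jobs wait at base priority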
As time was expiring, it was agreed as an action item that a proposal would be prepared for the next
meeting.
6. Proposed Agenda for next meeting on December 10th 2:30-4:30PM
o Review minutes from the previous meeting
o Approval of the minutes
o Evergreening of Research Computing Cluster Proposal
o Queuing Discipline Proposal
o Departmental IT Infrastructure Changes Proposal
o AOB
7. A call for agenda items will be made a week or two in advance of the next meeting