Computer Meeting Minutes - November 10, 2014

Date: 2014 November 10 1-2PM.

Attendees: G. Oakham, D. Rogers, P. Kalyniak, R. Thomson, T. Gregorie,
                  W. Hong, M. Hu, S. Wang

1. Theory Group Compute Server Purchase

    We reviewed the quotes from Dell; the best option was the Dell 1U R630 server. It was configured 
    with the following hardware: dual Intel Xeon E5-2680 CPUs (12 cores per CPU, 24 cores in total), 
    32 GB of RAM, and a 250 GB hard drive. Per core, this server has about five times the floating-point 
    performance of the Intel E5450, the fastest processor among the nodes of the current cluster. This 
    represents the best value in terms of performance per dollar per core.

    Rowan inquired about the possibility of matching the current specification of 2 GB of RAM per core. 
    For this, the configuration would need to be updated to 48 GB of RAM. Wade estimated that the 
    additional 16 GB of RAM would cost in the neighbourhood of $400 based on the list price. The cost 
    of the server is estimated at $6400 with taxes.

    Pat, on behalf of the Theory group, indicated that there are sufficient cumulative funds for the 
    initial quoted purchase. There was some discussion about increasing the memory to 48 GB; Dave 
    volunteered to cover the additional cost.

    Wade is still waiting for a competitive quote from another vendor (Cisco), which was expected this 
    week. Any purchase over $5000 requires a second competitive quote. It may be that the final cost of 
    the purchase is less than the original quote from Dell.

2. IT Issues Discussion

    With Stephen moving to a smaller office, this afforded the opportunity to clean out some antiquated 
    hardware and software. Among the stored material were two large boxes of DLT tapes, which were used 
    for backups and to transport data for the SNO experiment. Current backups are disk to disk, and the 
    tapes have not been used in more than five years. Michael asked the Computer Committee for 
    permission to dispose of these tapes. It was agreed that they should be discarded, provided that 
    the tapes are either destroyed or demagnetized.

    Gerald was satisfied with the removal of the pine pattern filter from his pine settings, as this sped up 
    the launching of pine. Dave inquired about training the mail server spam filter to detect repetitive 
    spam messages. Wade indicated that there is a solution for this: previously, a spam message was 
    forwarded to Bill, who would add it to a spam folder used to train the spam filter. Wade's own spam 
    filter is also used to train SpamAssassin. Michael inquired about how this works and will assume 
    this responsibility.

    With regards to potential sources to augment the research compute cluster, Dave inquired with his 
    collaborators at the Ottawa Hospital. There was overall interest, but not at this time.

    Over the past weekend there was a power disruption (it appeared to be a loss of one phase of the line 
    power). Most of the servers went down and did not come back up in an orderly fashion, so many 
    systems had to be rebooted in sequence to ensure that the services required to bring each system up 
    were available. The power disruption resulted in the loss of a Cisco switch connecting management 
    nodes to the compute cluster, a Sun server chassis (ods), and some hard drives. The ods chassis was 
    swapped with another node not currently in use, and the Cisco switch was replaced with another old 
    Cisco switch. All services were restored by early Sunday afternoon.

3. CMO

    Gerald asked for evergreening proposals for the CMO funds for the current academic cycle. There 
    was not an opportunity to discuss this prior to the meeting, so it will be an action item for the 
    next meeting.

4. State of Current Research Computing Infrastructure

    Rowan raised the point that in the previous meeting there was a discussion about an evergreening 
    proposal for the research computing cluster. After some discussion, it was proposed that the 
    purchase of the Theory compute server could serve as a model for evergreening the compute cluster. 
    With its 24 cores, one such server represents the equivalent of 20% of the compute capacity of the 
    current cluster. In theory, with a modest investment of about $6K per year, the current compute 
    capacity could be completely replaced in five years. Given current computing demands, however, 
    this is likely not sufficient.
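    The replacement arithmetic above can be sketched numerically. This is a rough sketch only; the one-server-per-year schedule and the per-year reading of the $6K figure are assumptions drawn from the discussion, not decisions recorded in these minutes.

```python
# Rough sketch of the evergreening arithmetic discussed in the meeting.
# Assumption: one 24-core server is purchased each year, and 24 cores
# correspond to roughly 20% of the current cluster's compute capacity.

YEARS = 5
FRACTION_PER_YEAR = 0.20   # one server ~ 20% of current capacity
COST_PER_YEAR = 6000       # approximate cost of one server, with taxes

replaced_fraction = YEARS * FRACTION_PER_YEAR
total_cost = YEARS * COST_PER_YEAR

print(f"capacity replaced after {YEARS} years: {replaced_fraction:.0%}")
print(f"total cost: ${total_cost}")
```

    Under these assumptions the cluster's current capacity would be fully replaced for about $30K over five years, which is why the $6K figure is best read as a per-year investment.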

    Another point raised was that the research computing infrastructure includes storage, which is also 
    ageing. Gerald inquired about the frequency of disk failures; Stephen indicated that there have been 
    two disk failures since May, and that we have two spare disks for hot replacement. Wade related 
    another disk failure that happened recently at HPCVL. It was agreed that in the long run we should 
    start to look at solutions to replace or upgrade our disk storage. The storage server capacity was 
    reviewed: cisk and tusker have 48 disks each with 2 TB capacity, and ngoma has 48 disks each with 
    1 TB capacity, with current disk space consumption at about 60%. Gerald proposed evergreening the 
    disks by replacing the current disks with larger ones, which is the current practice. Wade, however, 
    indicated that the disk storage chassis are ageing as well and will need to be replaced in time. A 
    possible chassis replacement proposed by Wade was to look at open hardware, such as Backblaze's 
    open specifications for their disk servers, as a lower-cost alternative. Gerald indicated that this 
    should also be addressed in the evergreening plan for research computing. This will be discussed 
    further in the next meeting.
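    For reference, the storage figures reviewed above imply the following raw and used capacities (a sketch; the 60% consumption figure is the approximate value quoted in the meeting):

```python
# Storage servers reviewed in the meeting: (number of disks, TB per disk).
servers = {
    "cisk":   (48, 2),
    "tusker": (48, 2),
    "ngoma":  (48, 1),
}

raw_tb = sum(n_disks * tb for n_disks, tb in servers.values())
used_tb = 0.60 * raw_tb  # ~60% consumption, as quoted

print(f"raw capacity: {raw_tb} TB")
print(f"approx. used: {used_tb:.0f} TB")
```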

5. Queuing Discipline

    In the last meeting, Wade proposed returning to the resource owner-based priority queuing that was 
    implemented before. However, this does not address an issue raised by Dave regarding the ability to 
    accommodate quick, short jobs from users who are not resource owners. The discussion continued 
    regarding the priority given to the Theory group for their proposed server once it is integrated into 
    the general compute cluster. Rowan brought up the previous practice within a research group of 
    alerting other users when a major request for available compute resources was being made, and asked 
    how this could be extended to all the research groups. There was some discussion about the best 
    forum for having this discussion; Thomas suggested that a mailing list was perhaps the simplest. 
    Another issue raised was who would arbitrate contending requests. The notion of a RAC (resource 
    allocation committee) was discussed briefly.

    As time was expiring, an action item was set for the next meeting: a queuing discipline proposal.

6. Proposed Agenda for the next meeting on December 10th, 2:30-4:30 PM

   o Review minutes from the previous meeting
   o Approval of the minutes
   o Evergreening of Research Computing Cluster Proposal
   o Queuing Discipline Proposal
   o Departmental IT Infrastructure Changes Proposal
   o AOB

7. A call for agenda items will be made a week or two in advance of the next meeting