Date: 2015 April 9 3-4PM.
Attendees: G. Oakham, D. Rogers, T. Gregorie, T. Xu, R. Thomson, W. Hong, M. Hu, S. Wang
1. Review minutes and approval
Michael asked that the minutes of March 2nd be approved as posted on the website if there were no objections. David pointed out two errors in those minutes. First, ottawahospital.q has only 32 cores, not the 56 cores stated in the minutes. Second, the minutes did not record the discussion about offsite backup or moving servers to different locations such as the hpcvl server room. Michael said he would correct the minutes and update them on the website.
2. Common IT Issues Discussion and Follow-up
Michael reported on the status of recent purchases: two Dell UPS units have arrived, along with six 2TB disks and two 1TB disks for our data servers (cisk, tusker and ngoma).
Michael raised two IT issues. First, one cluster node, atlas28, recently went out of service, so we have lost 20 cores in total (atlas28, plus egs19, egs20 and egs23 last year). We are looking for replacements, to be funded by the research evergreen project. Second, one of the data servers, ngoma, has been down a couple of times recently, and data067 on ngoma is almost 100% full. Michael asked all research groups to clean up their data folders.
Tong asked what "ngoma down" means, what its consequences are, and whether disk fragmentation is the cause. Michael answered that heavy disk I/O made ngoma unresponsive, which is what "ngoma down" refers to. The consequence is that the data folders on ngoma become unavailable, which in turn affects cluster jobs that need to read from or write to those folders. Additionally, Wade clarified that ngoma uses software RAID 6, which can take care of the disk fragmentation issue.
Rowan raised questions about macmail and pine being slow, and Gregorie mentioned he has a similar issue with macmail. Michael recommended Thunderbird for Mac as our mail client.
3. Queuing System Update
Michael reported that all queues are working well. He asked whether we should tune egsshort.q by reserving one core on each egsshort host (egs43 to egs52), since David had mentioned that egsshort.q jobs cannot start right after submission when all resources are over-subscribed. Both Dave and Rowan suggested leaving egsshort.q as it is, since they do not have high demand for it at present. Michael said egsshort.q can be tuned at any time if further changes are required.
4. Home Folder and New Backup Discussion
Michael reported that the separation of home folders between faculty/staff and students was completed recently, and that all faculty/staff email has been running very well since then. Michael presented size statistics for both faculty/staff and student home folders as follows.
Faculty/staff home folders total 304G across 56 users; student home folders total 337G across 98 users.
Michael presented PPT slides showing the new backup diagram. The new scheme runs two backup jobs: a daily backup and a weekly backup. The weekly backup is a full backup taken every Sunday. The daily backup is an incremental backup run every day, relative to the most recent Sunday full backup. Daily backups are stored on the same host, but weekly backups are stored across hosts: the cisk weekly backup is stored on tusker, and the tusker weekly backup is stored on cisk. This arrangement prevents data loss if either tusker or cisk goes down. Tong asked whether there have been any issues with the cross-host weekly backup; Michael said it has been running smoothly so far. Michael also asked all research groups to help by cleaning up their home folders, which would shorten the weekly backup time window.
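The backup rotation described above can be sketched roughly as follows. This is only an illustration of the schedule as minuted, not the actual backup scripts: the function name `backup_plan` and the table `WEEKLY_TARGET` are hypothetical, and the sketch assumes that only the full backup runs on Sundays.

```python
from datetime import date

# Hypothetical table for illustration: the host that stores each
# server's weekly full backup (cross-host, as described in the minutes).
WEEKLY_TARGET = {"cisk": "tusker", "tusker": "cisk"}


def backup_plan(host, day):
    """Return (backup kind, destination host) for `host` on `day`."""
    # date.weekday() numbers Monday as 0, so Sunday is 6.
    if day.weekday() == 6:
        # Weekly full backup, stored on the partner host.
        return ("full", WEEKLY_TARGET[host])
    # Daily incremental backup, stored on the same host.
    # Assumption: no incremental backup runs on Sunday itself.
    return ("incremental", host)


if __name__ == "__main__":
    for host in ("cisk", "tusker"):
        print(host, backup_plan(host, date(2015, 4, 9)))   # a Thursday
        print(host, backup_plan(host, date(2015, 4, 12)))  # a Sunday
```

With this layout, losing one of the two hosts still leaves the other host's copy of the most recent Sunday full backup, but not the incrementals taken since then, because those are kept on the host that failed.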
5. CMO
Michael presented the CMO budget for 2015-2016. The budget has 3 columns: 2015 budget, 2014 expense, and 2014 budget. Each column is broken down into 6 items plus totals: printer supplies and equipment maintenance, software purchases, server upgrades, equipment purchase, network changes (charges from CCS), and miscellaneous. The total of the new budget is 11287.00, the 2014 expense is 9879.62, and the 2014 budget was 10500.00. Gregorie asked what the equipment purchase item covers and how it was used. Michael said equipment purchases are for any devices we may need. Gerald stated that there is a policy against carrying a large amount over to the next year, so the money has to be spent by the end of April; thus, we spent 2,500 on equipment purchases.
Michael listed the quotes appended to the new CMO budget: HP printer maintenance kits, and disks and disk trays for virtualization.
6. Departmental IT Infrastructure Changes Proposal
Michael reported that the home folder separation is complete; the rest of the projects will be started later on.
7. AOB
Wade said he had two items to add under AOB. Item 1 is the replacement of a section of the roof on HP, starting in May, which may cause an outage for 2-3 weeks. He will keep everybody posted.
Item 2 is the A/C repair, for which temporary A/C units may be placed in the hallway. Dave and Tong asked whether their labs would be affected if the temporary A/C units pump hot air into the hallway, and how long this would last. Wade said the decision has not been made yet; this is just a heads-up. Gerald asked Wade to send an email notification to all faculty so they can make arrangements. Wade said he is pushing not to shut down the cluster and servers.
8. Next Meeting Proposal
Michael asked when to schedule the next meeting. Gerald said the computer meeting is supposed to break for the summer unless there is something to update, so the next meeting will be held in the fall.