Date: December 10, 2014, 2-4 PM.
Attendees: G. Oakham, D. Rogers, A. Bellerive, R. Thomson, T. Gregorie, T. Xu,
W. Hong, M. Hu, S. Wang, E. Lacelle
1. Review and approval of minutes
Michael asked that the minutes of November 10th be approved as posted on the physics website if there are no objections.
2. Updates from last minutes
Wade reported that the Theory group's computer server arrived on campus on December 10th. Michael reported that two boxes of tapes have been disposed of by demagnetizing and physically damaging them.
3. IT Issues Discussion
This is an old topic but still a major issue. Students write data from their running cluster jobs to their home areas, which generates heavy I/O, causes NFS problems and, consequently, slows down the mail system. Rogers asked whether any tools are available to find out who is writing data to the home area and how large the generated files are. Wade will look into possible options for checking who is generating data in the home area and how large that data is. Michael emphasized that users are supposed to write their data to their data area instead of their home area. Rowan said the only reason she and her students save data to the home area is that the data area is not backed up. Wade explained that the data area uses a software RAID 6 configuration, which means the chance of losing data there is very low; it would happen only if two physical hard disks failed at the same time.
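One possible starting point for the check Wade will look into is sketched below. It is only an illustration, not an agreed tool: it walks a directory tree, totals the size of regular files per owning user ID, and lists the largest files. The function names and the example path are made up for this sketch, and it reports what is currently stored, not live I/O from running jobs.

```python
import heapq
import os
import stat
from collections import defaultdict


def iter_regular_files(root):
    """Yield (path, stat_result) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield path, st


def usage_by_uid(root):
    """Return {uid: total_bytes} for regular files under root."""
    totals = defaultdict(int)
    for _path, st in iter_regular_files(root):
        totals[st.st_uid] += st.st_size
    return dict(totals)


def largest_files(root, top=10):
    """Return the `top` largest files as (size_bytes, uid, path) tuples."""
    return heapq.nlargest(
        top,
        ((st.st_size, st.st_uid, path) for path, st in iter_regular_files(root)),
    )


# Hypothetical usage; "/home" is an assumed mount point, not taken from the minutes:
# for uid, total in sorted(usage_by_uid("/home").items(), key=lambda kv: -kv[1]):
#     print(uid, total)
```

Run periodically, the per-user totals would show whose home area is growing between runs, which is a rough proxy for who is writing job output there.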
4. CMO
Eva reported that the CMO balance for 2014 is $7924 to date. She indicated that around $5600 is available to spend, since around $2200 needs to be budgeted for printer paper, printer cartridges and other daily consumables. There is around $4000 in CMO funds for evergreening the department's IT infrastructure, and Michael proposed purchasing UPS units with those funds. As noted at the last meeting, the server room lost a power phase a couple of weeks ago, and Wade spent over 5 hours bringing the systems back in order. A UPS would give the IT group some grace time to bring systems back in sequence and would also protect against power surges. Gerald asked how long a UPS can keep systems running. Wade said there is no standard time because it depends on system load; normally it lasts 15 to 30 minutes. Wade indicated that we are looking at two 1U rack-mount UPSes priced at around $1700 each, $3400 in total.
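The figures quoted under this item can be cross-checked with a few lines of arithmetic. The numbers are the approximate ones from the minutes; the variable names are only for this sketch.

```python
# Approximate figures from the minutes, in dollars.
balance = 7924            # CMO balance for 2014 to date
consumables = 2200        # paper, cartridges and other daily consumables
evergreen_funds = 4000    # funds for evergreening IT infrastructure
ups_unit_price = 1700     # price of one 1U UPS
ups_count = 2

# Balance left after setting aside the consumables budget.
after_consumables = balance - consumables

# Cost of the proposed UPS purchase and what remains of the evergreening funds.
ups_total = ups_unit_price * ups_count
evergreen_remaining = evergreen_funds - ups_total

print(after_consumables, ups_total, evergreen_remaining)
```

The balance after consumables comes to $5724, in line with the "around $5600" quoted, and the two UPSes fit within the evergreening funds with about $600 to spare.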
5. Queuing Discipline
We continued the discussion of the queuing discipline, since we did not finish it at the last meeting. In general, the queue system is supposed to reserve resources for the resource owners' jobs, which can be done by setting up different queues with different priorities. However, this approach might waste resources when no high-priority jobs are running. To avoid the waste, Michael proposed an alternative approach: setting up a subordinate queue discipline. In this setup, a high-priority queue is configured with one or more subordinate queues. Jobs running in the subordinate queues are suspended when the higher-priority queue becomes busy, and they are resumed when the higher-priority queue is no longer busy. Allan pointed out that suspending jobs is not a good idea; he believed that SGE (Sun Grid Engine) must be clever enough, or that we should configure it to be clever enough, to utilize all resources without suspending any jobs.
Wade proposed a different configuration that sets priority on the host nodes instead of on the queues. In this setup, we would create a separate queue for each group, and users in each group would only be allowed to submit jobs to their own queue. Each queue contains the group's own nodes at high priority and the other groups' nodes at a lower priority.
At the last meeting, Rowan raised the question of setting up a short queue for the egs group. Michael asked whether it is okay to run the egs short queue on all nodes; Allan clarified that the egs short queue should run only on egs nodes. Michael asked what the maximum running time should be for jobs in the short queue; Dave and Rowan suggested that 15 minutes would be reasonable for now. Wade said the wall-clock limit can be configured to 15 minutes and adjusted later.
So, there are 5 queues to be configured in total: Atlas queue, Egs queue, Egs short queue, Theory queue and Hospital queue. Michael will discuss the queue settings in more detail with Wade and bring a proposal to the next meeting.
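The subordinate queue behaviour Michael described under this item can be pictured with the toy model below. It is a plain Python illustration, not SGE configuration: the class names, the "busy" threshold, and the choice of egs as a subordinate of atlas are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    state: str = "running"  # either "running" or "suspended"


@dataclass
class Queue:
    name: str
    busy_threshold: int = 1  # assumption: "busy" means this many jobs or more
    jobs: List[Job] = field(default_factory=list)
    subordinates: List["Queue"] = field(default_factory=list)

    def is_busy(self) -> bool:
        return len(self.jobs) >= self.busy_threshold


def apply_subordination(high: Queue) -> None:
    """Suspend jobs in subordinate queues while `high` is busy, resume otherwise."""
    new_state = "suspended" if high.is_busy() else "running"
    for queue in high.subordinates:
        for job in queue.jobs:
            job.state = new_state


# Toy scenario: queue names come from the minutes, the jobs are made up.
egs = Queue("egs", jobs=[Job("egs-1"), Job("egs-2")])
atlas = Queue("atlas", subordinates=[egs])

atlas.jobs.append(Job("atlas-1"))   # an owner job arrives, so atlas becomes busy
apply_subordination(atlas)
states_while_busy = [job.state for job in egs.jobs]

atlas.jobs.clear()                  # the owner job finishes
apply_subordination(atlas)
states_after = [job.state for job in egs.jobs]

print(states_while_busy, states_after)
```

The model makes Allan's concern concrete: every job in the subordinate queue stops making progress for as long as the owner queue is busy, even though no resources are wasted while the owner queue is idle.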
6. Departmental IT Infrastructure Changes Proposal
Michael proposed 9 changes to the department's IT infrastructure, as follows.
1. Tyr and thor are our two interactive nodes. We plan to move them to VMs, since both are running on out-of-date hardware.
2. Ran is our mail server. It should also be considered for migration to the virtualization platform, with new applications installed: Postfix + Dovecot for IMAP, Maildir replacing inbox, our existing spam filter (SpamAssassin) retained, and a new open-source virus scanner.
3. Vali is the web server for user homepages, the graduate evaluation form server and the calendar server. It should also be considered for migration to a VM.
4. Vidar is the primary NIS, DNS, DHCP and printer server. It should be moved to new physical hardware; migrating it to a VM is not a good idea, since it is the main user authentication server.
5. Moosehead is the slave NIS server for the cluster domain. It should be moved to a new physical node; for the same reason as Vidar, we should keep it on physical hardware.
6. Thokk and heisenberg are the Sun Ray servers. They should be considered for a move to VMs. We can reuse our Sun Ray thin clients after the upgrade.
7. Vor is a wiki server, which should be considered for a move to a VM.
8. cisk is the home area data server. We will isolate backups by backing up to different hosts instead of to cisk itself, and separate faculty home folders from the student home area.
9. Deploy a new switch with 10Gb interfaces.
Gerald asked that the items be prioritized. Rowan asked how long the changes would take to implement. Wade indicated that completing them may take more than 6 months.