
Computer Meeting Minutes - March 2, 2015

Date: March 2, 2015, 3-4 PM

Attendees: G. Oakham, D. Rogers, T. Gregorie, T. Xu, W. Hong, M. Hu, S. Wang

1. Review minutes and approval

Michael asked that the minutes of February 2nd be approved as posted on the physics website if there were no objections.

2. Common IT issues Discussion and Follow-up

Michael reported that the ilc-wiki website has been set up and is running; it is now waiting for the ilc group to upload the wiki content.

The question Dave raised about robots.txt preventing web search engines from indexing the home page has not been looked at yet, because the IT group is working on the project to separate the home folders. IT will look into it once the home folder project is finished. Dave said he understood the priorities and the arrangement.

Cluster users continue to write data to their home areas, which slows down the mail system. Michael said this should be fixed once the faculty and staff home folders are separated from the student home folders.

Michael said a disk recently failed on server "cisk"; it belonged to the cisk root pool. The IT group replaced it with a spare hard disk. Since there is only 1 spare disk left, Michael asked whether we should purchase more disks for future replacements. Gerald asked how often disks fail and how many disks there are in total across the 3 servers (cisk, tusker, ngoma). Stephen answered that there are 144 disks in total (96 2TB disks on cisk and tusker, 48 1TB disks on ngoma) and that 4 disks have failed this year. All agreed to purchase more spare disks. Wade suggested that Michael and Stephen check our CMO balance and purchase some 2TB disks and some 1TB disks.
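As a rough aid for sizing the spare pool, the figures Stephen quoted can be turned into a failure rate. The sketch below is illustrative only: the function name is hypothetical, and it simply divides the reported failures by the reported disk count without distinguishing between servers or disk sizes, which the minutes do not break down.

```python
def failure_rate(failed, total):
    """Fraction of disks that failed over the reporting period."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# Figures quoted at the meeting.
disks_2tb = 96  # on cisk and tusker
disks_1tb = 48  # on ngoma
total_disks = disks_2tb + disks_1tb
failed_this_year = 4

rate = failure_rate(failed_this_year, total_disks)
print(f"{total_disks} disks, {rate:.1%} failed this year")
# prints "144 disks, 2.8% failed this year"
```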

3. Queuing System Update

Michael presented the new queuing system configuration and the details of each queue. There are 644 cores in total: 620 cores on execute hosts and 24 cores on interactive hosts. By group, the atlas group has 376 cores, the egs group has 212 cores, the ottawahospital group has 32 cores, and the theory group has 25 cores. All users have been sorted into 8 groups: atlas, egs, ottawahospital, theory, exo, sno, ilc and guest. There are 6 major queues configured: atlas.q, egs.q, egsshort.q, ottawahospital.q, theory.q and guest.q. Currently guest.q is disabled since there are no users in that group. Michael indicated that the default queue, all.q, has also been disabled.

Michael said he has posted the detailed information and a user guide on how to use the cluster on the physics website, so all groups can direct their students or new users to those pages if they have questions about the new queuing system.

Dave raised an issue with egsshort.q: he can't run jobs right away when the queuing system is oversubscribed. Wade said we could tune egsshort.q by reserving dedicated resources for it, instead of making all cores on the 10 egsshort.q nodes available to all queues, but this would waste resources if egsshort.q is not used frequently. Wade said it is a trade-off. Dave said he is happy with the current setting and that we can discuss this later if it becomes a real issue.

4. Home Folder and New Backup Discussion

Michael gave an update on the migration of faculty and staff home folders from server cisk to server tusker. He said the migration is going very smoothly, and Tong and Stephen confirmed they had not experienced any issues with it.

Michael presented PowerPoint slides to explain the current backup and the new backup solution. He indicated that the current backup runs only on cisk, where there are two backup jobs. Job1 backs up all home folders to the back0 partition and Job2 backs up everything from back0 to back1; a full backup runs on Sunday and incremental backups run on the other days. Michael pointed out two concerns with the current backup method. The first is that Job1 backs up cumulatively, which means back0 is not an exact copy of the home folders, so back1 also keeps growing over time. Consequently, cisk will run out of disk space. The second is that the current backup stores all data on back0 and back1 on the same host, which carries a high risk of data loss if the server has a disk failure or power failure.

Michael proposed a new backup solution, which he has discussed with Wade. As Michael showed on a previous slide, cisk will be the home folder server for students and tusker will be the home folder server for faculty and staff. In the new backup solution there are 4 partitions to store backup data: back0 and back1 on cisk, and data078 and data100 on tusker. Michael said he modified the scripts for both Job1 and Job2. Job1 on cisk backs up an exact copy of all student home folders to back0. Job1 on tusker backs up an exact copy of all faculty and staff home folders to data078. Job2 on cisk copies everything from data078 on tusker to cisk's back1 partition; Job2 on tusker copies everything from cisk's back0 to tusker's data100. Job1 on both cisk and tusker runs every day at 9PM and Job2 on both cisk and tusker runs every day at 12:30AM, with a full backup on Sunday and incremental backups on the other days of the week. The new backup solution avoids both the risk of running out of disk space and the risk of a single server failure (power failure or disk failure).

Gerald asked how long Job2 will take from cisk to tusker and from tusker to cisk. Wade said it will depend on the changes to the home folders and on their size, and Michael said it should be completed before Monday. Dave raised the concern that we should consider offsite backup, or move one of the servers to a different location, in line with disaster recovery best practice. Wade said we can move one of the servers to the 5th floor, where the HPCVL is now, since there is space, power and A/C to accommodate our servers.
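The cross-host layout described above can be written down as a small table and checked mechanically. The following is only a sketch of the scheme as recorded in these minutes, not the actual backup scripts: the "host:partition" strings, the labels for the home folder sets, and the function names are all illustrative assumptions.

```python
from datetime import date

# Layout as recorded in the minutes: (host running the job, job) -> (source, destination).
# The "host:name" strings are an illustrative notation, not real paths.
BACKUP_JOBS = {
    ("cisk", "job1"): ("cisk:student-homes", "cisk:back0"),
    ("tusker", "job1"): ("tusker:faculty-staff-homes", "tusker:data078"),
    ("cisk", "job2"): ("tusker:data078", "cisk:back1"),
    ("tusker", "job2"): ("cisk:back0", "tusker:data100"),
}


def host_of(location):
    """Return the host part of a 'host:name' string."""
    return location.split(":", 1)[0]


def hosts_with_backup(home_source):
    """Hosts that end up holding a backup copy of the given home folder set."""
    hosts = set()
    first_copy = None
    # Job1 makes the first copy on the same host as the home folders.
    for (_, job), (src, dest) in BACKUP_JOBS.items():
        if job == "job1" and src == home_source:
            first_copy = dest
            hosts.add(host_of(dest))
    # Job2 copies that first copy to the other host.
    for (_, job), (src, dest) in BACKUP_JOBS.items():
        if job == "job2" and src == first_copy:
            hosts.add(host_of(dest))
    return hosts


def backup_mode(day):
    """Full backup on Sunday, incremental otherwise (date.weekday(): Sunday == 6)."""
    return "full" if day.weekday() == 6 else "incremental"


for homes in ("cisk:student-homes", "tusker:faculty-staff-homes"):
    print(homes, "->", sorted(hosts_with_backup(homes)))
```

Under this layout each set of home folders has a backup copy on both cisk and tusker, which is the property that protects against the failure of a single server.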

5. Departmental IT Infrastructure Changes Proposal

Michael said the IT group has not had a chance to work on the virtualization project or the departmental IT infrastructure change projects.

6. CMO

Michael proposed purchasing two UPS units for the server room and disks for the virtualization hosts, and he showed the quotes. The UPS costs around $2,200.00, which looks good at current market prices. The 4TB disk price is around $164.00. Wade indicated we don't really want 4TB disks, since we would lose 4TB of disk space when configuring RAID 6; 2TB seems good enough for us. Michael said he will work on a new quote for 2TB disks. Gerald asked the IT group to prepare the CMO budget for 2015-2016 for the next meeting.
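One way to read Wade's point: RAID 6 reserves two disks' worth of capacity for parity, so moving from 2TB to 4TB disks raises that overhead by 4TB. The sketch below works through the arithmetic. The array size of 8 disks is a hypothetical value chosen for illustration, since the minutes do not say how many disks each virtualization host would hold, and the function names are likewise made up.

```python
def raid6_parity_tb(disk_tb):
    """Capacity reserved for parity in RAID 6: two disks' worth."""
    return 2 * disk_tb


def raid6_usable_tb(num_disks, disk_tb):
    """Usable capacity of a RAID 6 array of equally sized disks."""
    if num_disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (num_disks - 2) * disk_tb


NUM_DISKS = 8  # hypothetical array size, not taken from the minutes

for disk_tb in (2, 4):
    usable = raid6_usable_tb(NUM_DISKS, disk_tb)
    parity = raid6_parity_tb(disk_tb)
    print(f"{NUM_DISKS} x {disk_tb}TB: {usable}TB usable, {parity}TB parity")

# Extra capacity given up to parity when using 4TB instead of 2TB disks.
extra_parity_tb = raid6_parity_tb(4) - raid6_parity_tb(2)
```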

7. AOB

Wade said he had two items to add under AOB. Item 1 is the replacement of a section of the roof on HP around April, which may cause an outage. Item 2 is the possibility of adding new clusters to our server room for a newly hired faculty member in the Chemistry department.

8. Proposed next meeting

Michael asked when to schedule the next meeting; Gerald and Dave suggested the week of April 9th. Michael said he will set up a Doodle poll to vote on the time.
