Computer Meeting Minutes - February 2, 2015

Date: February 2, 2015, 3:00-4:00 PM.

Attendees: G. Oakham, D. Rogers, R. Thomson, T. Gregorie, T. Xu, W. Hong, M. Hu, S. Wang

1. Review Minutes and Approval

 

Michael asked that the minutes of December 10th be approved as posted on the Physics website if there are no objections.

2. Theory Group Server Update

Michael reported that the Theory group's computer server was set up on December 18, 2014, and joined to our cluster system. Theory group students started running jobs on December 22nd. The server has been running well with no issues so far. Michael asked whether the Theory group had any concerns about the server; T. Gregorie indicated there is not much heavy computing from the Theory group.

3. IT Issues Discussion

Michael reported a disk failure on the wiki server (vor) that happened before the holidays. The failed disk held the data area for all wiki sites, and we had no backup for the server. Wade tried to recover the disk without success. Michael then tried to recover the wiki web content from Google and Bing cached snapshots. Atlas-wiki is the only site with snapshots on Google or Bing; the other wiki sites have no snapshots on either. Michael said he retrieved around 60% of the web content for atlas-wiki; the other wiki sites lost all of their content. Wade checked with Alain, who said ilc-wiki is the most important wiki for the ATLAS group and needs to be rebuilt soon.

Dave raised a question regarding the robots.txt that prevents his home page from being indexed by search engines. Wade said it was set up by Bill Jack and he cannot recall the reason for it. However, we will change it after we move the web server to the virtualized platform. Dave asked for an estimated time; Wade indicated the move will start after we re-configure the queuing system and separate the home folders.
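For reference, excluding pages from search-engine indexing is normally done with a robots.txt file at the web server's document root. The sketch below is illustrative only; the actual file on vali and the path to Dave's home page were not recorded in the meeting:

```
# robots.txt (illustrative sketch; actual contents and paths are assumptions)
User-agent: *
# Block crawlers from personal home pages under /~user/ paths
Disallow: /~dave/
```

Removing or narrowing the Disallow rule would let search engines index the page again, which is likely the change planned for after the virtualization move.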

Cluster users continue to write data to their home area, which slows down the mail system. Michael said that cluster jobs either writing large data to or reading large data from the home folders cause NFS issues. We need a better way to prevent this from happening more and more. Wade proposed a solution: separate the faculty and staff home folders from the student home folders.
 

4. Home Folder Separating Discussion

Michael listed all faculty/staff home folders and all student home folders by size. There are 24 users in the faculty and staff group whose home folders are larger than 1 GB, and 25 such users in the student group. The total home folder size is around 457 GB for faculty and staff and around 260 GB for students. Michael proposed migrating the student home folders to a different server (tusker).
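Since the cluster uses NIS (item 6 mentions NIS servers), one plausible way to implement this split is with an automount map that points student homes at tusker while faculty/staff homes stay on the current server. The fragment below is a sketch only; the map name, export paths, current server name, and user names are all assumptions (only "tusker" comes from the minutes):

```
# auto.home map (illustrative sketch; paths, map layout, and server
# "oldserver" are assumptions -- only tusker is named in the minutes)

# faculty/staff home stays on the existing home server
profA    oldserver:/export/home/profA

# student home migrated to tusker
studB    tusker:/export/home/studB
```

With per-user map entries like this, users keep the same /home/username path after migration, so job scripts do not need to change.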

5. Queuing System Discussion

Michael presented PowerPoint slides describing the new queuing system. The new system reserves resources for resource owners' jobs by setting different priorities on different groups' hosts. First, all users are sorted into groups. Eight major groups will be created: atlas, egs, theory, ottawahospital, sno, exo, ilc, and guest. Six new queues will be created: atlas.q, egs.q, egsshort.q, theory.q, ottawahospital.q, and guest.q (Dave recommended creating guest.q for the guest group). Each group will be associated with its own queue, controlled by an access list. In more detail, atlas group users may submit jobs only to atlas.q, egs group users only to egs.q or egsshort.q, theory group users only to theory.q, and so on. Each queue is configured with its owner's hosts at high priority and the remaining resources at low priority, as follows.

 1. atlas.q is configured with all atlas nodes at high priority and all other groups' resources (all egs nodes, all ottawahospital nodes, the theory node) at low priority.
 

 2. egsshort.q is configured to run short jobs (up to 15 minutes) and contains the 10 fast egs nodes (egs43-egs52) at high priority.

 3. egs.q is configured with the 10 fast egs nodes at middle priority, the remaining egs nodes at high priority, and all other resources (all atlas nodes, all ottawahospital nodes, the theory node) at low priority.

 4. ottawahospital.q is configured with all ottawahospital nodes at high priority and all other resources (all atlas nodes, all egs nodes, the theory node) at low priority.

 5. theory.q is configured with the theory node at high priority and all other resources (all atlas nodes, all egs nodes, all ottawahospital nodes) at low priority.

 6. guest.q is configured with all resources at low priority; it is for guests only.

Michael confirmed with Gerald that atlas.q can grant access to the Exo, Sno, and Ilc groups, so we do not have to create dedicated queues for the exo, sno, and ilc groups.
 

Egsshort.q is only for the egs group to run short jobs, so its wall clock limit has been set to 15 minutes. Any job running longer than 15 minutes in egsshort.q will be terminated. We can adjust the wall clock time in the future if needed. Both Dave and Rowan agreed with that configuration.
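The ".q" queue naming suggests a (Sun) Grid Engine scheduler, though the minutes do not name the software. Assuming Grid Engine, the egsshort.q setup described above would look roughly like the queue configuration fragment below; the hostgroup name is hypothetical, while the queue name, access list, and 15-minute limit come from the minutes:

```
# Illustrative Grid Engine queue configuration for egsshort.q
# (assumes SGE; hostgroup name "@egsfast" is an assumption)
qname        egsshort.q
hostlist     @egsfast      # assumed hostgroup holding the 10 fast nodes egs43-egs52
user_lists   egs           # access list: only egs group members may submit here
h_rt         00:15:00      # hard wall clock limit; jobs are killed after 15 minutes
```

The per-queue h_rt hard limit is what enforces the "terminated after 15 minutes" behaviour, and user_lists implements the access-list restriction described for all six queues.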

6. State of Current Research Computing Infrastructure

Michael showed PowerPoint slides listing the IT infrastructure changes that will be starting soon, in priority order:

  1. Home folder separation (discussed in item 4).
  2. Server Virtualization:

     -- Vor is our wiki server, which should be considered for a move to a VM.
     -- Vali is our web server, which should be considered for a move to a VM.
     -- Tyr/Thor are our two interactive nodes, which should be moved to VMs.
     -- Ran is our mail server, which should be moved to a VM.
     -- Thorkk/Heisenberg (Sunray servers) should be moved to VMs.

  3. Vidar/Mosshead are our primary and slave NIS servers, which will be moved to other physical hardware.
  4. Deploy a switch with 10 Gb interfaces.

 

7. CMO


The IT group has not yet had a chance to prepare the evergreen department IT infrastructure CMO proposal. Gerald asked if we can prepare it for the next meeting.

8. AOB

Wade reported that HPCVL visited campus and met with the Carleton research computing group today. HPCVL is trying to consolidate its cluster centres into a couple of sites. Michael asked if the Carleton HPCVL site will be kept; Wade said there is no decision yet from HPCVL.

9. Next Meeting
Michael asked about scheduling the next meeting; Gerald suggested the first week of March and that a Doodle poll be set up.
