N8 HPC Technical Management Group Minutes

28th Jan 2015 10.30am – 3.30pm

Sackville Street Building, University of Manchester

Attending

Chair - Alan Real, University of Leeds

Cliff Addison, University of Liverpool

Andrew Smith, University of York

Robin Pinning, University of Manchester

Gillian Sinclair, University of Manchester

Mark Dixon, University of Leeds

Graham Collins, Newcastle University

Anthony Brookfield, University of Sheffield

Apologies

Mike Griffiths, University of Sheffield

Mike Pacey, Lancaster University

Actions Ongoing

Action No. / Assigned to / Action
1.1 / AS / RLP / AR / Send details of what accounting information and what email alerts they would like via the wiki page.
July update - Leeds, Manchester and York to have a phone conference to work together on accounting information. York have volunteered a student to do some of the work.
Nov update – The conference call hasn’t happened. York might visit Leeds instead. York still have a student available but only 1 day a week as he has been partly assigned to another project. He is currently analysing the data.
Jan update – AS is continuing to analyse the stats and will report back by the end of Feb.
4.4 / Leeds / To start discussion on privacy statements etc.
July update – ongoing. MD is continuing to discuss as we need updated T&Cs. All users would have to agree to these.
Nov update - No progress made. Trouble pinning down the person at Leeds.
Jan – action still ongoing
5.3 / GS / Get announcement / press release on website about Farr Facility
Nov update – waiting for Farr machine to be available
6.1 / CA / To write a page for the internal website on how to use Sharepoint.
Jan update – will be done by 6th Feb
6.6 / Leeds / Prepare a news article on tips and tricks with specific reference to debugging, testing and development techniques.
Jan update – ongoing. Additions will be made to an existing article.
6.7 / MD / Ask about GS getting access to email addresses in order to pass information on to users regarding services and privacy statements.
Jan update - ongoing
6.13 / MD / To devise / obtain an explicit acceptance of privacy statement for project renewal.
Jan update - ongoing

New Actions

Action No. / Assigned to / Action
7.1 / LL / To see if the speed of the N8 HPC WordPress site can be improved.
7.2 / AR / Chase up Intel contacts regarding Xeon Phi workshop
7.3 / CA / Check with George Barakos regarding hosting the Xeon Phi workshop at Liverpool
7.4 / MC / Add more on advanced reservations to the article on tips and tricks
7.5 / CA / Contribute to article on tips and tricks
7.6 / GS / PDG / Come up with questions for a BDM survey before meeting in March.
7.7 / DS / Write a news article on use of N8 HPC for CDT training which can be repurposed by TMG for local dissemination.
7.8 / CA / Send AR and RLP any information about the data centres that he can.
7.9 / AS / Share data with Leeds and meet with MD to discuss the report. The data will then be circulated more widely and issues will be brought to the April meeting.
7.10 / All / Suggest possible lead academics for bio-health, fluid dynamics, aerospace and Insigneo at Sheffield network events.
7.11 / Leeds / Investigate queuing issues at Durham
7.12 / GS / Check with MD if it’s clear on the N8 HPC website about job running times.
7.13 / Leeds / Once Leeds and York have discussed the current issues, a “manage expectations” page / text should be added to website and newsletter etc.
7.14 / GS / Investigate a blog post / alternative way of passing on technical information to the users to help them tackle their issues.
7.15 / GS / Find email from RLP and ensure the IT directors meeting is moving along.
7.16 / GS / Make it clearer that users who used their local helpdesk were happy.
7.17 / GS / Let the TMG know the user choices in the question regarding code.
7.18 / GS / Once the user report is finalised, draw up an action plan from user feedback on how we will respond and react.
7.19 / GS / Follow up with the suggested names for the Business Engagement event speakers

Minutes

1. Minutes and actions

The minutes from the previous meeting were approved.

Action 3.6 on MKB has been removed. It was to check with BIS and HEFCE to see if there had been progress made re: Carbon & CSC. This will now be raised at the next N8 HPC PDG meeting.

Action 3.7 on All to name a training contact and to put the name on the wiki has been done, but this led to a discussion on the WordPress site being very slow.

New action 7.1 – LL – to see if the speed of the WordPress site can be improved.

Action 5.6 on AR to contact Intel about a Xeon Phi workshop.

This action has been done. AR spoke to Intel reps at SC14 who suggested that Steven Blair-Chappel may be the person to give the workshop. They took this away as an action, but AR will chase them up. We need a site that wants to host the workshop. Possibilities for hosting the workshop include York and Liverpool, as Manchester has already hosted one of these events. Liverpool has training rooms that can hold 15 – 20 people but all that’s really required is a room with good wifi. CA will see if George Barakos is interested in hosting the event as he is a big N8 HPC user and has suitable code. If not, the event will be hosted at Leeds and MC will be the organiser.

New action 7.2 – AR – to chase up Intel contacts regarding Xeon Phi workshop

New action 7.3 – CA – to check with George Barakos regarding hosting the Xeon Phi workshop at Liverpool

Action 6.2 – MD / AR - check regarding N8 HPC use of the Allinea licence

Initially Allinea said that N8 HPC users could use the Leeds instance of Allinea as it was installed on a “local” machine, but they changed their minds and said that we would require a supercomputing licence. The only N8 HPC users that can use Allinea are Leeds N8 HPC users. MD has been on a Tau course at Hartree but Tau is a mess to install. Leeds has installed it on a local machine, is currently testing it and will hopefully install it on Polaris. It’s primarily a parallel profiler. They are also playing with MPI. RLP asked if there is a tool set we actually want to support and deliver training on, which could be standardised across N8. MD replied that we need to work out which tools are actually useful to people. RLP commented that this is work Manchester have already done. Action is now closed.

Action 6.6 – Leeds - Prepare a news article on tips and tricks with specific reference to debugging, testing and development techniques.

MC has prepared an article on how to get jobs running quickly on Polaris which he has sent to GS. The above action has largely been captured in the article, but more needs to be added on advanced reservations. CA will help with the article. A contact point should be included in the article for users to get in touch if they are still having problems.

New action 7.4 – MC - add in more on advanced reservations to the article on tips and tricks

New action 7.5 – CA – to contribute to article on tips and tricks

2. Reports

Report from each site

Business Development Engagement

The lack of engagement between the TMG and the BDMs has been raised at Steering Group and PDG. We need to reiterate to the BDMs that we’re not offering cycles, we’re offering expertise. We are having a BDM event in March (to be discussed later) which will emphasise this. AR asked how we can make sure that N8 HPC and local resources are recognised as contributing to industry engagement successes. AR suggested that we ask the BDMs in advance of the event where HPC has made a difference in their industry relationships. This could be a simple set of four questions.

Action 7.6 – GS / PDG – to come up with questions for a BDM survey before meeting in March.

Training

There is lots of training taking place with the CDTs across N8. Newcastle are considering hosting a Software Carpentry course. York has a requirement for an advanced Python course.

Action 7.7 – DS – to write a news article on use of N8 HPC for CDT training which can be repurposed by TMG for local dissemination.

Services / Projects

Durham is now an IPCC but this is tied to Physics through the Institute for Cosmology who are optimising Gadgets and Swift for Xeon Phi. There is money to appoint someone to do this.

Leeds are currently advertising posts on redeployment. There is a vacancy for a Grade 8 advanced research computing post which is more user facing than systems facing. Chris Wareing, who did a lot of training but was only part time, has now gone to Physics. A Grade 7 replacement for him starts next week. AR asked consultants from Red Oak to come in and examine his HPC strategy to see if it was up to scratch, which it was.

Liverpool are looking for £1M to buy a building for a data centre to host approx. 50 racks half of which will be water cooled.

Action 7.8 – CA – to send AR and RLP any information about the data centres that he can.

Newcastle are pulling together all their research computing into one. Graeme is currently making a business case for someone to support research computing. They are thinking about having a fact finding day and inviting Newcastle people involved in institutional HPC to contribute their ideas.

York have been doing work on job queues and looking at the issues. It looks as though all sites are having jobs queued for some time, including some queuing for up to 3 months. Liverpool’s jobs seem to be running very quickly even though they are using a lot of the machine. Worryingly, a lot of jobs sent to N8 HPC terminate within 5 mins. For example, Newcastle has a lot of jobs that run for less than 5 mins. AS will do further investigation into the types of jobs that people are running, but people are waiting over 2 days for 256 core jobs to run. At Christmas, when the load was light and jobs should have been running almost straight away, a user still had to wait 2 days. This user was running 30,000 single task jobs at the same time though.

Action 7.9 – AS to share data with Leeds and to meet with MD and discuss report. The data will then be circulated more widely and issues will be brought to the April meeting.

Steering group update

N8 HPC Network event

HPC for Quantum Materials Simulation took place at the University of York on the 9th Jan with around 50 people attending, including PhD students. The event was deemed a success by those that took part and we are now looking at holding further events. We need academics to drive network events in the following research areas – bio-health, fluid dynamics, aerospace, Insigneo at Sheffield. CA suggested that the Farr machine could be a catalyst for a bio-health event.

Action 7.10 – TMG – Suggest possible lead academics for bio-health, fluid dynamics, aerospace and Insigneo at Sheffield network events.

Farr Facility

AR described two projects that John Ainsworth is involved in. One is to create a secure network layer between Manchester and Leeds and the second is to use Moonshot as an access mechanism on Farr facilities. Both projects come under the Safeshare project.

Durham Campus Visit

GS and RLP updated the group on the recent campus visit to Durham. One issue that was raised by several users is queuing times. GS asked if we need to manage users’ expectations, for example by giving them an idea of “normal” waiting times.

Action 7.11 – Leeds – investigate queuing issues at Durham

Action 7.12 – GS – check with MD if it’s clear on the N8 HPC website about job running times.

Action 7.13 – Leeds - once Leeds and York have discussed the current issues, a “manage expectations” page / text should be added to website and newsletter etc.

Action 7.14 – GS – investigate a blog post / alternative way of passing on technical information to the users to help them tackle their issues.

3. Refresh Bid

The slides shown by Chris Taylor at the Steering Group meeting regarding the refresh bid were circulated during the meeting to TMG. There are many additions to be made to the slides due to suggestions from the SG so these are not the final headings. AR suggested that there may be an announcement coming up connected to the spending review. It has also been suggested that we could make a request direct to Treasury.

Regarding architectures, we are currently asking for one large machine but this has not been finalised. AR stated that this is our default position. In the user survey etc. everyone just asks for more of the same – no one is really suggesting alternative architectures / technologies. RLP asked what we were doing about big data. Could we put in an OpenStack machine? There is more scope for consolidation. Where is the compute for Turing going?

4. Shared Data Centre (AR / RLP)

AR explained that JANET have been in touch regarding the shared data centre in Slough where people can buy hosting capacity. Institutions currently involved include UCL, KCL, QM, Crick etc. The main driver was the e-med lab as they needed the space. JANET has a framework with costs for rack space but the overriding cost is the amount of plant that you buy. RLP and AR have been sent information which includes the recurrent costs. N8 HPC seems like a natural group for JANET to engage with and RLP and AR have had various meetings with them. JANET have produced cost models and AR and RLP are currently comparing them to an “own build”. Durham has proposals for their own data centre as they need one irrespective of DIRAC acquisition. AR pointed out that using the data centre for N8 HPC alone would mean a 10 year payback, so we need to be confident that N8 HPC would last this long. RLP said that it would buy us 10 years to make the decision.

The next steps are JANET talking to the IT directors and also engaging with major co-lo providers in the region. AR and RLP met with JANET in November. JANET are currently putting a framework in place over the next 6 months and then people can start buying capacity. There will be a workshop with IT directors and other interested parties and we need to make sure that the IT directors have visibility of this.

Action 7.15 – GS – find email from RLP and ensure the IT directors meeting is moving along.

5. User survey

GS gave a potted highlights version of the first draft of the user report. There were several comments and questions from the TMG. It wasn’t clear which helpdesk one of the questions referred to – was it the local helpdesk or the Leeds helpdesk? GS replied that it was the local helpdesk and that this will be made clearer in the final version. GS added that further investigation into the codes used by users was required. MD asked if people could only tick one box from open source, own and commercial. GS will check and let the TMG know.

Action 7.16 – GS – to make it clearer that users who used their local helpdesk were happy.

Action 7.17 – GS – let the TMG know the user choices in the question regarding code.

Action 7.18 – GS – once the user report is finalised, draw up an action plan from user feedback on how we will respond and react.

6. Regional HPC Network

GS gave a brief overview of the project and an explanation of the work package that N8 HPC is responsible for. A joint bid from all 5 regional HPC centres was submitted to EPSRC and was awarded. The bid will make it possible for the 5 regional centres to act in a coordinated manner by collaborating in certain areas, as well as providing funding for a coordinator, meetings and travel. There are 5 work packages, each of which is led by a regional centre – N8 HPC is leading the training package.

• Training – N8 HPC – map the training that currently takes place at the five centres and investigate the scope for sharing resources, joint delivery across the sites, and synergy with other training schemes such as the Software Sustainability Institute, the HPC Short Courses Consortium, and appropriate sectors of the EPSRC CDT portfolio. (Lead: N8)

• Outreach to Universities outside the Regional Centres – Archie-West

• Common Access Control – SES

• Strategic Interactions – MidPlus

• Business Engagement – HPC Midlands

7. N8 HPC Business Development staff event

GS gave an overview of the day and the aims. TMG were asked to suggest possible researcher presenters who were N8 HPC users and worked with industry. Damian Lacroix from Sheffield and George Barakos from Liverpool were suggested, with Gilberto Teobaldi as a possible second choice.

Action 7.19 – GS - follow up with the suggested names for the Business Engagement event speakers

8. AoB

There was a query about support - formal support from SGI ends on the 30th July 2016.