DUNE Data Challenge 1.5
- Executive Summary
- We began at 06:00 Fermilab time on January 19 and completed most file-transfer activity by 10:00 on January 20. We tested the full chain of data movement from the DAQ buffer in the experimental hall to tape at Fermilab. We saw significant improvement over previous tests and see room for continued improvement. In all, we transferred 27 TB during this period.
- EHN-1 to EOS phase:
- Xrootd server installed on data buffer machine np04-srv-001
- 10 disks in RAID 5, 55 TB usable space (out of the 40 disks that will eventually be configured).
- 2×10 Gbit/s bonded interface on a dedicated link to the CERN computer centre (CC).
- A set of mcc10 data (non-zero-suppressed detsim) had been copied to this machine.
- A shell script made symlinks to the data in the FTS-light dropbox and generated the corresponding metadata files (see the sketch below).
- We expect that in production the DAQ will put real files directly into the dropbox.
- FTS-light deletes the symlinks and/or files after each file is moved.
- For this test, files were put into the dropbox as symlinks to existing local files.
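- A minimal sketch of this dropbox-population step, assuming illustrative paths and a made-up two-field metadata schema; the actual metadata fields FTS-light expects are not reproduced here.
```python
import json
import os

DATA_DIR = "/data0/mcc10"     # hypothetical location of the staged mcc10 files
DROPBOX = "/data0/dropbox"    # hypothetical FTS-light dropbox directory

for name in os.listdir(DATA_DIR):
    src = os.path.join(DATA_DIR, name)
    link = os.path.join(DROPBOX, name)
    if not os.path.lexists(link):
        # FTS-light removes the symlink once the file has been moved.
        os.symlink(src, link)
    # One metadata file per data file, written alongside the symlink.
    meta = {
        "file_name": name,
        "file_size": os.path.getsize(src),
        "data_tier": "detector-simulated",  # assumed value for non-zero-suppressed detsim
    }
    with open(link + ".json", "w") as handle:
        json.dump(meta, handle)
```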
- We achieved a maximum throughput of 16Gbit/s using two machines, np04-srv-001 and np04-srv-003. (8Gbit/s on each).
- Sourcing from disk on np04-srv-001
- Sourcing from RAM disk on np04-srv-003
- The routing was correctly done across the dedicated network to building 513.
- Routing for one netblock had been fixed the preceding Thursday.
- We originally believed the slow 8 Gbit/s rate from np04-srv-001 was due to limitations of the disk array, but we then ran a test copying from RAM disk and got the same rate.
- Geoff Savage has been assigned to investigate further why the current load-balancing (bond0) configuration is not letting us reach the full 20 Gbit/s output rate. We would not expect a single transfer to go that fast, but we would expect to fill the pipe with the 25 simultaneous transfers we were running (a quick diagnostic sketch follows).
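- As a starting point for that investigation, a diagnostic sketch assuming the standard Linux bonding driver on np04-srv-001: it prints the bonding mode, transmit hash policy, and per-slave link status, since whether 25 parallel streams spread across both 10 Gbit/s members depends largely on the mode and hash policy reported here.
```python
# Print the bonding mode, hash policy, and per-slave status for bond0.
# Assumes the Linux bonding driver; run on the data-buffer machine.
BOND_STATUS = "/proc/net/bonding/bond0"

with open(BOND_STATUS) as status:
    for line in status:
        line = line.strip()
        if line.startswith(("Bonding Mode:", "Transmit Hash Policy:",
                            "Slave Interface:", "MII Status:", "Speed:")):
            print(line)
```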
- FTS-light was used to do the transfers. During the testing phase we discovered three places where the code was hanging.
- These have all been fixed; for the last six hours of the test we transferred data continuously with no further hangs and no human intervention.
- FTS-light used xrdfs/xrdcp to copy, list, and delete the files (example invocations are sketched below).
- It was not possible to use third-party xrdcp to transfer from the xrootd server on np04-srv-001 to EOS.
- We initially believed this might be an authentication issue, but it turned out to be a newly discovered bug in the current version of EOS whereby it cannot do third-party xrdcp of files 4 GB and larger. (The same bug affects EOS-to-EOS transfers.)
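- For reference, a sketch of the xrdfs/xrdcp operations involved, driven from Python via subprocess. The server names are placeholders; --tpc is the standard xrdcp flag requesting a third-party copy, the mode that currently fails for files of 4 GB and larger.
```python
import subprocess

SRC = "root://np04-srv-001.cern.ch"   # hypothetical buffer-node xrootd endpoint
DST = "root://eospublic.cern.ch"      # hypothetical EOS endpoint

def listdir(server, path):
    """List a directory with xrdfs ls."""
    out = subprocess.check_output(["xrdfs", server, "ls", path])
    return out.decode().split()

def copy(src_url, dst_url, third_party=False):
    """Copy one file; --tpc asks the two servers to move the data directly."""
    cmd = ["xrdcp", "--force"]
    if third_party:
        cmd += ["--tpc", "only"]      # third-party mode: currently broken for >= 4 GB files
    subprocess.check_call(cmd + [src_url, dst_url])

def remove(server, path):
    """Delete a file (or a dropbox symlink) with xrdfs rm."""
    subprocess.check_call(["xrdfs", server, "rm", path])
```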
- There was an issue using a keytab to authenticate to Kerberos (fixed as of Monday night).
- Authentication for FTS-light:
- User credential for FTS-light
- The kluge for DC1.5 was a manual copy of the /tmp/krb5cc* credential-cache file; the keytab-based approach is sketched below.
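- A sketch of the keytab-based approach that replaces the manual cache copy, with placeholder keytab, principal, and cache paths: kinit -kt writes a fresh ticket into a cache that FTS-light is then pointed at via KRB5CCNAME.
```python
import os
import subprocess

KEYTAB = "/opt/fts-light/fts-light.keytab"   # placeholder keytab path
PRINCIPAL = "ftslight/np04@CERN.CH"          # placeholder service principal
CACHE = "FILE:/tmp/krb5cc_fts_light"         # dedicated credential cache for FTS-light

# Obtain (or renew) the ticket; run this periodically, e.g. from cron,
# so the credential never expires in the middle of a transfer.
env = dict(os.environ, KRB5CCNAME=CACHE)
subprocess.check_call(["kinit", "-kt", KEYTAB, PRINCIPAL], env=env)
```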
- The FTS-light monitoring was not able to contact Fermilab because FTS-light is now on an internal network that is routable only within CERN. We knew this would happen; steps are being taken inside Fermilab to give us a web proxy gateway to send to, and we have a forwarder inside CERN as well.
- Other monitoring that was tested and worked correctly:
- Web access to FTS-light
- Web access to FTS
- CERN web monitoring was tried by Xavi and posted on the Slack channel.
- FTS Work with EOS
- The FTS-light flow coming from EHN-1 was not enough to test the full rate we wanted, so we added 4 TB more to the dropbox all at once and started FTS cold.
- FTS configured to send to Castor, final EOS location, and dCache
- EOS control tower monitoring showed activity
- FTS was initially set to do 50 simultaneous copies to each of Castor, the final EOS location, and dCache.
- This crashed and burned: running as a non-privileged user, FTS ran out of file descriptors.
- FTS runs on its own VM.
- We have root on this VM, so we can increase the limit up to whatever the VM allows (see the sketch below).
- 25 simultaneous copies to each destination was stable.
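- A minimal sketch of the file-descriptor check behind the 50-versus-25 question; the target value is illustrative, and raising the hard limit itself still requires root (e.g. via /etc/security/limits.conf), which we have on this VM.
```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limits: soft=%d hard=%d" % (soft, hard))

TARGET = 4096   # illustrative headroom for 50 simultaneous copies per destination
if soft < TARGET:
    # Any process may raise its soft limit up to the hard limit;
    # pushing the hard limit itself higher needs root.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(TARGET, hard), hard))
```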
- EOS to EOS rate
- About 1 Gbyte/s EOS-to-EOS
- Some third-party xrdcp copies had errors due to the EOS bug discovered this weekend (see above). We are leaving these files in place to retry once the fix is in.
- Aside: a restart of FTS loses the plot history, but the information is also sent to FIFEMON.
- EOS to Castor
- Active and worked.
- With 25 simultaneous files, we sustained 1.68 GByte/s to CASTOR.
- Due to FTS's slow startup, many of the CASTOR copies were done before the EOS-to-EOS and EOS-to-dCache transfers got started.
- Later versions of FTS will allow better control of this.
- EOS to dCache
- Transfers went to read/write dCache.
- If we had used scratch dCache, FTS would wait forever for the file to make it to tape.
- (This can be configured differently…)
- The average was about 500 MByte/s.
- There was a peak of more than 1 GByte/s.
- But we don't yet understand the edge effects.
- What has to be addressed
- Web Proxy for FTS-light to push out to FIFEMON
- Joe Boyd has offered to help
- Apparently some solution exists (polymur?)
- RITM628463
- FTS-light has to be modified to send an HTTP POST (sketch below).
- This might also need to be done for FTS.
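- A sketch of the required change, with a placeholder proxy host, collector URL, and JSON payload (the real FIFEMON interface may differ): route the POST through the web proxy gateway using a ProxyHandler.
```python
import json
import urllib.request

PROXY = "http://web-proxy.fnal.gov:3128"         # placeholder proxy gateway
ENDPOINT = "https://fifemon.fnal.gov/fts-light"  # placeholder collector URL

# Send all monitoring traffic through the proxy rather than connecting directly.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY}))

payload = json.dumps({"instance": "np04", "files_transferred": 123}).encode()
request = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
opener.open(request, timeout=30)
```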
- Fix keytab for FTS-light (Steve) (DONE)
- Fix channel bonding (Geoff and CERN networks)
- Test third-party transfer of large files once CERN states that the bug is fixed (EOS developers; Stu can test)
- CERN incident 1572421
- CERN RQF0926575
- Raise the file-descriptor limits on the FTS VM to handle 50 simultaneous files
- Run a test narrowed down to sending only to Fermilab and check dCache rates
- Use Igor's script, which copies files from an EOS location back into the dropbox (a sketch of the idea follows)
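- A minimal sketch of that idea (Igor's actual script may differ; endpoints and paths are placeholders): list files under an EOS path and copy each one back into the dropbox so FTS sees them as new work.
```python
import os
import subprocess

EOS = "root://eospublic.cern.ch"                    # placeholder EOS endpoint
SOURCE_DIR = "/eos/experiment/np04/dc15/staged"     # placeholder source area
DROPBOX_DIR = "/eos/experiment/np04/dc15/dropbox"   # placeholder dropbox area

# Copy each already-transferred file back into the dropbox to regenerate load.
listing = subprocess.check_output(["xrdfs", EOS, "ls", SOURCE_DIR]).decode().split()
for path in listing:
    dest = DROPBOX_DIR + "/" + os.path.basename(path)
    subprocess.check_call(["xrdcp", "--force", EOS + "/" + path, EOS + "/" + dest])
```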
- Install new version of FTS-light
- The new version has graphs
- Once third-party copies work, move back to np04-srv-003
- Explore why xrdfs rm (the delete step) doesn't like symlinks
- Effects of DAQ writing to the disk buffer machine while we are reading remain to be explored.
- nginx proxy configuration
- Add additional prefixes to reach the other instances of FTS-light and FTS
- METRICS: where we stand
- The goal is 2.5 GByte/s (20 Gbit/s) through the whole system.
- 16 Gbit/s from two machines on the EHN1-to-EOS link is not bad; we believe we can get to 20 Gbit/s with further assistance from networking.
- On the CERN-to-FNAL link we now sustain 0.5 GByte/s (a factor of 5 improvement over the last test).
- We can gain roughly a factor of two by going to 50 simultaneous transfers rather than 25.
- We will then look at the network configuration on the FNAL end to be sure it is optimal before trying to push to any higher rate, and we will have to coordinate with dCache operations as well.
- Over the weekend we benefitted from a dCache that was otherwise nearly idle, just after a long downtime and before everyone had ramped back up again.