DUNE Data Challenge 1.5
- Executive Summary
- We began at 06:00 Fermilab time on January 19 and completed most file-transfer activity by 10:00 on January 20. We tested the full chain of data movement from the DAQ buffer in the experimental hall to tape at Fermilab. We saw significant improvement over previous tests and see room for continued improvement. In all, we transferred 27 TB during this period.
- EHN-1 to EOS phase:
- Xrootd server installed on data buffer machine np04-srv-001
- 10 disks in RAID 5, 55 TB usable space (out of the 40 disks that will eventually be configured).
- 2×10 Gbit/s bonded interface on a dedicated link to the CERN computer centre (CC).
- A set of mcc10 data (non-zero-suppressed detsim) had been copied to this machine.
- A shell script made symlinks to the data in the FTS-light dropbox and generated the corresponding metadata files (see the sketch below).
- We expect that in production the DAQ will put real files directly into the dropbox.
- FTS-light deletes the symlinks and/or files after each file is moved.
- For this test, files were put into the dropbox as symlinks to existing local files.
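- A minimal sketch of this dropbox-population step, assuming illustrative paths and a made-up two-field metadata schema; the actual metadata fields FTS-light expects are not reproduced here.
```python
import json
import os

DATA_DIR = "/data0/mcc10"     # hypothetical location of the staged mcc10 files
DROPBOX = "/data0/dropbox"    # hypothetical FTS-light dropbox directory

for name in os.listdir(DATA_DIR):
    src = os.path.join(DATA_DIR, name)
    link = os.path.join(DROPBOX, name)
    if not os.path.lexists(link):
        # FTS-light removes the symlink once the file has been moved.
        os.symlink(src, link)
    # One metadata file per data file, written alongside the symlink.
    meta = {
        "file_name": name,
        "file_size": os.path.getsize(src),
        "data_tier": "detector-simulated",  # assumed value for non-zero-suppressed detsim
    }
    with open(link + ".json", "w") as handle:
        json.dump(meta, handle)
```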
- We achieved a maximum throughput of 16Gbit/s using two machines, np04-srv-001 and np04-srv-003. (8Gbit/s on each).
- Sourcing from disk on np04-srv-001
- Sourcing from RAM disk on np04-srv-003
- The routing was correctly done across the dedicated network to building 513.
- Routing for one netblock had been fixed the preceding Thursday.
- We originally believed the slow 8 Gbit/s rate from np04-srv-001 was due to limitations of the disk array, but we then ran a test copying from RAM disk and got the same rate.
- Geoff Savage has been assigned to investigate further why the current load-balancing (bond0) configuration is not letting us reach the full 20 Gbit/s output rate. We would not expect a single transfer to go that fast, but we would expect to fill the pipe with the 25 simultaneous transfers we were running (a quick diagnostic sketch follows).
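- As a starting point for that investigation, a diagnostic sketch assuming the standard Linux bonding driver on np04-srv-001: it prints the bonding mode, transmit hash policy, and per-slave link status, since whether 25 parallel streams spread across both 10 Gbit/s members depends largely on the mode and hash policy reported here.
```python
# Print the bonding mode, hash policy, and per-slave status for bond0.
# Assumes the Linux bonding driver; run on the data-buffer machine.
BOND_STATUS = "/proc/net/bonding/bond0"

with open(BOND_STATUS) as status:
    for line in status:
        line = line.strip()
        if line.startswith(("Bonding Mode:", "Transmit Hash Policy:",
                            "Slave Interface:", "MII Status:", "Speed:")):
            print(line)
```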
- FTS-light was used to do the transfers. During the testing phase we discovered three places where the code was hanging.
- These have all been fixed; for the last six hours of the test we transferred data continuously with no further hangs and no human intervention.
- FTS-light used xrdfs/xrdcp to copy, list, and delete the files (example invocations are sketched below).
- It was not possible to use third-party xrdcp to transfer from the xrootd server on np04-srv-001 to EOS.
- We initially believed this might be an authentication issue, but it turned out to be a newly discovered bug in the current version of EOS whereby it cannot do third-party xrdcp of files 4 GB and larger. (The same bug affects EOS-to-EOS transfers.)
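- For reference, a sketch of the xrdfs/xrdcp operations involved, driven from Python via subprocess. The server names are placeholders; --tpc is the standard xrdcp flag requesting a third-party copy, the mode that currently fails for files of 4 GB and larger.
```python
import subprocess

SRC = "root://np04-srv-001.cern.ch"   # hypothetical buffer-node xrootd endpoint
DST = "root://eospublic.cern.ch"      # hypothetical EOS endpoint

def listdir(server, path):
    """List a directory with xrdfs ls."""
    out = subprocess.check_output(["xrdfs", server, "ls", path])
    return out.decode().split()

def copy(src_url, dst_url, third_party=False):
    """Copy one file; --tpc asks the two servers to move the data directly."""
    cmd = ["xrdcp", "--force"]
    if third_party:
        cmd += ["--tpc", "only"]      # third-party mode: currently broken for >= 4 GB files
    subprocess.check_call(cmd + [src_url, dst_url])

def remove(server, path):
    """Delete a file (or a dropbox symlink) with xrdfs rm."""
    subprocess.check_call(["xrdfs", server, "rm", path])
```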
- There was an issue using a keytab to authenticate to Kerberos (fixed as of Monday night).
- Authentication for FTS-light:
- User credential for FTS-light
- The kluge for DC1.5 was a manual copy of the /tmp/krb5cc* credential-cache file; the keytab-based approach is sketched below.
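- A sketch of the keytab-based approach that replaces the manual cache copy, with placeholder keytab, principal, and cache paths: kinit -kt writes a fresh ticket into a cache that FTS-light is then pointed at via KRB5CCNAME.
```python
import os
import subprocess

KEYTAB = "/opt/fts-light/fts-light.keytab"   # placeholder keytab path
PRINCIPAL = "ftslight/np04@CERN.CH"          # placeholder service principal
CACHE = "FILE:/tmp/krb5cc_fts_light"         # dedicated credential cache for FTS-light

# Obtain (or renew) the ticket; run this periodically, e.g. from cron,
# so the credential never expires in the middle of a transfer.
env = dict(os.environ, KRB5CCNAME=CACHE)
subprocess.check_call(["kinit", "-kt", KEYTAB, PRINCIPAL], env=env)
```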
- The FTS-light monitoring was not able to contact Fermilab because FTS-light is now on an internal network that is routable only within CERN. We knew this would happen; steps are being taken inside Fermilab to give us a web proxy gateway to send to, and we have a forwarder inside CERN as well.
- Other monitoring that was tested and worked correctly:
- Web access to FTS-light
- Web access to FTS
- CERN web monitoring was tried by Xavi and posted on the Slack channel.
- FTS Work with EOS
- The FTS-light flow coming from EHN-1 was not enough to test the full rate we wanted, so we added 4 TB more to the dropbox all at once and started FTS cold.
- FTS configured to send to Castor, final EOS location, and dCache
- EOS control tower monitoring showed activity
- FTS was initially set to do 50 simultaneous copies to each of Castor, the final EOS location, and dCache.
- This crashed and burned: running as a non-privileged user, FTS ran out of file descriptors.
- FTS runs on its own VM.
- We have root on this VM, so we can increase the limit up to whatever the VM allows (see the sketch below).
- 25 simultaneous copies to each destination was stable.
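- A minimal sketch of the file-descriptor check behind the 50-versus-25 question; the target value is illustrative, and raising the hard limit itself still requires root (e.g. via /etc/security/limits.conf), which we have on this VM.
```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limits: soft=%d hard=%d" % (soft, hard))

TARGET = 4096   # illustrative headroom for 50 simultaneous copies per destination
if soft < TARGET:
    # Any process may raise its soft limit up to the hard limit;
    # pushing the hard limit itself higher needs root.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(TARGET, hard), hard))
```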
- EOS to EOS rate
- About 1 Gbyte/s EOS-to-EOS
- Some third-party xrdcp copies had errors due to the EOS bug discovered this weekend (see above). We are leaving these files in place to retry once the fix is in.
- Aside: a restart of FTS loses the plot history, but the information is also sent to FIFEMON.
- EOS to Castor
- Active and worked.
- With 25 simultaneous files, we sustained 1.68 GByte/s to CASTOR.
- Due to FTS's slow startup, many of the CASTOR copies were done before the EOS-to-EOS and EOS-to-dCache transfers got started.
- Later versions of FTS will allow better control of this.
- EOS to dCache
- Transfers went to read/write dCache.
- If we had used scratch dCache, FTS would wait forever for the file to make it to tape.
- (This can be configured differently…)
- The average was about 500 MByte/s.
- There was a peak of more than 1 GByte/s.
- But we don't yet understand the edge effects.
- What has to be addressed
- Web Proxy for FTS-light to push out to FIFEMON
- Joe Boyd has offered to help
- Apparently some solution exists (polymur?)
- RITM628463
- FTS-light has to be modified to send an HTTP POST (sketch below).
- This might also need to be done for FTS.
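- A sketch of the required change, with a placeholder proxy host, collector URL, and JSON payload (the real FIFEMON interface may differ): route the POST through the web proxy gateway using a ProxyHandler.
```python
import json
import urllib.request

PROXY = "http://web-proxy.fnal.gov:3128"         # placeholder proxy gateway
ENDPOINT = "https://fifemon.fnal.gov/fts-light"  # placeholder collector URL

# Send all monitoring traffic through the proxy rather than connecting directly.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY}))

payload = json.dumps({"instance": "np04", "files_transferred": 123}).encode()
request = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
opener.open(request, timeout=30)
```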
- Fix keytab for FTS-light (Steve) (DONE)
- Fix channel bonding (Geoff and CERN networks)
- Test third-party transfer of large files once CERN states that the bug is fixed (EOS developers; Stu can test)
- CERN incident 1572421
- CERN RQF0926575
- Raise the file-descriptor limits on the FTS VM to handle 50 simultaneous files
- Run a test narrowed down to sending only to Fermilab and check dCache rates
- Use Igor's script, which copies files from an EOS location back into the dropbox (a sketch of the idea follows)
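- A minimal sketch of that idea (Igor's actual script may differ; endpoints and paths are placeholders): list files under an EOS path and copy each one back into the dropbox so FTS sees them as new work.
```python
import os
import subprocess

EOS = "root://eospublic.cern.ch"                    # placeholder EOS endpoint
SOURCE_DIR = "/eos/experiment/np04/dc15/staged"     # placeholder source area
DROPBOX_DIR = "/eos/experiment/np04/dc15/dropbox"   # placeholder dropbox area

# Copy each already-transferred file back into the dropbox to regenerate load.
listing = subprocess.check_output(["xrdfs", EOS, "ls", SOURCE_DIR]).decode().split()
for path in listing:
    dest = DROPBOX_DIR + "/" + os.path.basename(path)
    subprocess.check_call(["xrdcp", "--force", EOS + "/" + path, EOS + "/" + dest])
```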
- Install new version of FTS-light
- The new version has graphs
- Once third-party copies work, move back to np04-srv-003
- Explore why xrdfs rm (the delete step) doesn't like symlinks
- Effects of DAQ writing to the disk buffer machine while we are reading remain to be explored.
- nginx proxy configuration
- Add additional prefixes to reach the other instances of FTS-light and FTS
- METRICS: where we stand
- The goal is 2.5 GByte/s (20 Gbit/s) through the whole system.
- 16 Gbit/s from two machines on the EHN1-to-EOS link is not bad; we believe we can get to 20 Gbit/s with further assistance from networking.
- On the CERN-to-FNAL link we now sustain 0.5 GByte/s (a factor of 5 improvement over the last test).
- We can gain roughly a factor of two by going to 50 simultaneous transfers rather than 25.
- We will then look at the network configuration on the FNAL end to be sure it is optimal before trying to push to any higher rate, and we will have to coordinate with dCache operations as well.
- Over the weekend we benefitted from a dCache that was otherwise nearly idle, just after a long downtime and before everyone had ramped back up again.