DAQ crib sheet for physicists
Draft, May 2017 (tweaked July 2017)
This short document describes the essential features of the DUNE DAQ, as far as they are currently known, to aid in writing simulations and to facilitate comments and ideas as the design progresses. New technology keeps becoming available, which is why the design is not yet completely fixed.
General properties: We continuously collect data from the TPC (at 2 MHz) and from the photon detectors (PD) (currently at O(150 MHz)). The TPC data arrives in units we will call 'patches', each corresponding to one electronics board, i.e. one connector on the feedthrough of the cryostat. On ProtoDUNE this is currently one fifth of an APA (i.e. one flange connector, also one WIB).
The description of the readout sequence uses the concept of a trigger with different levels: L1, L2, L3, etc. Depending on which option is chosen, more or less of this processing is done in firmware or in software. All the data is buffered losslessly [no zero suppression], while a small fraction containing information about the location and size of the hits is sent to an L1 trigger processor, which selects interesting regions of interest (ROI) [an L1-triggered event is the set of APAs in the ROI in a certain time window]. Each L1 trigger causes an event number to be assigned and the corresponding lossless buffered data, including the photon detectors, for the entire L1-triggered event to be collected and sent to L2. At L2 the data for the whole event can be looked at in one place for the first time (CPU, GPU, or even FPGA); L2 can either keep the whole lot, reject it, or apply a cookie-cutter to remove uninteresting data. The hit information from L1 for this event is included in the data stream, so it can guide any cookie-cutting. The L2 events are then passed to L3, where a sophisticated signal processing scheme (a la Brett Viren) may be used in a computing farm to reduce the size of the events for permanent storage. All of L1 and some parts of L2 run in real time, and it is critical that these processes do not block (i.e. they have a fixed maximum time they take) to avoid later data being missed. The L1 latency is the time the uncompressed data can be kept before being overwritten; the L1 trigger must be delivered within this time.
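As a purely illustrative picture of what passes from L1 to L2, the Python sketch below shows the kind of record an L1 trigger could produce: an event number, a time window, the APAs in the ROI, and the hit summaries that guided the decision. All names and field choices are assumptions made for illustration; none of them is a defined DUNE interface.

    # Illustrative sketch only: names and fields are assumptions, not a defined interface.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class L1Hit:              # compact hit summary sent to the L1 trigger processor
        channel: int          # channel within the patch
        time_tick: int        # 2 MHz sample index
        height: int           # pulse height estimate
        width: int            # pulse width estimate

    @dataclass
    class L1Trigger:          # an L1-triggered "event"
        event_number: int     # assigned when the L1 decision is made
        t_start_tick: int     # start of the readout window (2 MHz ticks)
        t_end_tick: int       # end of the readout window
        roi_apas: List[int]   # APAs whose lossless buffers are to be collected
        hits: List[L1Hit]     # kept in the data stream so L2 can guide cookie-cutting

    def build_data_requests(trig: L1Trigger) -> List[Tuple[int, int, int]]:
        """Expand an L1 trigger into per-APA requests for the buffered TPC and PD data."""
        return [(apa, trig.t_start_tick, trig.t_end_tick) for apa in trig.roi_apas]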
Goals: The aim is to collect all the TPC data in a time window of at least one drift, with no loss of ADC information at L1 and L2, for a generously sized ROI around all of the following event types: neutrino beam, cosmic, atmospheric neutrino, proton decay candidate, etc. To optimize the reach of low energy physics analyses, we have a second category of event where the threshold of acceptable hit patterns is lower, the ROI sent from L1 to L2 is smaller (but still at least one full APA) and the L2 rejection is tighter (to be optimized by future studies). The aim would be to have enough bandwidth to send a good bite into the combinatorial radioactivity background to L2, so that anything of higher energy than this can be kept. The collection of the full APA at L1->L2 will hopefully (to be studied) capture photon, neutron and Michel electron hits surrounding the main low energy event, without those hits having to be identified individually at L1. The aim is that this low energy trigger forms one strand of our supernova data collection capability and is continuously active. Another supernova collection strategy, which allows a central trigger decision that a supernova is occurring, is outlined in some of the options described below.
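One minimal way to carry the two event categories into a simulation is as two parameter sets, as in the sketch below; apart from the at-least-one-APA ROI floor for the low energy category, every number here is a placeholder to be fixed by the optimization studies mentioned above.

    # Placeholder parameter sets for the two trigger categories. All values other
    # than the >= 1 APA ROI floor are assumptions to be tuned by future studies.
    from dataclasses import dataclass

    @dataclass
    class TriggerCategory:
        name: str
        hit_threshold: int      # acceptability threshold on hit patterns (arbitrary units)
        min_roi_apas: int       # minimum ROI size sent from L1 to L2
        l2_rejection: float     # relative tightness of the L2 rejection

    HIGH_ENERGY = TriggerCategory("beam/cosmic/atmospheric/proton-decay",
                                  hit_threshold=60, min_roi_apas=3, l2_rejection=1.0)
    LOW_ENERGY = TriggerCategory("low-energy / supernova strand",
                                 hit_threshold=20, min_roi_apas=1, l2_rejection=10.0)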
Photon detectors: We need to think through photon detector triggering and readout more thoroughly. When an L1/L2 trigger from the TPC occurs, we could aim to collect individual PD hits (location, time, pulse height, not the full waveform) and send them to L2 for filtering. In parallel, we could foresee a trigger based on the fast photon detector signals (called e.g. L0) that acts rapidly in hardware and allows the photon detector hardware to save waveforms for time windows of O(few us), so that they have already been saved when a TPC L1 trigger arrives. We can also devise a trigger independent of the TPC L1 trigger, i.e. one where a sufficiently compelling configuration of photon detector hits causes all the data in a suitably defined ROI to be written to L2. (Many studies are needed here, e.g. what window? what hit threshold? how much background does that give? how to define the APA ROI from the photon hits?) This independent photon detector trigger is also an essential diagnostic (e.g. for the TPC trigger efficiency measurement).
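The sketch below is one way to picture the photon detector flow described above: compact PD hits (location, time, pulse height) for L2 filtering, plus a fast L0 decision that flags an O(few us) waveform window to be kept so it is still available if a TPC L1 trigger arrives later. The thresholds, window length and all names are assumptions; they correspond to the open questions listed above.

    # Illustrative photon detector flow; thresholds, window length and names are assumptions.
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class PDHit:
        channel: int          # photon detector channel (location)
        time_ns: int          # hit time
        pulse_height: float   # summary only, not the full waveform

    def l0_fires(hits: List[PDHit], threshold: float = 5.0, min_hits: int = 3) -> bool:
        """Fast hardware-style decision: enough hits above a pulse-height threshold."""
        return sum(1 for h in hits if h.pulse_height > threshold) >= min_hits

    def l0_waveform_window(hits: List[PDHit], window_us: float = 4.0) -> Optional[Tuple[int, int]]:
        """If L0 fires, return the O(few us) window of waveforms to save; otherwise None."""
        if not l0_fires(hits):
            return None
        t0 = min(h.time_ns for h in hits)
        return (t0, t0 + int(window_us * 1000))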
We now run through some options for implementation; the ordering does not indicate favoritism. The buffering capacity and the latency needed for the above types of triggers are examples of the parameters that are important for the simulation. Note that the raw uncompressed data rate is 7.68 GB/s per APA, or 1.6 GB/s per patch.
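As a cross-check, the quoted rates follow from the 2 MHz sampling if one assumes 2560 channels per APA and 12-bit ADC samples; those two inputs are assumptions consistent with the numbers above, not figures stated in this note.

    # Cross-check of the quoted raw data rates (2560 ch/APA and 12-bit samples are assumptions).
    channels_per_apa = 2560
    sample_rate_hz = 2.0e6          # 2 MHz, from the text
    bits_per_sample = 12

    apa_rate = channels_per_apa * sample_rate_hz * bits_per_sample / 8   # bytes/s
    patch_rate = apa_rate / 5                                            # a patch is 1/5 of an APA

    print(f"{apa_rate / 1e9:.2f} GB/s per APA")      # 7.68 GB/s per APA
    print(f"{patch_rate / 1e9:.2f} GB/s per patch")  # ~1.5 GB/s per patch (quoted as 1.6)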
Option 1: Each patch board (1/5 of an APA as described in the introduction) contains an FPGA, 800 MB of buffer memory and a 1 Gbit/s Ethernet (or similar) link. The FPGA applies a lossless factor-4-ish compression algorithm and hunts for hit combinations to send to the L1 trigger. This memory size allows a buffer of 2 s, which could be divided into a 0.5 s ring buffer and 1.5 s of longer-term buffering. Assume a trigger hit is represented by 32 bits (say 9 for channel number, 10 for time within a 0.5 us window, 8 for height and 4 for width). [We apply a hard limit on the number of trigger hits, say one hit per channel per 0.5 us; anything above this causes the event to be sent automatically to L2 to filter obvious pickup zinging.] The L2 farm only receives data triggered by L1 and so can be smallish. The L1 computation time is important and must be less than the ring-buffer time.
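To make the option 1 numbers concrete, the sketch below packs a trigger hit into the suggested 32-bit word (9 + 10 + 8 + 4 = 31 bits, leaving one spare bit) and checks the buffer depth: 800 MB at a factor-4 compression of the 1.6 GB/s patch rate gives the quoted ~2 s. The field ordering in the packing is an arbitrary choice for illustration.

    # Option 1 arithmetic and one possible 32-bit trigger-hit packing
    # (the field order is an arbitrary illustrative choice; 31 of the 32 bits are used).
    def pack_hit(channel: int, time: int, height: int, width: int) -> int:
        assert channel < 2**9 and time < 2**10 and height < 2**8 and width < 2**4
        return (channel << 22) | (time << 12) | (height << 4) | width

    def unpack_hit(word: int):
        return ((word >> 22) & 0x1FF, (word >> 12) & 0x3FF, (word >> 4) & 0xFF, word & 0xF)

    # Buffer depth: 800 MB of memory, 1.6 GB/s raw per patch, factor-4-ish compression.
    buffer_bytes = 800e6
    patch_rate = 1.6e9              # bytes/s, from the text
    compression = 4.0               # "factor-4-ish"
    print(f"{buffer_bytes / (patch_rate / compression):.1f} s of buffering")  # 2.0 s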
Option 2: Compared to option 1, increase the outbound network bandwidth to 4 x 1 Gb/s links (or one 10 Gb/s link). This allows all the data to be sent, compressed, to the L2 data farm. The FPGA still does the trigger hit finding for L1. This allows a much larger memory buffer in a computer, extending the L1 latency and the buffering available for supernova triggers (to tens of seconds). The FPGA must still compress, but not much memory is needed on the FPGA. We could also extend to 2 x 10 Gb/s so that the FPGA does not need to compress at all [or to 10 Gb/s so that only a factor-2 compression is needed at L1] (de-risking the case where there is extra noise to cope with), with better compression then done on the L2 computers.
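The link counts in option 2 can be checked the same way: the raw 1.6 GB/s per patch is 12.8 Gb/s, so factor-4 compression needs 3.2 Gb/s (fits 4 x 1 Gb/s), factor-2 needs 6.4 Gb/s (fits one 10 Gb/s link), and uncompressed data fits 2 x 10 Gb/s, as in the short check below.

    # Link-budget check for option 2 (per patch; 1.6 GB/s raw, from the text).
    raw_gbps = 1.6 * 8                                   # 12.8 Gb/s uncompressed
    for compression, link_gbps in [(4, 4 * 1), (2, 10), (1, 2 * 10)]:
        needed = raw_gbps / compression
        ok = "OK" if needed <= link_gbps else "too little"
        print(f"x{compression} compression: need {needed:.1f} Gb/s, have {link_gbps} Gb/s -> {ok}")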
Option 3: Similar to option 2, but the FPGA does not do the trigger hit finding for L1. This is a significant downgrade compared to option 2, because the computers receiving all the data have to read through all of it (a lot of processing, especially if decompression is needed first).
Option 4: Increase the memory buffering on the FPGA board compared to option 1, e.g. to 16 GB, which allows 32 s of storage. This is an alternative way to achieve the extensions listed for option 2.
Note that option 1 is inspired by the RCE/COB system from SLAC; indeed it can be run on the existing Gen3 RCE system. Option 3 is comparable to FELIX and could be implemented with FELIX PCIe cards or with Ethernet switching. Options 2 and 4 fit between these implementations: e.g. a customized DUNE RCE merged with an upgraded WIB, running the FELIX philosophy (i.e. FELIX but replacing the GBTx links with Ethernet), or some combination like that.
Suggestion: From the physics perspective, we should try to fit within option 1 as far as latency and feasibility tests of software are concerned, as this is the most basic option from the hardware point of view. SNB studies may want or need to extend a bit beyond this. From the study of the hardware choices, we should strive for something in the direction of option 2, which is the most flexible and upgradable of the options given.