The SuperB Data and Computing Model
Machine performance
The model is based on the following machine parameters. The beam time assumes 9 months of operation per year and 19.3 days per month, i.e. about 1.5x10^7 seconds per year.
Year / Beam time (s/year) / Luminosity (pb-1 s-1) / Integrated luminosity (ab-1 per year)
2016 / 1.5x10^7 / 0.25 / 3.75
2017 / 1.5x10^7 / 0.7 / 10.5
2018 onwards / 1.5x10^7 / 1.0 / 15.0
Table 1: SuperB operation parameters used in this document
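As a cross-check, the integrated luminosity per year follows directly from the beam time and the instantaneous luminosity, using 1 ab-1 = 10^6 pb-1. A minimal sketch in Python, assuming only the Table 1 parameters:

    # Integrated luminosity per year from the Table 1 parameters.
    # Beam time: 9 months x 19.3 days x 86400 s ~ 1.5e7 s/year.
    BEAM_TIME_S = 1.5e7

    def integrated_lumi_ab(lumi_pb_per_s, beam_time_s=BEAM_TIME_S):
        """Integrated luminosity in ab^-1 (1 ab^-1 = 1e6 pb^-1)."""
        return lumi_pb_per_s * beam_time_s / 1e6

    for year, lumi in [("2016", 0.25), ("2017", 0.7), ("2018 onwards", 1.0)]:
        print(year, integrated_lumi_ab(lumi), "ab-1")  # 3.75, 10.5, 15.0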
Event Model
The cross section of the recorded events is 14 nb, i.e. 1.4x10^10 events/ab-1 (calculated from Fabrizio's presentation; assuming 200 KB per RAW event, this corresponds to 2.6 PB/ab-1 of RAW data; as a starting parameter it should be justified by physics). The size of the recorded events (RAW data) is assumed to be a factor of 3 larger than the BaBar one, i.e. 200 KB.
SuperB foresees a data model similar to BaBar's. Besides the RAW data there are two reconstructed data formats (tiers), the Mini and the Micro, produced by the reconstruction program. Their sizes are estimated to be a factor of 2 larger than the BaBar ones.
Monte Carlo simulation produces events similar to the Mini and the Micro, with sizes also assumed to be a factor of 2 larger than the BaBar ones. The amount of MC data produced each year is foreseen to be the same as the amount of data recorded by the detector.
Both data and Monte Carlo events are further processed to produce the data samples that are the main input for analysis (Skims). Their sizes are variable; the overall size is assumed to be a factor of 5 larger than the corresponding Micro.
Data tier / Description / Size (KB/evt) / Size (TB/ab-1)
RAW / Events produced by the detector / 200 / 2625
Mini / Reconstructed events / 12 / 160
Micro / Reduced size reconstructed events / 6.3 / 84
MC-Mini / Output of MC simulation / 11.8 / 156
MC-Micro / Reduced size simulated events / 7.7 / 102
Skims (data+MC) / Data optimized for group and user analysis / variable / 930
Table 2: Event sizes
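The TB/ab-1 column follows from the per-event sizes and the 1.4x10^10 events/ab-1 event yield. A minimal sketch, assuming a binary KB-to-TB conversion; it reproduces the Table 2 column to within a few percent:

    # Data volume per ab^-1 for each tier, from the Table 2 per-event sizes.
    # Assumes 1.4e10 events/ab^-1 (14 nb) and binary KB -> TB conversion.
    EVENTS_PER_AB = 1.4e10

    def tb_per_ab(size_kb_per_event):
        return size_kb_per_event * EVENTS_PER_AB / 1024**3  # KB -> TB

    for tier, size_kb in [("RAW", 200), ("Mini", 12), ("Micro", 6.3),
                          ("MC-Mini", 11.8), ("MC-Micro", 7.7)]:
        print(tier, round(tb_per_ab(size_kb)), "TB/ab-1")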
Are all events grouped in a single data stream? If there are multiple streams, is there an overlap factor?
Event processing
All processing powers are calculated assuming a factor of 3 more than BaBar. The different processing steps are:
Processing / Description / CPU (HS06 s/event) / CPU (KHS06 s/pb-1)
Reconstruction / Processing of RAW data to produce Mini, Micro and Calibrations / 18.9 / 265.2
MC production / Generation of events and production of Mini and Micro / 60 / 840
Skimming / Production of Skims from Mini and/or Micro / 8.7 / 122.4
Table 3: Production processing CPU needs
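The per-event costs convert to the per-luminosity column through the recorded cross section: 14 nb corresponds to 1.4x10^4 events per pb-1. A minimal sketch of the conversion, using the Table 3 per-event costs; the output agrees with the table to better than 1%:

    # Convert per-event CPU cost to per-luminosity cost.
    EVENTS_PER_PB = 1.4e4  # 14 nb recorded cross section

    def khs06_s_per_pb(hs06_s_per_event):
        return hs06_s_per_event * EVENTS_PER_PB / 1e3

    for step, cost in [("Reconstruction", 18.9), ("MC production", 60.0), ("Skimming", 8.7)]:
        print(step, khs06_s_per_pb(cost), "KHS06 s/pb-1")  # ~265, 840, ~122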
Reprocessing requires a disk buffer where the RAW data can be staged in.
The needs for analysis are extrapolated from the BaBar experience as a function of the collected luminosity, assuming the usual factor of 3 for the processing time:
Processing / Description / CPU power (KHS06/ab-1)
Data analysis / User analysis of data Mini/Micro/Skims / 14.4
MC analysis / User analysis of MC Mini/Micro/Skims / 15.6
Table 4: Analysis processing CPU needs
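Since the analysis load scales with the collected luminosity, the analysis CPU requirement grows with the accumulated dataset. A rough sketch under the assumption (not stated explicitly above) that the full dataset collected so far is analysed each year, using the Table 1 luminosity profile and the Table 4 unit costs:

    # Cumulative analysis CPU, assuming the load scales with the total
    # luminosity collected so far (an assumption for illustration).
    DATA_ANALYSIS_KHS06_PER_AB = 14.4
    MC_ANALYSIS_KHS06_PER_AB = 15.6

    total_ab = 0.0
    for year, lumi_ab in [(2016, 3.75), (2017, 10.5), (2018, 15.0)]:
        total_ab += lumi_ab
        cpu = total_ab * (DATA_ANALYSIS_KHS06_PER_AB + MC_ANALYSIS_KHS06_PER_AB)
        print(year, round(cpu, 1), "KHS06 for data+MC analysis")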
Event Flow
The data acquisition system should cope with the foreseen luminosities and recording cross section. The first processing of RAW data should happen at the same rate as data acquisition, possibly with some accepted latency.
Is there an express stream?
The data recording buffer at the experiment site should be able to keep 48 hours of data while they wait to be transferred to the site that performs the first reconstruction.
Year / Recorded σ (nb) / Luminosity (pb-1 s-1) / Event rate (kHz) / Data rate (GB/s) / 1st processing CPU power (KHS06) / Buffer size (TB)
2016 / 14 / 0.25 / 3.5 / 0.67 / 66 / 114
2017 / 14 / 0.7 / 9.8 / 1.87 / 186 / 316
2018 onwards / 14 / 1.0 / 14.0 / 2.67 / 265 / 451
Table 5: Needs for the real-time operations
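Each column of Table 5 follows from the cross section and the instantaneous luminosity: event rate = σ x L, data rate = event rate x 200 KB, first-processing CPU = event rate x 18.9 HS06 s/event, and the buffer holds 48 hours of data. A minimal sketch, assuming binary GB/TB conversions; it reproduces the table to within rounding:

    # Real-time needs (Table 5) from sigma = 14 nb and the instantaneous luminosity.
    SIGMA_PB = 14.0 * 1e3      # 14 nb = 1.4e4 pb
    EVENT_KB = 200.0
    RECO_HS06_S = 18.9
    BUFFER_HOURS = 48

    def realtime_needs(lumi_pb_per_s):
        rate_khz = SIGMA_PB * lumi_pb_per_s / 1e3          # events/s -> kHz
        data_gb_s = rate_khz * 1e3 * EVENT_KB / 1024**2    # KB/s -> GB/s
        cpu_khs06 = rate_khz * RECO_HS06_S                 # kHz x HS06 s/event
        buffer_tb = data_gb_s * BUFFER_HOURS * 3600 / 1024
        return rate_khz, data_gb_s, cpu_khs06, buffer_tb

    for year, lumi in [("2016", 0.25), ("2017", 0.7), ("2018 onwards", 1.0)]:
        print(year, ["%.2f" % v for v in realtime_needs(lumi)])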
RAW data are stored on tape at least at two different sites that take responsibility for their custody. The first reconstruction is done at one of these sites. During the first year of data taking it is foreseen to keep the whole RAW sample on disk while the understanding of the detector improves. During the following years the same disk pools will be used to host only a fraction of the RAW data and to provide the buffer for reprocessing.
Skimming and subsequent reconstructions are done at designated sites, as are Monte Carlo simulation and the corresponding skimming and re-reconstructions.
Each active copy of the Mini, Micro and Skims is stored on at least 2 sites that take responsibility for its custody. It is still to be decided whether one of the two copies will be on tape or both will be on disk. It is foreseen to have at any time 2 active versions of each sample of Mini, Micro and Skims. The following table shows the foreseen multiplicity of each data sample on the computing infrastructure. (The following is a first attempt.)
Sample / Active copies / Multiplicity of each active copy on disk / Multiplicity of each active copy on tape
RAW / 1 / 1 in 2016, only a fraction later / 2
Mini / 2 / 2 (or 1) / 0 (or 1)
Micro / 2 / 4 / 0
MC-Mini / 2 / 2 / 0
MC-Micro / 2 / 4 / 0
Skims / 2 / 2 / 0
Table 6: Multiplicity of data samples on the computing infrastructure
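Combining the Table 2 sizes with the Table 6 multiplicities gives the total storage footprint per ab-1: size x active copies x per-copy multiplicity, summed separately for disk and tape. A rough sketch, assuming the steady-state (post-2016) choices: no RAW on disk, 2 disk copies per active Mini, and nothing on tape except RAW:

    # Total storage per ab^-1 from Table 2 sizes and Table 6 multiplicities.
    # (sample, TB/ab^-1, active copies, disk multiplicity, tape multiplicity)
    SAMPLES = [
        ("RAW",      2625, 1, 0, 2),
        ("Mini",      160, 2, 2, 0),
        ("Micro",      84, 2, 4, 0),
        ("MC-Mini",   156, 2, 2, 0),
        ("MC-Micro",  102, 2, 4, 0),
        ("Skims",     930, 2, 2, 0),
    ]

    disk = sum(tb * act * d for _, tb, act, d, _ in SAMPLES)
    tape = sum(tb * act * t for _, tb, act, _, t in SAMPLES)
    print("Disk:", disk, "TB/ab-1;", "Tape:", tape, "TB/ab-1")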
Monte Carlo Simulation
It is foreseen to have as much Monte Carlo simulated data as real data. The outputs of the simulation program are Mini and Micro samples similar to those produced for real data, with the addition of the Monte Carlo truth information.
Describe full and fast simulation programs
Analysis Model
User analysis of any data sample is done mainly at the sites that host the data, even though remote access may be foreseen in special cases. User output is stored at designated sites. Interactive support to users is provided by the local institutes. It is foreseen that users are able to access any data tier, but it is expected that most of the analysis will be done on the Micro and the Skims.
Are there event-directories? What is the weight of remote data access?
Non event data
Describe the flow of conditions, calibration and alignment data.
Middleware and tools
Describe the main assumptions about the services available on the distributed infrastructure to have access to the resources.
Computing Model
The following is a first attempt to describe the infrastructure. All numbers are subject to revision (possibly substantial) following detailed simulations and formal agreements with the sites.
The following matrix lists the responsibilities described in the previous sections for each category of sites, with the percentage of each responsibility. For data hosting (on disk unless explicitly specified otherwise) the total is 100% times the multiplicity of the corresponding sample.
Functionality vs Site / CabibboLab / CNAF / PON sites (aggregated) / Tier-1s (aggregated) / Tier-2s (aggregated) / Total
DAQ buffer / 100% / 100%
Calibration / 100% / 100%
1st reconstruction / 50% / 50% / 100%
RAW data custodial (on tape) / 100% / 100% / 200%
RAW data buffer / 25% / 50% / 25% / 100%
Re-reconstruction / 25% / 50% / 25% / 100%
MC Production / 25% / 25% / 25% / 25% / 100%
Mini data hosting (MC and data) / 50% / 50% / 50% / 50% / 200%
Micro data hosting (MC and data) / 100% / 100% / 100% / 100% / 400%
Skimming / 25% / 50% / 25% / 100%
Skimmed data hosting (MC and data) / 100% / 100% / 100% / 100% / 400%
User analysis (MC and data) / 40% / 20% / 40% / 100%
User data hosting / 20% / 20% / 60% / 100%
Table 7: Matrix of functionalities and sites offering them
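The matrix can be checked for internal consistency: for each functionality the site shares must add up to the Total column (100% times the sample multiplicity for the data hosting rows). A minimal sketch, listing only the non-empty cells of each row without assigning them to specific sites:

    # Consistency check of Table 7: site shares must sum to the Total column.
    # Only the non-empty cells of each row are listed.
    MATRIX = {
        "DAQ buffer": ([100], 100),
        "Calibration": ([100], 100),
        "1st reconstruction": ([50, 50], 100),
        "RAW data custodial (tape)": ([100, 100], 200),
        "RAW data buffer": ([25, 50, 25], 100),
        "Re-reconstruction": ([25, 50, 25], 100),
        "MC production": ([25, 25, 25, 25], 100),
        "Mini data hosting": ([50, 50, 50, 50], 200),
        "Micro data hosting": ([100, 100, 100, 100], 400),
        "Skimming": ([25, 50, 25], 100),
        "Skimmed data hosting": ([100, 100, 100, 100], 400),
        "User analysis": ([40, 20, 40], 100),
        "User data hosting": ([20, 20, 60], 100),
    }

    for functionality, (shares, total) in MATRIX.items():
        assert sum(shares) == total, functionality
    print("All rows consistent")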
CabibboLab, where the experiment is hosted, provides a buffer to host the data for at least 48 hours and the CPU to perform the calibration with very low latency.
CNAF has a tape library that hosts a full copy of the RAW data and performs a fraction of the first reconstruction. A fraction of the RAW data is kept on disk for further processing steps and re-reconstructions. Furthermore, CNAF participates in MC production and skimming activities and hosts a fraction of the Mini, Micro and skimmed data for redistribution to other sites.
The PON centres in Italy have functionality similar to CNAF's, but they do not have a tape library and therefore do not offer RAW data custody. In addition, the PON centres offer CPU power for user analysis and host part of the user data.
Compared with the LHC computing models, CabibboLab, CNAF and the PON sites together offer the functionalities of a Tier-0 and a Tier-1.
It is foreseen to have X sites outside Italy with tape libraries that provide custody for a second copy of the RAW data and all the other production-like functionalities typically offered by the LHC Tier-1s. In addition, the Tier-1 centres offer a fraction of their CPU power for user analysis and host part of the user data.
Tier-2s contribute to MC production but their main role is to support user analysis.
Tier-3s are provided by each member institute and offer interactive access to the computing resources. They are not included in the model.
Resources
The simulation (currently an Excel spreadsheet) provides the parameters for each site category.