Simulation Format Naming Convention #1
I summarized my "proposal" from the call. I have already modified some names.

3 Stages of Simulation Data

1. Stage:

Group/Column Name | Data Format (each row) | Dimension | Description | example |
---|---|---|---|---|
mc_evt_id | integer | 1 | MC Event ID | 1 |
det_id | integer | 1 | MC Detector ID of the detector in which the hit (energy deposition) occurs | 1 |
pos | 3 element vector | length | Position of the hit | [0.01, 0.02, 0.014] * u"m" |
edep | float | energy | Deposited energy at the hit position | 1460.0 * u"keV" |
thit | float | time | Timestamp of the hit | 124.0 * u"s" |

Hits are clustered in time by about 1 ns (all energy depositions within 1 ns will be drifted together). The mean of the individual timestamps of the clustered hits will be the timestamp of the germanium-detector-event in the next stage of the simulation.
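
To make the clustering rule above concrete, here is a minimal Python sketch. It assumes the hits of one MC event in one detector are given as plain lists of times (in ns) and energies (in keV). The function name `cluster_hits_in_time` and the rule "start a new cluster when the gap to the previous hit exceeds the window" are assumptions made for illustration, not the behaviour of any existing LEGEND package.

```python
def cluster_hits_in_time(thit_ns, edep_kev, window_ns=1.0):
    """Group hits whose time gap to the previous hit is <= window_ns.

    Hypothetical helper: returns one (mean_time_ns, [edep, ...]) tuple per cluster.
    """
    hits = sorted(zip(thit_ns, edep_kev))  # order hits by time
    clusters = []  # each entry: [list of times, list of energies]
    for t, e in hits:
        if clusters and t - clusters[-1][0][-1] <= window_ns:
            # close enough to the previous hit: drift together
            clusters[-1][0].append(t)
            clusters[-1][1].append(e)
        else:
            # gap larger than the window: start a new cluster
            clusters.append([[t], [e]])
    # the cluster timestamp is the unweighted mean of its hit times
    return [(sum(ts) / len(ts), es) for ts, es in clusters]


clusters = cluster_hits_in_time(
    [0.0, 0.4, 0.9, 50.0, 50.2],
    [100.0, 200.0, 300.0, 400.0, 460.0],
)
```

With this gap-based rule a chain of closely spaced hits can span more than 1 ns in total; a fixed-width binning would be an equally plausible reading of the proposal.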
2. Stage: mcpss
Output of SSD / FieldGen+SigGen - Input to Electronics+DAQ

Group/Column Name | Data Format (each row) | Dimension | Description | example |
---|---|---|---|---|
det_evt_id | integer | 1 | Detector Event ID of the germanium-detector-event. This event can be built up from multiple MC hits | 1 |
det_id | integer | 1 | Detector ID of the detector | 1 |
chn_id | integer | 1 | Channel ID to which the waveform belongs | 1 |
pos | vector of 3-element-vectors | length | Stores all hit positions from which the waveform was generated | [[0.01, 0.02, 0.014] * u"m"] |
edep | vector of floats | energy | Stores all hit energies from which the waveform was generated | [1460.0 * u"keV"] |
thit | float | time | Timestamp of the germanium-detector-event. This might be the mean of the MC timestamps of the individual hits from which this waveform was generated. This will be used for event building in the next stage. | 124.0 * u"s" |
waveform | LegendWaveformFormat | time and charge | Generated waveform for the respective channel and hits. These waveforms have arbitrary lengths (due to different drift times). They should be in units of charge in my opinion (deposited energy / ionization energy of germanium) | 0:4:20000 * u"ns" and rand(5000) * u"C" |
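
The suggestion to express waveforms in units of charge (deposited energy divided by the ionization energy of germanium) can be sketched as follows. The value of about 2.96 eV per electron-hole pair is an assumption (a commonly quoted figure for germanium near 77 K), and the function name `edep_to_charge` is hypothetical; the thread itself does not fix either.

```python
# Assumed mean energy needed to create one electron-hole pair in germanium.
E_PAIR_GE_EV = 2.96
# Elementary charge in coulombs (exact SI value).
ELEMENTARY_CHARGE_C = 1.602176634e-19


def edep_to_charge(edep_kev, e_pair_ev=E_PAIR_GE_EV):
    """Convert a deposited energy in keV to (number of pairs, charge in C)."""
    n_pairs = edep_kev * 1e3 / e_pair_ev  # keV -> eV, then divide by eV per pair
    return n_pairs, n_pairs * ELEMENTARY_CHARGE_C


# Example: the 1460 keV deposition used in the tables above.
n_pairs, charge_c = edep_to_charge(1460.0)
```

Under these assumptions a 1460 keV deposition corresponds to roughly half a million charge-carrier pairs, so waveform amplitudes in coulombs would be of order 1e-13 C.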
3. Stage: t1pss (mimics tier1 v01.00 real raw data format)

Well, this is already fixed. We could just add some additional information like true_energy.
Hi Mariia (@sagitta42), all,

I'd like to suggest an alternative to the proposed naming convention. I think it would be nice if there were parity between the naming schemes for MC and data. In data we do not have "tier1", and people will wonder whether that means daq (the first form of recorded detector data), raw (the first type of data for analysis), or dsp (if raw is considered "tier 0") until they figure out it's meant to correspond to raw. Then they might be confused again because "mcraw" is something else entirely.

In the data we have the tiers [daq] where daq is in brackets because we anticipate deleting it once the raw tier is generated. So if we have MC generate a file that is meant to be identical in structure to our "raw" data from the detectors, I think that is what we should call "mcraw". Then the dsp file generated from that would be "mcdsp", and so on.

In that case, the file you proposed to call "mcraw" would need a new name. What it contains is "stepping information" from the simulations. I think there is value in keeping the field widths uniform in the file names, so I propose to use the 5-character label "mcstp" for that. Note that "mcpss" conforms to this scheme already. So I suggest using the following names for sim tiers: mcstp where G4 is used to generate mcstp, PSS is used to generate mcpss, electronics + daq sims generate mcraw, and pygama generates the subsequent tiers.

Further, I think we should structure mcpss so that tables from that can be joined row-by-row with tables from mcpss to mchit. In that case, the fields det_id and chn_id are not necessary because the former will be in the hdf5 group name, and the latter will be in the channel map. To build events the mcpss step will have to generate a time coincidence map as well (which may be further modified by the daq sim). I hope to get the version for data prototyped soon so that you can see what it's meant to look like. In the meantime, refer to the data handling doc for details:

Note that the fact that there are two "tiers" of MC data before one gets to "tier 1" for the data shows an example of why it is wise to use names rather than numbers to refer to tiers (thanks @oschulz!).

Best,
Hi Jason (@jasondet), I agree with your comment regarding possible confusion between stp As, in my opinion, PyGama should not know whether the data it is analyzing is simulated or real data. Regarding the structure of My understanding of event building: The fields
Hi @jasondet, We were told to choose the same length for parsing reasons, corresponding to the 5-letter names "tier1", "tier2" and so on. As for the group names, they will mimic those in data, as @lmh91 pointed out. For example, After data format conversion (and in principle not only, we have been talking about a more electronics simulation), we obtain After this point, there is no more simulation to be done. As for
Lukas (@lmh91) -- yes, I would be fine with dropping "mc" from all the names, as long as it's clear elsewhere in a file key / path that the file contains mc, not data. And yes, you are correct, I had missed that in general there is no 1-to-1 mapping between pss and raw. I think putting pss_event_id in the raw file is a good suggestion that should be easy to implement.

As for how event building will be handled in general -- for tiers prior to "evt" we will have a "time coincidence map" (tcm) that basically keeps lists of row numbers that correspond to raw/dsp/hit data from the same event. For the evt tier the data will already appear in a built structure that can be joined with the tcm for linking to the previous-tier data. I hope to have a prototype of the tcm ready soon so that people can see better how it will work.

Mariia (@sagitta42) -- I'm confused by your statement "We were told to choose the same length for parsing reasons, corresponding to the 5-letter names 'tier1', 'tier2' and so on." We do not use the names "tier1", "tier2" etc. in LEGEND. We use the 3-letter names daq, raw, dsp, hit, and evt. See the Data Handling doc I linked in my last post or my Analysis Overview talk at the last CM for details.

I'm curious what data will be in the mctruth and electruth tables. We had envisioned having only one major group per file, but I'm not opposed to adding more tables for mc. However, I would have thought that parameters used in simulations should be in the simulation code / config files or in the database, and one wouldn't need to write them out repeatedly for each row in the output simulation. Maybe I misunderstood the proposal. In my mind the data in the stp and pss tiers -is- the mc truth.
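
As a rough illustration of the time coincidence map idea described above (lists of row numbers that belong to the same event), here is a minimal Python sketch. The function name `build_tcm`, the input layout (a dict mapping channel to a list of timestamps, one per table row), and the coincidence window are all hypothetical; the actual tcm was still to be prototyped at this point in the discussion.

```python
def build_tcm(timestamps_by_channel, window):
    """Group rows from all channels into events by time coincidence.

    Returns a list of events; each event is a list of (channel, row) pairs.
    A new event starts when the gap to the previous entry exceeds `window`.
    """
    # flatten to (time, channel, row) tuples and sort them by time
    entries = sorted(
        (t, chn, row)
        for chn, times in timestamps_by_channel.items()
        for row, t in enumerate(times)
    )
    events = []
    last_t = None
    for t, chn, row in entries:
        if last_t is None or t - last_t > window:
            events.append([])  # gap too large: open a new event
        events[-1].append((chn, row))
        last_t = t
    return events


# Three channels; times in arbitrary but consistent units.
tcm = build_tcm({1: [10.0, 500.0], 2: [10.5, 900.0], 3: [500.2]}, window=1.0)
```

Each inner list can then be used to look up the matching rows in the per-channel tables of a given tier.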
I fully agree with Lukas - we don't want to handle different tier names in the processing pipeline for physical and simulated data.
Hi everyone, I understand now. In that case, I think @jasondet's suggestion is the best: Thank you for the great suggestion!
Hey all, sorry for joining this discussion late, I wasn't following this repo before. I have a few comments/questions, mostly about the steps tier.
Thanks,
Indeed, and yes, one of the first steps in postprocessing of Geant4 (MaGe) output should be generation of realistic time stamps (given event rate parameters as additional input data). That way, we can also simulate things like pile-up, and test the ability of our analysis chain to deal with it.
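
A minimal Python sketch of such a timestamp-generation step, assuming events arrive as a Poisson process with a constant rate (exponentially distributed waiting times). The function names, the rate, and the pile-up window below are illustrative assumptions, not MaGe or LEGEND parameters.

```python
import random


def generate_timestamps(n_events, rate_hz, seed=0):
    """Return n_events increasing timestamps (s) for a constant event rate."""
    rng = random.Random(seed)  # seeded for reproducibility
    t = 0.0
    stamps = []
    for _ in range(n_events):
        t += rng.expovariate(rate_hz)  # waiting time with mean 1 / rate_hz
        stamps.append(t)
    return stamps


def pileup_flags(stamps, window_s):
    """Flag events that start less than window_s after the previous event."""
    if not stamps:
        return []
    return [False] + [b - a < window_s for a, b in zip(stamps, stamps[1:])]


stamps = generate_timestamps(10_000, rate_hz=100.0)
flags = pileup_flags(stamps, window_s=1e-4)
```

The flagged events are candidates for overlapping waveforms, which is one way to test how the analysis chain copes with pile-up.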
Yes, clustering is definitely another pre-processing step, as shown in the LEGEND Julia tutorial. Though we're now also looking into clustering less, for SSD, to have more detailed charge clouds when simulating charge-cloud self-interaction (still early days, and it does of course come with a computational cost).
Yes, we've spoken quite a bit about this in the last pulse-sim call. Mariia and others are currently figuring out what exactly we need to propagate and in which data tiers.
Depending on whether the pulse-sim package accounts for this already (we're trying to teach SSD to do dead-layer effects, and I think siggen can already do this to some extent), there should definitely be a step of optional heuristics like that. Depending on how it's done (on the energy or waveform), it'll need to happen directly before or after pulse simulation, so ideally this will really be part of the pulse-sim packages themselves or the legend-specific wrapper code we'll use to call them.
I guess we should update this :) Currently used names are:
Yes, it shouldn't be something software/product specific, like g4 is
This is quite outdated now. The current format uses three-letter names as in data processing:
The rest are the same as discussed here, and
Follow-up discussion (https://indico.legend-exp.org/event/477/) on the naming convention of the simulation file format, including a detailed definition of what is saved at the different stages and how.