MuonFitter RNNFit model scripts (#290)

* RNNFit model for MuonFitter * Fixed read in of Fit script * moved model files out and added some documentation for model generation * removed extract_partnumber.py and added info to README for MuonFitter. Also changed location of MuonFitter README from MuonFitter/RNNFit/ to MuonFitter/ * Update README.md --------- Co-authored-by: James Minock <jminock@anniegpvm02.fnal.gov> Co-authored-by: marc1uk <marc1uk_@hotmail.com>
ANNIEsoft · Nov 19, 2024 · 88afcec · 88afcec
1 parent b1d0be9
commit 88afcec
Show file tree

Hide file tree

Showing 5 changed files with 392 additions and 0 deletions.
diff --git a/configfiles/MuonFitter/README.md b/configfiles/MuonFitter/README.md
@@ -0,0 +1,43 @@
+#MuonFitter Config
+
+***********************
+#Description
+**********************
+
+Date created: 2024-10-02
+The MuonFitter toolchain makes an attempt to fit muons using hit information.
+
+The Tool has 2 modes. The first mode is pre-reconstruction. It takes input information from the ANNIEEvent and generates a text file containing hit information for the RNN. It is advisable to include minimal tools in this ToolChain, as the same data must be re-analysed with ToolAnalysis later.
+
+This text file produced in this step is then processed by a standalone python script (Fit_data.py), which outputs fitted information into a new text file.
+
+The second mode is reconstruction. In this mode the Tool reads information from the ANNIEEvent, along with both text files from the previous two steps (the first mode and the python script) and reconstructs the vertex based on the fitted paths. The resulting muon fit information is passed into the DataModel.
+
+More detailed instructions are below.
+
+************************
+#Usage
+************************
+
+To generate (train) a model:
+============================
+
+1. Data_prepare.py
+
+2. RNN_train.py
+
+NOTE: Data_prepare.py requires input files that do not yet exist on the ANNIE gpvms; as a result the model cannot at present be re-trained.
+Previously generated models are stored in /pnfs/annie/persistent/simulations/models/MuonFitter/ which may be used.
+
+Please update any paths in the Tool configuration and Fit_data.py accordingly, or copy the appropriate model files to the configfiles/MuonFitter/RNNFit directory.
+
+To analyse data:
+================
+
+1. First, run a ToolChain containing the MuonFitter Tool configured in "RecoMode 0". This will generate a file: ev_ai_eta_R{RUN}.txt with a {RUN} number corresponding to the WCSim run number. You should not include any Tools further along the ToolChain for this step.
+
+2. Second, run "python3 Fit_data.py ev_ai_eta_R{RUN}.txt". This will apply the fitting and generate another textfile to be ran in ToolAnalysis: tanktrackfitfile_r{RUN}_RNN.txt. Again: please update any paths such that all files and models are available.
+
+NOTE: the script RNNFit/rnn_fit.sh can act as a helper to process multiple ev_ai_eta_R{RUN}.txt files. Be sure to update the path in the script to point to your ev_ai_eta text files from step 1.
+
+3. Finally, run a ToolChain containing the MuonFitter Tool configured in "RecoMode 1". Please set the paths for the ev_ai_eta_R{RUN}.txt and tanktrackfitfile_r{RUN}_RNN.txt in the MuonFitter config file accordingly. You may include any downstream tools you desire for further analysis. See the UserTools/MuonFitter/README.md for short descriptions of information saved to the DataModel and how to access them.
diff --git a/configfiles/MuonFitter/RNNFit/Data_prepare.py b/configfiles/MuonFitter/RNNFit/Data_prepare.py
@@ -0,0 +1,49 @@
+# coding: utf-8
+
+# # Import modules
+import pandas as pd
+import json
+
+
+# # Prepare data into pandas DataFrame
+dataX = pd.read_csv("/home/jhe/annie/analysis/Muon_vertex/X.txt",sep=',',header=None,names=['id','ai','eta'])   #ai is track segment
+dataY = pd.read_csv("/home/jhe/annie/analysis/Muon_vertex/Y.txt",sep=',',header=None,names=['id','truetracklen'])
+# dataX['combine'] = dataX[['X','Y']].values.tolist()
+
+
+# ### Preview dataframes
+dataX.head(5)
+dataY.head(5)
+
+
+# # Aggregate data and filter out extremely long and negative tracks
+grouped = dataX.groupby('id').agg(list).reset_index()
+data = pd.merge(grouped, dataY, on='id')
+print("after merge: " + str(len(data)))
+criteria = data['truetracklen'] > 1000
+data = data[~criteria]
+print("after first filter (>1000): " + str(len(data)))
+critiera = data['truetracklen'] < 0
+data = data[~criteria]
+print("after second filter (<0): " + str(len(data)))
+print(data.columns)
+data.head(10)
+
+
+# # Prepare data into json format
+json_data = data.to_json(orient='index')
+
+
+# # Save json data into .json file
+file_path = './data.json'
+with open(file_path, 'w') as json_file:
+    json_file.write(json_data)
+
+
+# # Save pandas DataFrame as csv file
+data.to_csv('./data.csv', index=False)
+
+
+# # Save data into h5 file << USE THIS ONE
+data.to_hdf("./data.h5",key='df')
+
diff --git a/configfiles/MuonFitter/RNNFit/Fit_data.py b/configfiles/MuonFitter/RNNFit/Fit_data.py
@@ -0,0 +1,77 @@
+# coding: utf-8
+import numpy as np
+import pandas as pd
+import torch
+from torch.utils.data import DataLoader
+from torch.utils.data import Dataset
+import torch.nn as nn
+import torch.optim as optim
+from sklearn.model_selection import train_test_split
+import matplotlib.pyplot as plt
+import time
+import sys
+
+if (len(sys.argv) != 2):
+  print(" @@@@@ MISSING ev_ai_eta_R{RUN}.txt FILE !! @@@@@ ")
+  print("   syntax: python3 Fit_data.py ev_ai_eta_R{RUN}.txt")
+  print("   path: ~/annie/analysis/FILE")
+  exit(-1)
+
+DATAFILE = sys.argv[-1]
+
+# ## Extract run number from filename
+# ## NOTE: Assumes filename has this structure: ev_ai_eta_R{RUN}.txt
+RUN = DATAFILE[12:-4]
+
+# ## Define ManyToOneRNN class
+class ManyToOneRNN(nn.Module):
+    def __init__(self, input_size, hidden_size, num_layers, output_size):
+        super(ManyToOneRNN, self).__init__()
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True,nonlinearity='relu')
+        self.fc = nn.Linear(hidden_size, output_size)
+
+    def forward(self, x):
+        # Initialize hidden state with zeros
+        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
+        # Forward propagate the RNN
+        out, _ = self.rnn(x,h0)
+
+        # Decode the hidden state of the last time step
+        out = self.fc(out[:, -1, :])
+        return out
+
+
+# ## Load model
+model = torch.load('model.pth')
+model.eval()
+
+
+# ## Load data (needs to be in Tensor format)
+data = pd.read_csv(DATAFILE, header=None, names=['evid','cluster_time','ai','eta'])
+print(data.head(5))
+
+data = data.groupby(['evid', 'cluster_time']).agg(list).reset_index()
+print(data.head(5))
+
+print(data.iloc[0,2:])
+
+# ## Do the fit
+# open output file
+OUTFILENAME = "tanktrackfitfile_r" + RUN + "_RNN.txt"
+out_f = open(OUTFILENAME, "a")
+
+for idx in range(len(data)):
+    dataT = torch.tensor(data.iloc[idx,2:]).t()
+    dataT.unsqueeze_(0)
+    out = model(dataT)
+    print(data.iloc[idx,0], out)
+    out_f.write(str(data.iloc[idx,0]) + "," + str(data.iloc[idx,1]) + "," + str(out.data.numpy()[0][0]) + "\n")
+
+# close output file
+out_f.close()
+
+############################################################
+############### EOF 
+############################################################
diff --git a/configfiles/MuonFitter/RNNFit/RNN_train.py b/configfiles/MuonFitter/RNNFit/RNN_train.py
@@ -0,0 +1,214 @@
+# coding: utf-8
+
+# ## Import modules
+import numpy as np
+import pandas as pd
+import torch
+from torch.utils.data import DataLoader
+from torch.utils.data import Dataset
+import torch.nn as nn
+import torch.optim as optim
+from sklearn.model_selection import train_test_split
+import matplotlib.pyplot as plt
+import time
+
+
+# ## Load data, using half of the dataset to train
+data = pd.read_hdf("data.h5", 'df')
+train_df, test_df = train_test_split(data, test_size=0.5)
+test_df, CV_df = train_test_split(test_df, test_size=0.5)
+# validation used to keep track of training set; monitor accuracy
+
+# ### Preview data
+print("len train_df: " + str(len(train_df)))
+train_df.head(5)
+
+print("len test_df: " + str(len(test_df)))
+test_df.head(5)
+
+print("len CV_df: " + str(len(CV_df)))
+CV_df.head(5)
+
+print(data[data['truetracklen'] < 0.])
+
+
+# ## Define MyDataset class
+class MyDataset(Dataset):
+    def __init__(self, dataframe):
+        self.data = dataframe
+
+    def __len__(self):
+        return len(self.data)
+
+    def __getitem__(self, idx):
+        evid = self.data.iloc[idx,0]    #added to include evid in output file
+        features = torch.tensor(self.data.iloc[idx, 1:-1], dtype=torch.float32)
+        features = features.t()
+        target = torch.tensor(self.data.iloc[idx, -1], dtype=torch.float32)
+        return evid, features, target   #added evid as return value
+
+# ### Prepare data for training
+batch_size = 1 # Adjust batch size as needed
+# sequence_length = 3  # Adjust sequence length as needed
+shuffle = True  # Shuffle the data during training (recommended)
+dataS = MyDataset(data)
+train = MyDataset(train_df)
+test = MyDataset(test_df)
+CVS = MyDataset(CV_df)
+trainloader = DataLoader(train, batch_size=1, shuffle=shuffle) #how much data to train per epoch
+testloader = DataLoader(test)
+dataloader = DataLoader(dataS, batch_size=1, shuffle=shuffle)
+CVloader = DataLoader(CVS)
+
+print(dataloader)
+
+
+# ## Define ManyToOneRNN class
+class ManyToOneRNN(nn.Module):
+    def __init__(self, input_size, hidden_size, num_layers, output_size):
+        super(ManyToOneRNN, self).__init__()
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True,nonlinearity='relu')
+        self.fc = nn.Linear(hidden_size, output_size)
+
+    def forward(self, x):
+        # Initialize hidden state with zeros
+        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
+        # Forward propagate the RNN
+        out, _ = self.rnn(x,h0)
+
+        # Decode the hidden state of the last time step
+        out = self.fc(out[:, -1, :])
+        return out
+
+# ### Set parameters for training model
+cost_list = []
+CVcost_list = []
+input_size = 2
+hidden_size = 4
+num_layers = 1
+output_size = 1
+learning_rate = 0.001   #can tune this for better model
+num_epochs = 10000   #default:10000
+
+model = ManyToOneRNN(input_size, hidden_size, num_layers, output_size)
+criterion = nn.MSELoss()
+optimizer = optim.Adam(model.parameters(), lr=learning_rate)
+
+N = 100    #for printing progress
+
+
+# ## Define training
+def train():
+    model.train()
+
+    for epoch in range(num_epochs):
+        COST=0
+        CVCost = 0
+        i = 0
+        for ev,dat,target in trainloader:  # Iterate in batches over the training dataset.; EDIT: added ev
+            out = model(dat)  # Perform a single forward pass.
+            loss = criterion(out, target)  # Compute the loss.
+            loss.backward()  # Derive gradients.
+            optimizer.step()  # Update parameters based on gradients.
+            optimizer.zero_grad()  # Clear gradients.
+            COST += loss.data
+            i += 1
+            # if epoch % 100 == 0 and i ==1 :
+                # print("loss is {}".format(loss.data))
+                # print("target value is {}".format(target))
+                # print("out is {}".format(out))
+
+        cost_list.append(COST)
+
+        #perform a prediction on the validation  data
+        for CVID,CVD,CVT in CVloader:
+
+            with torch.no_grad():
+               CVout = model(CVD)
+            loss = criterion(CVout, CVT)
+
+            CVCost += loss
+
+        CVcost_list.append(CVCost)
+
+        if epoch%N == 0:
+            print("epoch number:{}".format(epoch))
+            print("validation MSE is {}".format(COST))
+            print("train MSE is {}".format(CVCost))
+
+
+# ## Train model
+tstart=time.time()
+print("start time={}".format(tstart))
+train()
+torch.save(model, 'model.pth')    # save model
+
+print((time.time()-tstart))
+
+
+# ### Plot the loss and accuracy
+fig, ax1 = plt.subplots()
+color = 'tab:red'
+ax1.plot(cost_list[200::100], color=color,label="Train")
+ax1.plot(CVcost_list[200::100], color='tab:blue',label="Validation")
+ax1.set_xlabel('epoch', color=color)
+ax1.set_ylabel('Cost', color=color)
+ax1.tick_params(axis='y', color=color)
+ax1.legend()
+plt.savefig("loss.png", dpi=300)
+
+# want red and blue to be close and cost to be low
+
+
+# ## Test model
+def test(loader):
+    model.eval()
+
+    #open output file for fitted track length
+    out_f = open("fitbyeye_wcsim_RNN.txt", "a")
+
+    diff_list = []
+    for evid,data,target in loader:  # Iterate in batches over the training/test dataset.
+        with torch.no_grad():
+            out = model(data)   # just need this for data
+            diff = out - target  # Use the class with highest probability.
+            #print(evid[0], out.data.numpy()[0][0], target.data.numpy()[0])   #evid, fit, truelen
+            #print(diff)
+            out_f.write(str(evid[0]) + "," + str(out.data.numpy()[0][0]) + "\n")
+            diff_list.append(diff.data.numpy()[0][0])  # Check against ground-truth labels.
+            # diff_list.append(diff.data.numpy())  # Check against ground-truth labels.
+    out_f.close()
+    return diff_list  # Derive ratio of correct predictions.
+
+diff_list = test(dataloader)
+
+
+# ### Plot difference btwn model fit and truth info
+plt.hist(diff_list)
+
+mean = np.mean(diff_list)
+std = np.std(diff_list)
+
+# custom_labels = ['Mean is {}'.format(mean), 'Std is {}'.format(std)]
+
+# Add a legend with custom labels
+plt.text(20, 40, f'Mean: {mean:.2f}', fontsize=12, color='red')
+plt.text(20, 35, f'Std: {std:.2f}', fontsize=12, color='green')
+
+
+plt.xlabel("y - yhat")
+plt.ylabel("Number of Event")
+plt.title("RNN Muon Vetex Reconstruction Performance")
+# plt.legend(custom_labels)
+plt.savefig("RNN.png")
+
+print("mean: ",mean)
+print("std: " + str(std))
+
+
+# ## Save the model
+#torch.save or model.save
+#load model in another script and just use
+
diff --git a/configfiles/MuonFitter/RNNFit/rnn_fit.sh b/configfiles/MuonFitter/RNNFit/rnn_fit.sh
@@ -0,0 +1,9 @@
+#!/bin/sh
+
+filelist="/home/jhe/annie/analysis/flist.txt"
+
+while read -r file
+do
+  echo "python3 Fit_data.py ${file}"
+  python3 Fit_data.py ${file}
+done < $filelist