MuonFitter RNNFit model scripts (#290)
* RNNFit model for MuonFitter

* Fixed read in of Fit script

* moved model files out and added some documentation for model generation

* removed extract_partnumber.py and added info to README for MuonFitter. Also changed location of MuonFitter README from MuonFitter/RNNFit/ to MuonFitter/

* Update README.md

---------

Co-authored-by: James Minock <jminock@anniegpvm02.fnal.gov>
Co-authored-by: marc1uk <marc1uk_@hotmail.com>
3 people authored Nov 19, 2024
1 parent b1d0be9 commit 88afcec
Showing 5 changed files with 392 additions and 0 deletions.
43 changes: 43 additions & 0 deletions configfiles/MuonFitter/README.md
@@ -0,0 +1,43 @@
#MuonFitter Config

***********************
#Description
**********************

Date created: 2024-10-02
The MuonFitter ToolChain attempts to fit muons using hit information.

The Tool has two modes. The first mode is pre-reconstruction: it takes input information from the ANNIEEvent and generates a text file containing hit information for the RNN. It is advisable to include as few Tools as possible in this ToolChain, as the same data must be re-analysed with ToolAnalysis later.

The text file produced in this step is then processed by a standalone Python script (Fit_data.py), which writes the fit results to a new text file.

The second mode is reconstruction. In this mode the Tool reads information from the ANNIEEvent, along with both text files from the previous two steps (the first mode and the Python script), and reconstructs the vertex based on the fitted paths. The resulting muon fit information is passed into the DataModel.
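
As a rough illustration of the hand-off between the first mode and Fit_data.py: the script reads the text file as comma-separated rows with the columns evid, cluster_time, ai, eta, and collects the rows belonging to each (evid, cluster_time) pair into one sequence. The sketch below mimics that grouping using only the standard library. The function name `group_hits` and the sample values are hypothetical; the real script uses pandas, which additionally sorts the groups by key.

```python
import csv
import io

def group_hits(lines):
    """Group 'evid,cluster_time,ai,eta' rows into per-cluster sequences.

    Returns a dict mapping (evid, cluster_time) to a list of (ai, eta)
    pairs, in the order the rows appear. The keys are kept as strings so
    that no assumption is made about how event IDs are formatted.
    """
    groups = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        evid, cluster_time, ai, eta = row
        groups.setdefault((evid, cluster_time), []).append((float(ai), float(eta)))
    return groups

# Made-up sample content, for illustration only.
sample = io.StringIO("1,10.0,25.0,3.1\n1,10.0,50.0,2.8\n2,12.5,25.0,4.0\n")
for key, sequence in group_hits(sample).items():
    print(key, sequence)
```

Each resulting sequence corresponds to one input that the RNN turns into a single fitted value.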

More detailed instructions are below.

************************
#Usage
************************

To generate (train) a model:
============================

1. Data_prepare.py

2. RNN_train.py

NOTE: Data_prepare.py requires input files that do not yet exist on the ANNIE gpvms; as a result, the model cannot be re-trained at present.
Previously generated models are stored in /pnfs/annie/persistent/simulations/models/MuonFitter/ and may be used instead.

Please update any paths in the Tool configuration and Fit_data.py accordingly, or copy the appropriate model files to the configfiles/MuonFitter/RNNFit directory.
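
For orientation, the selection that Data_prepare.py is meant to apply before training (per its comments, removing extremely long and negative tracks) can be sketched without pandas as follows. The helper name `keep_track` and the example events are hypothetical; only the two cut values (0 and 1000) are taken from the script.

```python
def keep_track(true_track_len, max_len=1000.0):
    """Return True if the true track length passes both cuts.

    Tracks longer than max_len and tracks with negative length are
    rejected; the boundary values themselves are kept.
    """
    return 0.0 <= true_track_len <= max_len

# Made-up events, for illustration only.
events = [
    {"id": 1, "truetracklen": 250.0},
    {"id": 2, "truetracklen": -5.0},    # rejected: negative
    {"id": 3, "truetracklen": 1500.0},  # rejected: too long
]
selected = [ev for ev in events if keep_track(ev["truetracklen"])]
print([ev["id"] for ev in selected])
```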

To analyse data:
================

1. First, run a ToolChain containing the MuonFitter Tool configured in "RecoMode 0". This will generate a file: ev_ai_eta_R{RUN}.txt with a {RUN} number corresponding to the WCSim run number. You should not include any Tools further along the ToolChain for this step.

2. Second, run "python3 Fit_data.py ev_ai_eta_R{RUN}.txt". This will apply the fit and generate another text file to be used in ToolAnalysis: tanktrackfitfile_r{RUN}_RNN.txt. Again, please update any paths so that all files and models are available.

NOTE: the script RNNFit/rnn_fit.sh can act as a helper to process multiple ev_ai_eta_R{RUN}.txt files. Be sure to update the path in the script to point to your ev_ai_eta text files from step 1.

3. Finally, run a ToolChain containing the MuonFitter Tool configured in "RecoMode 1". Please set the paths for the ev_ai_eta_R{RUN}.txt and tanktrackfitfile_r{RUN}_RNN.txt in the MuonFitter config file accordingly. You may include any downstream tools you desire for further analysis. See the UserTools/MuonFitter/README.md for short descriptions of information saved to the DataModel and how to access them.
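
When processing many runs, it can help to check up front which output file each input will produce. The sketch below maps an ev_ai_eta_R{RUN}.txt path to the matching tanktrackfitfile_r{RUN}_RNN.txt name. It is an assumption-laden helper, not part of the Tool: `fit_output_name` is a hypothetical function, and the regular expression is an alternative to the way Fit_data.py extracts the run number (the script slices the file name at fixed positions).

```python
import os
import re

# Expected input name: ev_ai_eta_R{RUN}.txt (a directory prefix is allowed).
_INPUT_PATTERN = re.compile(r"^ev_ai_eta_R(?P<run>.+)\.txt$")

def fit_output_name(input_path):
    """Return the tanktrackfitfile name that corresponds to input_path."""
    match = _INPUT_PATTERN.match(os.path.basename(input_path))
    if match is None:
        raise ValueError(f"unexpected file name: {input_path}")
    return f"tanktrackfitfile_r{match.group('run')}_RNN.txt"

# Example paths are made up.
for path in ["ev_ai_eta_R42.txt", "/some/dir/ev_ai_eta_R7.txt"]:
    print(path, "->", fit_output_name(path))
```

Because the directory is stripped before matching, the helper gives the same answer whether the file list contains bare names or full paths.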
49 changes: 49 additions & 0 deletions configfiles/MuonFitter/RNNFit/Data_prepare.py
@@ -0,0 +1,49 @@
# coding: utf-8

# # Import modules
import pandas as pd
import json


# # Prepare data into pandas DataFrame
dataX = pd.read_csv("/home/jhe/annie/analysis/Muon_vertex/X.txt",sep=',',header=None,names=['id','ai','eta']) #ai is track segment
dataY = pd.read_csv("/home/jhe/annie/analysis/Muon_vertex/Y.txt",sep=',',header=None,names=['id','truetracklen'])
# dataX['combine'] = dataX[['X','Y']].values.tolist()


# ### Preview dataframes
print(dataX.head(5))
print(dataY.head(5))


# # Aggregate data and filter out extremely long and negative tracks
grouped = dataX.groupby('id').agg(list).reset_index()
data = pd.merge(grouped, dataY, on='id')
print("after merge: " + str(len(data)))
criteria = data['truetracklen'] > 1000
data = data[~criteria]
print("after first filter (>1000): " + str(len(data)))
criteria = data['truetracklen'] < 0
data = data[~criteria]
print("after second filter (<0): " + str(len(data)))
print(data.columns)
print(data.head(10))


# # Prepare data into json format
json_data = data.to_json(orient='index')


# # Save json data into .json file
file_path = './data.json'
with open(file_path, 'w') as json_file:
    json_file.write(json_data)


# # Save pandas DataFrame as csv file
data.to_csv('./data.csv', index=False)


# # Save data into h5 file << USE THIS ONE
data.to_hdf("./data.h5",key='df')

77 changes: 77 additions & 0 deletions configfiles/MuonFitter/RNNFit/Fit_data.py
@@ -0,0 +1,77 @@
# coding: utf-8
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import time
import sys

if (len(sys.argv) != 2):
    print(" @@@@@ MISSING ev_ai_eta_R{RUN}.txt FILE !! @@@@@ ")
    print(" syntax: python3 Fit_data.py ev_ai_eta_R{RUN}.txt")
    print(" path: ~/annie/analysis/FILE")
    exit(-1)

DATAFILE = sys.argv[-1]

# ## Extract run number from filename
# ## NOTE: Assumes filename has this structure: ev_ai_eta_R{RUN}.txt
RUN = DATAFILE[12:-4]

# ## Define ManyToOneRNN class
class ManyToOneRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(ManyToOneRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True,nonlinearity='relu')
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        # Forward propagate the RNN
        out, _ = self.rnn(x,h0)

        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out


# ## Load model
model = torch.load('model.pth')
model.eval()


# ## Load data (needs to be in Tensor format)
data = pd.read_csv(DATAFILE, header=None, names=['evid','cluster_time','ai','eta'])
print(data.head(5))

data = data.groupby(['evid', 'cluster_time']).agg(list).reset_index()
print(data.head(5))

print(data.iloc[0,2:])

# ## Do the fit
# open output file
OUTFILENAME = "tanktrackfitfile_r" + RUN + "_RNN.txt"
out_f = open(OUTFILENAME, "a")

for idx in range(len(data)):
    dataT = torch.tensor(data.iloc[idx,2:]).t()
    dataT.unsqueeze_(0)
    out = model(dataT)
    print(data.iloc[idx,0], out)
    out_f.write(str(data.iloc[idx,0]) + "," + str(data.iloc[idx,1]) + "," + str(out.data.numpy()[0][0]) + "\n")

# close output file
out_f.close()

############################################################
############### EOF
############################################################
214 changes: 214 additions & 0 deletions configfiles/MuonFitter/RNNFit/RNN_train.py
@@ -0,0 +1,214 @@
# coding: utf-8

# ## Import modules
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import time


# ## Load data, using half of the dataset to train
data = pd.read_hdf("data.h5", 'df')
train_df, test_df = train_test_split(data, test_size=0.5)
test_df, CV_df = train_test_split(test_df, test_size=0.5)
# validation used to keep track of training set; monitor accuracy

# ### Preview data
print("len train_df: " + str(len(train_df)))
print(train_df.head(5))

print("len test_df: " + str(len(test_df)))
print(test_df.head(5))

print("len CV_df: " + str(len(CV_df)))
print(CV_df.head(5))

print(data[data['truetracklen'] < 0.])


# ## Define MyDataset class
class MyDataset(Dataset):
    def __init__(self, dataframe):
        self.data = dataframe

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        evid = self.data.iloc[idx,0] #added to include evid in output file
        features = torch.tensor(self.data.iloc[idx, 1:-1], dtype=torch.float32)
        features = features.t()
        target = torch.tensor(self.data.iloc[idx, -1], dtype=torch.float32)
        return evid, features, target #added evid as return value

# ### Prepare data for training
batch_size = 1 # Adjust batch size as needed
# sequence_length = 3 # Adjust sequence length as needed
shuffle = True # Shuffle the data during training (recommended)
dataS = MyDataset(data)
train = MyDataset(train_df)
test = MyDataset(test_df)
CVS = MyDataset(CV_df)
trainloader = DataLoader(train, batch_size=1, shuffle=shuffle) #how much data to train per epoch
testloader = DataLoader(test)
dataloader = DataLoader(dataS, batch_size=1, shuffle=shuffle)
CVloader = DataLoader(CVS)

print(dataloader)


# ## Define ManyToOneRNN class
class ManyToOneRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(ManyToOneRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True,nonlinearity='relu')
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        # Forward propagate the RNN
        out, _ = self.rnn(x,h0)

        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out

# ### Set parameters for training model
cost_list = []
CVcost_list = []
input_size = 2
hidden_size = 4
num_layers = 1
output_size = 1
learning_rate = 0.001 #can tune this for better model
num_epochs = 10000 #default:10000

model = ManyToOneRNN(input_size, hidden_size, num_layers, output_size)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

N = 100 #for printing progress


# ## Define training
def train():
    model.train()

    for epoch in range(num_epochs):
        COST = 0
        CVCost = 0
        i = 0
        for ev,dat,target in trainloader: # Iterate in batches over the training dataset.; EDIT: added ev
            out = model(dat) # Perform a single forward pass.
            loss = criterion(out, target) # Compute the loss.
            loss.backward() # Derive gradients.
            optimizer.step() # Update parameters based on gradients.
            optimizer.zero_grad() # Clear gradients.
            COST += loss.item() # Accumulate the training loss as a plain float.
            i += 1
            # if epoch % 100 == 0 and i ==1 :
            #     print("loss is {}".format(loss.data))
            #     print("target value is {}".format(target))
            #     print("out is {}".format(out))

        cost_list.append(COST)

        #perform a prediction on the validation data
        for CVID,CVD,CVT in CVloader:

            with torch.no_grad():
                CVout = model(CVD)
                loss = criterion(CVout, CVT)

            CVCost += loss.item()

        CVcost_list.append(CVCost)

        if epoch%N == 0:
            print("epoch number:{}".format(epoch))
            print("train MSE is {}".format(COST))
            print("validation MSE is {}".format(CVCost))


# ## Train model
tstart=time.time()
print("start time={}".format(tstart))
train()
torch.save(model, 'model.pth') # save model

print((time.time()-tstart))


# ### Plot the training and validation loss
fig, ax1 = plt.subplots()
color = 'tab:red'
ax1.plot(cost_list[200::100], color=color,label="Train")
ax1.plot(CVcost_list[200::100], color='tab:blue',label="Validation")
ax1.set_xlabel('epoch', color=color)
ax1.set_ylabel('Cost', color=color)
ax1.tick_params(axis='y', color=color)
ax1.legend()
plt.savefig("loss.png", dpi=300)

# want red and blue to be close and cost to be low


# ## Test model
def test(loader):
    model.eval()

    #open output file for fitted track length
    out_f = open("fitbyeye_wcsim_RNN.txt", "a")

    diff_list = []
    for evid,data,target in loader: # Iterate over the given dataset, one event at a time.
        with torch.no_grad():
            out = model(data) # fitted track length; just need this for data
        diff = out - target # Residual: fitted minus true track length.
        #print(evid[0], out.data.numpy()[0][0], target.data.numpy()[0]) #evid, fit, truelen
        #print(diff)
        out_f.write(str(evid[0]) + "," + str(out.data.numpy()[0][0]) + "\n")
        diff_list.append(diff.data.numpy()[0][0]) # Store the residual for this event.
        # diff_list.append(diff.data.numpy())
    out_f.close()
    return diff_list # Residuals for every event in the loader.

diff_list = test(dataloader)


# ### Plot difference btwn model fit and truth info
plt.hist(diff_list)

mean = np.mean(diff_list)
std = np.std(diff_list)

# custom_labels = ['Mean is {}'.format(mean), 'Std is {}'.format(std)]

# Add a legend with custom labels
plt.text(20, 40, f'Mean: {mean:.2f}', fontsize=12, color='red')
plt.text(20, 35, f'Std: {std:.2f}', fontsize=12, color='green')


plt.xlabel("yhat - y")
plt.ylabel("Number of Events")
plt.title("RNN Muon Vertex Reconstruction Performance")
# plt.legend(custom_labels)
plt.savefig("RNN.png")

print("mean: ",mean)
print("std: " + str(std))


# ## Save the model
#torch.save or model.save
#load model in another script and just use

9 changes: 9 additions & 0 deletions configfiles/MuonFitter/RNNFit/rnn_fit.sh
@@ -0,0 +1,9 @@
#!/bin/sh

filelist="/home/jhe/annie/analysis/flist.txt"

while read -r file
do
    echo "python3 Fit_data.py ${file}"
    python3 Fit_data.py "${file}"
done < "$filelist"
