
Commit f2dc9ad

ruiyiwlwaekfjlk authored and committed
Organize llm selftrain and update README (#163)
* Organize data generation and update README for release (#157)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#159, #160, #161)
* Organize llm selftrain
* Merge main to modular-codebase (#162)

These merges build on the following shared history:

* Modularize data generation (#144)
  * revamped cloud sync and finished the cloud utils
  * modified the cloud utils and added a monitor function for downloading
  * added requirements; tests still needed
  * finished training on babel
  * added selftrain scripts for deploy and eval on babel
  * modularized the self-train improve step and the data generation
  * updated README.md with run instructions for scenario generation, updated the documented arguments, and reordered the README
  * added system args, then switched from sys.argv to argparse (a hedged sketch follows this list)
  * deleted test functions
  * moved the scenario-generation code out, moved files, and deleted unused files
  * reset paths
  * made sure SFT scenarios have 10 agent combos
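The argparse switch is presumably along these lines; the flag names and defaults below are illustrative, not the script's actual interface:

    # Hedged sketch of the sys.argv -> argparse switch; "--num-scenarios"
    # and "--output-dir" are hypothetical flags, not the real ones.
    import argparse

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(description="Generate SFT scenarios")
        parser.add_argument("--num-scenarios", type=int, default=10,
                            help="scenarios to generate per agent combo")
        parser.add_argument("--output-dir", type=str, required=True,
                            help="where to write the generated scenarios")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(f"generating {args.num_scenarios} scenarios into {args.output_dir}")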
* Feature: full-parameter, LoRA, and QLoRA finetune scripts for Mistral (#126) (a config sketch follows below)
  * added an S3 script
  * reached a working version of each script, plus minor changes
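A QLoRA setup for Mistral typically pairs a 4-bit quantization config with a LoRA adapter config, roughly as below; the hyperparameters are placeholders, not the values in the repo's scripts:

    # Hedged sketch of a QLoRA finetune setup for Mistral-7B with
    # transformers + peft; r, alpha, and target modules are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize base weights to 4 bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
    )
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)  # only adapter weights train
    model.print_trainable_parameters()

Dropping the quantization config and keeping only the LoraConfig gives the plain LoRA variant; full-parameter finetuning skips peft entirely.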
* Support oTree-based human eval (#145)
  * added the initial version of the oTree-based human eval and deleted the game
  * added payment info and modified the instruction page and its style
  * supported user ID matching and fixed the deployment file code
  * supported timers, including multiple timers on one page, and changed the time limit
  * added a reasoning string field and a prompt for reasoning
  * supported queue-based data; added personal information fields and polished the front end
  * renamed the directory and the name in the URL
  * moved the next button to the middle
  * debugged reward-prompt regex matching and fixed the queue-popping time bug
  * supported Prolific ID input
  * polished the front-end style for multiple choices and deleted the profile PNG
  * supported pilot study data and format; split the pilot study and the official study, each with its own thank-you page
  * released the pilot study
  * fixed the same-input bug and other front-end bugs
  * set the debug mode
* Feature: support scripts for env fetching from the db (#147) (a hedged sketch follows below)
  * got envs from Redis
  * supported scripts for fetching envs from the db
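Fetching envs from Redis might look like the following; the "env:*" key pattern and JSON payloads are assumptions, not the repo's actual schema:

    # Hedged sketch of fetching environment profiles from Redis; the key
    # pattern and JSON payload format are assumed, not the repo's schema.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def fetch_envs(pattern: str = "env:*") -> list[dict]:
        envs = []
        for key in r.scan_iter(match=pattern):
            payload = r.get(key)
            if payload:
                envs.append(json.loads(payload))
        return envs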
* Pick qualified annotators for human evaluation & support the official human eval test (#148)
  * added human eval analysis for the qualification test
  * fixed a bug in the random choice and a bug in incomplete filling
  * cleaned unused timer elements and set debug to false
  * updated the pilot study payment info and deleted debug code, ready to publish the official study
* Bug: fix official human eval study bugs (#149)
  * released the new official study with new data
  * fixed official study bugs
* Feature: support official human eval analysis and delete the sotopia_tmp file (#150)
  * supported Pearson correlation analysis
  * deleted the tmp file
* Add README img (#152)
* Feature: support official human eval mean + correlation analysis (#151) (a correlation sketch follows below)
  * supported the official distribution of the human eval
  * added official test analysis and supported the full analysis code
  * deleted unused pieces
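The Pearson correlation analysis in #150/#151 reduces to something like this; the score lists are placeholders for the human and model ratings:

    # Sketch of a Pearson correlation between paired score lists
    # (e.g., human vs. model ratings); the numbers are placeholders.
    from scipy.stats import pearsonr

    human_scores = [3.0, 4.5, 2.0, 5.0, 3.5]
    model_scores = [2.5, 4.0, 2.5, 4.5, 3.0]

    r, p = pearsonr(human_scores, model_scores)
    print(f"Pearson r = {r:.3f}, p = {p:.3f}")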
* Feature: finalize human evaluation (#153)
  * finalized the final round of human eval and collected all results
  * added all the code and data used for paper testing, then cleaned the code
* Delete the together-ai-ft part (#154)
* Feature: support paired t-test (#155) (a hedged sketch follows below)
  * supported the t-test and fixed None scenarios in the final human eval data
  * fully supported paired t-testing between all model pairs
  * later deleted the paired t-test
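The paired t-test between model pairs presumably follows the standard scipy pattern; the per-scenario scores below are placeholders:

    # Sketch of a paired t-test over two models' scores on the same
    # scenarios; ttest_rel pairs the observations index-by-index.
    from scipy.stats import ttest_rel

    model_a = [3.2, 4.1, 2.8, 4.6, 3.9]  # per-scenario scores, model A
    model_b = [2.9, 3.8, 3.0, 4.2, 3.5]  # same scenarios, model B

    t_stat, p = ttest_rel(model_a, model_b)
    print(f"t = {t_stat:.3f}, p = {p:.3f}")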
* Add code for the human eval plot in the paper (#156)
* Update README.md

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
1 parent 078b9bc commit f2dc9ad

16 files changed, +544 -52 lines changed

llm_self_train/README.md (+2 -1)

@@ -1,3 +1,4 @@
+# Training (BC and/or SR) Pipeline
 ## Preparations
 ### Modify `config.yml`
 1. Change `experiment_name` and `mkdir experiment_name` in `checkpoint_dir`. Make sure the starting checkpoint and base Mistral model is under `experiment_name` folder.
@@ -16,7 +17,7 @@
 ## Run Code
 1. Activate conda: `conda activate myenv`
 2. Run `python3 monitor_and_submit.py`
-2. Open a separate terminal and activate conda. Run `sbatch --gres=gpu:4 --mem=80g -t 1-00:00:00 -o train.out -e train.err train.sh`
+2. Open a separate terminal and activate conda. Run `sbatch train.sbatch`
 
 
 ## Comments
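
The updated instructions boil the run down to two terminals. A minimal transcript of that workflow, assuming the `myenv` environment from the README and assuming (not stated in the diff) that `train.sbatch` now carries the resource flags previously passed on the command line:

# terminal 1: the monitor that watches logs and submits/cancels deploy jobs
conda activate myenv
python3 monitor_and_submit.py

# terminal 2 (assumption: train.sbatch wraps the old --gres/--mem/-t flags)
conda activate myenv
sbatch train.sbatch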

llm_self_train/check_episodes.py (+36)

@@ -0,0 +1,36 @@
+import argparse
+import os
+import json
+os.environ[
+    "REDIS_OM_URL"
+] = "redis://:password@server_name:port_num"
+from sotopia.database.logs import EpisodeLog
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--tag", type=str, required=True)
+    parser.add_argument("--env-ids", type=str, required=True)
+    args = parser.parse_args()
+
+    eps = list(EpisodeLog.find(EpisodeLog.tag == args.tag))
+    with open("resources/env_ids.json", 'r') as f:
+        envs = json.loads(f.read())[args.env_ids]
+
+    for env in envs:
+        eps_per_env = list(EpisodeLog.find(EpisodeLog.tag == args.tag,
+                                           EpisodeLog.environment == env))
+        print(len(eps_per_env))
+
+
+    count = 0
+    print(len(eps))
+    for i in range(len(eps)):
+        if eps[i].rewards == [0.0, 0.0]:
+            print(i, end=', ')
+            count += 1
+            EpisodeLog.delete(pk=eps[i].pk)
+    print(count)
+
+
+if __name__ == "__main__":
+    main()
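
check_episodes.py reports the episode count per environment for a tag and then permanently deletes any episode whose rewards are exactly [0.0, 0.0] (a run that never received scores). Since the deletion is irreversible, a dry run first can be worthwhile; a minimal sketch using the same EpisodeLog query as the script, where count_failed_episodes is a hypothetical helper not part of this commit:

import os
os.environ["REDIS_OM_URL"] = "redis://:password@server_name:port_num"

from sotopia.database.logs import EpisodeLog


def count_failed_episodes(tag: str) -> int:
    # hypothetical dry-run helper: count zero-reward episodes without deleting them
    eps = list(EpisodeLog.find(EpisodeLog.tag == tag))
    return sum(1 for ep in eps if ep.rewards == [0.0, 0.0])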

llm_self_train/config.yml (+7 -5)

@@ -1,23 +1,25 @@
 # self train
 babel_username: ruiyiwan
-experiment_name: base-sft-round-2
+experiment_name: selftrain-sft-round-2-filtered-top-2
 num_improve_steps: 1
 script_dir: /home/ruiyiwan/sotopia-llm/llm_self_train
 checkpoint_dir: /data/tir/projects/tir6/bisk/ruiyiwan/selftrain
-checkpoint_saved_queue: /home/ruiyiwan/sotopia-llm/llm_self_train/logs/base-sft-round-2/deploy_queue.txt
+checkpoint_saved_queue: /home/ruiyiwan/sotopia-llm/llm_self_train/logs/selftrain-sft-round-2-filtered-top-2/deploy_queue.txt
 num_train_epochs: 20.0
 call_back_save_epochs: 1
 
 # training
 num_gpus: 4
-model_name_or_path: /data/tir/projects/tir6/bisk/ruiyiwan/selftrain/base-sft-round-2/Mistral-7B-v0.1
+model_name_or_path: /data/tir/projects/tir6/bisk/ruiyiwan/selftrain/selftrain-sft-round-2-filtered-top-2/checkpoint_init_epoch-3
 hf_auth_token: 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
 wandb_project: self-train
-wandb_tags: "['base-mistral-sft-round-2']"
+wandb_tags: "['selftrain-sft-round-2-filtered-top-2']"
 wandb_token: eca44f65849afa1cc146c22631b0b5001ccd24d7
 
 # deploy and eval: check resources/env_ids.json
-eval_env_ids_tag: pilot-3_dev
+eval_env_ids_tag: sotopia_hard_env_id
+multiturn_eval: True
+dev: False
 
 # redis
 redis_om_url: redis://:password@server_name:port_num
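
The notable additions are the multiturn_eval and dev flags and the switch of eval_env_ids_tag to the sotopia_hard_env_id subset. The pipeline scripts read this file with yaml.safe_load (see monitor_and_submit.py below), so the flags arrive as plain Python booleans; an illustrative read, with the print purely for demonstration:

import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

# multiturn_eval and dev are the keys added in this commit
if config["multiturn_eval"] and not config["dev"]:
    print(f"multi-turn eval on env set: {config['eval_env_ids_tag']}")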

llm_self_train/eval_score.py (+139)

@@ -0,0 +1,139 @@
+import argparse
+import os
+os.environ[
+    "REDIS_OM_URL"
+] = "redis://:password@server_name:port_num"
+from sotopia.database.logs import EpisodeLog
+from sotopia.database.persistent_profile import EnvironmentProfile
+from sotopia.database.persistent_profile import AgentProfile
+import json
+import numpy as np
+
+# tag = "pilot-2_checkpoint_improve-0_epoch-3_gpt-3.5-turbo_dev"
+# target_model = "custom_model"
+
+# hard_envs = ["01HJPQ34Y3S1TDPTRX1CCH6VPG", "01HJPQ34ZG9WZEDX6BV5QZB1QG"]
+
+def gen_target_result_dict(envs: list, tag: str, target_model: str) -> dict:
+    target_result_by_env = []
+    for env_profile_id in envs:
+
+        env = EnvironmentProfile.get(env_profile_id)
+        target_result_dict = {"env_profile_id": env_profile_id,
+                              "scenario": env.scenario,
+                              "target_as_agent_1": {},
+                              "target_as_agent_2": {}
+                              }
+
+        target_result_dict["target_as_agent_1"] = {
+            "agent_env_goal": env.agent_goals[0],
+            "agent_performance_by_profile": []
+        }
+
+        target_result_dict["target_as_agent_2"] = {
+            "agent_env_goal": env.agent_goals[1],
+            "agent_performance_by_profile": []
+        }
+
+        eps = list(EpisodeLog.find(EpisodeLog.tag == tag,
+                                   EpisodeLog.environment == env_profile_id))
+
+        for i in range(len(eps)):
+            if eps[i].models[1] == target_model:  # target as agent 1
+
+                agent_id = eps[i].agents[0]
+                agent_profile = list(AgentProfile.find(
+                    AgentProfile.pk == agent_id))[0]
+                agent_first_name, agent_last_name = agent_profile.first_name, agent_profile.last_name
+                agent_performance_dict = {
+                    "agent_profile_id": agent_id,
+                    "agent_first_name": agent_first_name,
+                    "agent_last_name": agent_last_name,
+                    "reward": eps[i].rewards[0],
+                    "reasoning": eps[i].reasoning
+                }
+                target_result_dict["target_as_agent_1"]["agent_performance_by_profile"].append(
+                    agent_performance_dict)
+
+            if eps[i].models[2] == target_model:
+                agent_id = eps[i].agents[1]
+                agent_profile = list(AgentProfile.find(
+                    AgentProfile.pk == agent_id))[0]
+                agent_first_name, agent_last_name = agent_profile.first_name, agent_profile.last_name
+                agent_performance_dict = {
+                    "agent_profile_id": agent_id,
+                    "agent_first_name": agent_first_name,
+                    "agent_last_name": agent_last_name,
+                    "reward": eps[i].rewards[1],
+                    "reasoning": eps[i].reasoning
+                }
+                target_result_dict["target_as_agent_2"]["agent_performance_by_profile"].append(
+                    agent_performance_dict)
+
+        target_result_by_env.append(target_result_dict)
+
+    return target_result_by_env
+
+
+def eval_average(target_result_by_env: dict, tag: str) -> dict:
+    avg_dict = {
+        "believability": 0.0,
+        "relationship": 0.0,
+        "knowledge": 0.0,
+        "secret": 0.0,
+        "social_rules": 0.0,
+        "financial_and_material_benefits": 0.0,
+        "goal": 0.0,
+        "overall_score": 0.0
+    }
+
+    eps = list(EpisodeLog.find(EpisodeLog.tag == tag))
+
+    for result_dict in target_result_by_env:
+        for key in avg_dict:
+            if len(result_dict["target_as_agent_1"]["agent_performance_by_profile"]) == 0:
+                perf_as_agent_1 = 0
+            else:
+                perf_as_agent_1 = np.sum([
+                    agent_profile["reward"][1][key] for agent_profile in result_dict["target_as_agent_1"]["agent_performance_by_profile"]])
+            if len(result_dict["target_as_agent_2"]["agent_performance_by_profile"]) == 0:
+                perf_as_agent_2 = 0
+            else:
+                perf_as_agent_2 = np.sum([
+                    agent_profile["reward"][1][key] for agent_profile in result_dict["target_as_agent_2"]["agent_performance_by_profile"]])
+            # print(len(result_dict["target_as_agent_1"]["agent_performance_by_profile"]))
+            # print(len(result_dict["target_as_agent_2"]["agent_performance_by_profile"]))
+            # avg_dict[key] += (perf_as_agent_1 + perf_as_agent_2) / 2 / len(target_result_by_env)
+            avg_dict[key] += (perf_as_agent_1 + perf_as_agent_2) / len(eps)
+            # avg_dict[key] += (perf_as_agent_1 + perf_as_agent_2) / 14
+
+    return avg_dict
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--tag", type=str, required=True)
+    parser.add_argument("--target-model", type=str, default="custom_model")
+    parser.add_argument("--env-ids-tag", type=str, required=True)
+    parser.add_argument("--out-dir", type=str, required=True)
+    args = parser.parse_args()
+
+    with open("resources/env_ids.json", 'r') as f:
+        env_dict = json.loads(f.read())
+    envs = env_dict[args.env_ids_tag]
+
+    target_result_by_env = gen_target_result_dict(envs=envs, target_model=args.target_model, tag=args.tag)
+
+    avg_dict = eval_average(target_result_by_env, tag=args.tag)
+
+    if not os.path.isdir(args.out_dir):
+        os.mkdir(args.out_dir)
+    with open(os.path.join(args.out_dir, f"{args.tag}.json"), 'w') as f:
+        f.write(json.dumps(avg_dict, indent=4))
+    with open(os.path.join(args.out_dir, f"dict.json"), 'w') as f:
+        f.write(json.dumps(target_result_by_env, indent=4))
+
+
+if __name__ == "__main__":
+    main()
+
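
Note that eval_average sums each reward dimension over both agent roles and divides by len(eps), the total number of episodes under the tag, rather than by the number of environments (the commented-out lines show the alternatives that were tried). The script writes {tag}.json with the averaged metrics and dict.json with the per-environment breakdown to --out-dir; a hypothetical invocation and post-hoc read (tag and directory names are illustrative):

# python3 eval_score.py --tag my_eval_tag --env-ids-tag sotopia_hard_env_id --out-dir outputs
import json

with open("outputs/my_eval_tag.json") as f:
    avg = json.load(f)

# keys mirror avg_dict in eval_score.py
print(avg["goal"], avg["overall_score"])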

llm_self_train/monitor_and_submit.py (+19 -8)

@@ -5,26 +5,37 @@
 import multiprocessing
 import time
 import json
-from pipelines.monitor_utils import check_log_and_submit_deploy, check_log_and_cancel_deploy
+import shutil
+
+os.umask(0o000)
 
 with open('config.yml', 'r') as f:
     config = yaml.safe_load(f)
 
-with open("resources/deploy_config.yml", 'r') as f:
+
+log_dir = f"{config['script_dir']}/logs/{config['experiment_name']}"
+if not os.path.exists(log_dir):
+    os.makedirs(log_dir)
+    print(f"Created directory {log_dir}")
+
+if not os.path.isfile(os.path.join(log_dir, "deploy_config.yml")):
+    source_deploy_file = "resources/deploy_config.yml"
+    shutil.copy(source_deploy_file, log_dir+'/')
+    print("Copied deploy_config.yml")
+
+
+with open(os.path.join(log_dir, "deploy_config.yml"), 'r') as f:
     deploy_config = yaml.safe_load(f)
 
-deploy_config['log_dir'] = f"{config['script_dir']}/logs/{config['experiment_name']}"
+deploy_config['log_dir'] = log_dir
 deploy_config['tmp_dir'] = f"{config['script_dir']}/tmp/{config['experiment_name']}"
 
-with open('resources/deploy_config.yml', 'w') as f:
+with open(os.path.join(log_dir, "deploy_config.yml"), 'w') as f:
     yaml.dump(deploy_config, f)
 
+from pipelines.monitor_utils import check_log_and_submit_deploy, check_log_and_cancel_deploy
 
 def main():
-    os.umask(0o000)
-    if not os.path.exists(deploy_config["log_dir"]):
-        os.makedirs(deploy_config["log_dir"])
-        print(f"Created directory {deploy_config['log_dir']}")
     if not os.path.exists(deploy_config["tmp_dir"]):
         os.makedirs(deploy_config["tmp_dir"])
         print(f"Created directory {deploy_config['tmp_dir']}")

llm_self_train/pipelines/monitor_deploy_and_run_eval.py (+3 -2)

@@ -6,7 +6,8 @@
 with open('config.yml', 'r') as f:
     config = yaml.safe_load(f)
 
-with open("resources/deploy_config.yml", 'r') as f:
+log_dir = f"{config['script_dir']}/logs/{config['experiment_name']}"
+with open(os.path.join(log_dir, "deploy_config.yml"), 'r') as f:
     deploy_config = yaml.safe_load(f)
 
 
@@ -52,7 +53,7 @@ def run_eval():
     commands = f"""
     cd {config['script_dir']}
    conda activate myenv
-    bash pipelines/submit_eval.sh > {deploy_config['log_dir']}/eval_results_{deploy_config['ckpt_name']}.txt
+    bash {os.path.join(log_dir, f"submit_eval_{deploy_config['ckpt_name']}.sh")} > {deploy_config['log_dir']}/eval_results_{deploy_config['ckpt_name']}.txt
     """
     subprocess.run(commands, shell=True)
 
llm_self_train/pipelines/monitor_eval_and_stop_deploy.py (+2 -1)

@@ -6,7 +6,8 @@
 with open('config.yml', 'r') as f:
     config = yaml.safe_load(f)
 
-with open("resources/deploy_config.yml", 'r') as f:
+log_dir = f"{config['script_dir']}/logs/{config['experiment_name']}"
+with open(os.path.join(log_dir, "deploy_config.yml"), 'r') as f:
     deploy_config = yaml.safe_load(f)
 
 def main():
