
Organize llm selftrain and update README #163

Merged

8 commits merged on Mar 10, 2024

Conversation

@ruiyiw (Collaborator) commented Mar 10, 2024

Closes #

📑 Description

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/description (e.g. feature/add-llm-agents)
  • Ready for code review
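The branch-naming convention in the checklist can be sketched as a simple pattern test; the set of allowed type prefixes below is an assumption for illustration, not taken from this repo's actual conventions:

```shell
# Hypothetical check that a branch name follows the type/description
# convention, e.g. feature/add-llm-agents. Allowed types are assumed.
branch="feature/add-llm-agents"
if echo "$branch" | grep -Eq '^(feature|bug|fix|docs|chore)/[a-z0-9-]+$'; then
  echo "valid"
else
  echo "invalid"
fi
```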

ℹ Additional Information

ruiyiw and others added 8 commits March 9, 2024 23:53
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use argparse rather than sys.argv

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
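The "use argparse rather than sys.argv" commit above presumably replaced positional `sys.argv` indexing with an argument parser. A minimal sketch of what such a change looks like — the script purpose and flag names here are illustrative assumptions, not taken from this repo:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical flags for a scenario-generation script; names are
    # illustrative. Passing argv=None makes parse_args read sys.argv.
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-scenarios", type=int, default=10)
    parser.add_argument("--output-dir", default="data/scenarios")
    return parser.parse_args(argv)

# Parse an explicit argument list (useful for testing without sys.argv).
args = parse_args(["--num-scenarios", "5"])
print(args.num_scenarios, args.output_dir)  # → 5 data/scenarios
```

Compared to `sys.argv` indexing, this gives type conversion, defaults, and a generated `--help` for free.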

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Organize data process

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
@ruiyiw ruiyiw merged commit 81d3ace into main Mar 10, 2024
3 checks passed
lwaekfjlk added a commit that referenced this pull request Mar 13, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize llm deploy and update README (#160)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* Support Otree-based Human Eval (#145)

* Feature/support scripts for env fetching from db (#147)

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* Bug/fix-human-eval-official-study-bug (#149)

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* Feature/Finalize Human Evaluation (#153)

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* add code for human eval plot in the paper (#156)

* Organize data process and update README (#158)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)


* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)

lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy and update README (#160)

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
Signed-off-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)


* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>


* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


* Organize llm deploy and update README (#161)


* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)


* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


* Organize llm deploy and update README (#160)


* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize data process


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


* Organize llm deploy and update README (#161)


* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md
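The paired t-test step from #155 ("paired-t-testing between all model pairs") compares two models on the same set of episodes by testing whether the mean per-episode score difference is zero. A minimal pure-Python sketch, with hypothetical model names and scores — not the repository's implementation:

```python
import math
import statistics
from itertools import combinations

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical per-episode goal scores for three models on the same episodes.
model_scores = {
    "gpt-4": [7.2, 6.8, 8.1, 7.5],
    "gpt-3.5": [6.1, 6.4, 7.0, 6.2],
    "mistral-sft": [6.5, 6.0, 7.4, 6.8],
}

# Every model pair, as in "paired-t-testing between all model pairs".
for m1, m2 in combinations(model_scores, 2):
    t = paired_t_statistic(model_scores[m1], model_scores[m2])
    print(f"{m1} vs {m2}: t = {t:.2f}")
```

`scipy.stats.ttest_rel` computes the same statistic along with a two-sided p-value (df = n - 1).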

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes
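For context on the #126 scripts: LoRA fine-tuning wraps the base model with a small low-rank adapter instead of updating all weights, and QLoRA does the same on top of a quantized base model. A minimal configuration sketch using Hugging Face `peft` — the hyperparameters and target modules are illustrative assumptions, not the repository's actual settings:

```python
# Illustrative LoRA adapter config (hypothetical hyperparameters).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    lora_dropout=0.05,                    # dropout on the adapter path
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    bias="none",
    task_type="CAUSAL_LM",
)
# QLoRA differs mainly in loading the base model 4-bit quantized
# (e.g. via bitsandbytes) before attaching the same kind of adapter.
```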

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process


* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
Signed-off-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)


* Organize llm deploy and update README (#160)

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------…