Organize llm selftrain and update README #163
Merged
Squashed commit history:

* Modularize data generation (#144)
  * revamped cloud sync
  * finished cloud utils
  * modify cloud utils and add monitor function for downloading
  * added requirements
  * need test
  * finished training on babel
  * Add selftrain scripts for deploy and eval on babel
  * modularize self-train improve step
  * modularize data generation
  * Update README.md adding run instruction for scenario generations
  * add system args and rename
  * use args parser rather than sys
  * Update README.md update arguments
  * delete test functions
  * move generate scenario code outside
  * move file and delete useless files
  * reset path
  * make sure SFT scenarios have 10 agent combos
  * Update README.md reorder README
* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)
  * added S3 script
  * working version of each script
  * minor changes
  * minor changes
* Support Otree-based Human Eval (#145)
  * add the initial version of otree-based human eval
  * delete game
  * add payment info
  * modified instruction page
  * support user ID matching
  * fix deployment file code
  * support timer
  * add reasoning stringfield
  * support queue based data
  * support queue, add personal information, polish front-end
  * change name of directory
  * modified instruction page style
  * changed position of next button
  * move the next button to the middle
  * debugging for reward prompt regex matching
  * modify the frontend and fix queue popping time bug
  * support input prolific ID
  * polish frontend
  * support multiple timers in one page
  * polish front-end style for multi-choices
  * delete profile png
  * support pilot study data and format
  * split pilot study and official study
  * delete useless file
  * add two different thank-you pages
  * modify name in url
  * ready to release pilot study
  * fix same input bug
  * fix frontend bugs
  * add prompt for reasoning
  * add timer and change time limit
  * choose the debug mode
* Feature/support scripts for env fetching from db (#147)
  * get env from redis
  * support scripts for env fetch from db
* Pick qualified annotators for human evaluation & support official human eval test (#148)
  * add human eval analysis for qualification test
  * fix bug in the random choice
  * fix bug for incomplete filling
  * clean useless elements for timer
  * make debug false
  * modify the pilot study payment info
  * ready to publish official study
  * delete debug code
  * change name
* Bug/fix-human-eval-official-study-bug (#149)
  * add human eval analysis for qualification test
  * fix bug in the random choice
  * fix bug for incomplete filling
  * clean useless elements for timer
  * make debug false
  * modify the pilot study payment info
  * ready to publish official study
  * delete debug code
  * change name
  * release the new official study with new data
  * fix official study bug
  * fix bugs
  * fix bugs
* Feature/support official human eval analysis and delete sotopia_tmp file (#150)
  * support Pearson correlation analysis
  * delete tmp file
* Add readme img (#152)
* Feature/support official human eval mean + correlation analysis (#151)
  * support Pearson correlation analysis
  * delete tmp file
  * support official distribution of human eval
  * add official test analysis
  * support human eval analysis
  * delete useless things
  * support the full analysis code
* Feature/Finalize Human Evaluation (#153)
  * support Pearson correlation analysis
  * delete tmp file
  * support official distribution of human eval
  * add official test analysis
  * support human eval analysis
  * delete useless things
  * support the full analysis code
  * finalize the final round of human eval and get all results
  * add all the code used for paper testing
  * add all the data and clean the code
  * clean the code
* delete together-ai-ft part (#154)
* Feature/support paired t test (#155)
  * support t-test and fix None scenario in the final human eval data
  * fully support all the paired t-testing between all model pairs
  * delete paired t test
* add code for human eval plot in the paper (#156)
* Update README.md
* Organize data generation and update README for release (#157)
* Organize data process and update README (#158)
* Organize data process
β¦ into modular-codebase
* Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * 
add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * 
fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang 
<ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md 
--------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial 
study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix 
same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * 
modify cloud utils and add a monitor function for downloading
* added requirements (needs testing)
* finished training on Babel
* Add selftrain scripts for deploy and eval on Babel
* modularize the self-train improve step
* modularize data generation
* Update README.md: add run instructions for scenario generation
* add system args and rename; use an argument parser rather than sys.argv (a minimal sketch follows the commit list below)
* Update README.md: update arguments
* delete test functions
* move the scenario-generation code outside
* move files and delete unused files
* reset paths
* make sure SFT scenarios have 10 agent combos
* Update README.md: reorder README
* Feature: Full-parameter, LoRA, and QLoRA finetune scripts for Mistral (#126) (see the LoRA sketch below)
* Support oTree-based human eval (#145)
* Feature/support scripts for env fetching from DB (#147): get envs from Redis (see the fetch sketch below)
* Pick qualified annotators for human evaluation & support the official human eval test (#148)
* Bug/fix-human-eval-official-study-bug (#149)
* Feature/support official human eval analysis and delete the sotopia_tmp file (#150)
* Add README img (#152)
* Feature/support official human eval mean + correlation analysis (#151)
* Feature/Finalize Human Evaluation (#153): finalize the final round of human eval, collect all results, and add the code and data used for paper testing
* delete the together-ai-ft part (#154)
* Feature/support paired t-test (#155): support the t-test, fix None scenarios in the final human eval data, and support paired t-testing between all model pairs (see the analysis sketch below)
* add code for the human eval plot in the paper (#156)
* Update README.md
* Organize data process and update README (#158)
* Organize data generation and update README for release (#157)
* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
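For reference, a minimal sketch of the kind of argparse entry point the "use an argument parser rather than sys.argv" commit describes; the flag names and defaults are illustrative placeholders, not the repository's actual scenario-generation arguments.

```python
# Hypothetical sketch of an argparse-based entry point replacing sys.argv
# parsing; flag names and defaults are placeholders for illustration only.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-combos", type=int, default=10,
                        help="number of agent combos per scenario")
    parser.add_argument("--output-dir", type=str, default="./scenarios",
                        help="directory where generated scenarios are written")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Generating scenarios with {args.num_combos} agent combos "
          f"into {args.output_dir}")
```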
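In the spirit of the finetune scripts in #126, here is a hypothetical LoRA setup for a Mistral base model; the model ID, rank, and target modules are assumptions, and the PR's actual full-parameter and QLoRA variants may be configured differently.

```python
# Hypothetical LoRA finetuning setup for a Mistral base model; hyperparameters
# and target modules are placeholders, not the repository's actual script.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,                         # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```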
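A rough illustration of fetching environment records from Redis as in #147; the `env:` key prefix and JSON-string values are invented for this example and are not the project's real storage schema.

```python
# Illustrative sketch of pulling environment/scenario records out of Redis;
# the "env:" key prefix and JSON string values are assumptions for this
# example, not the project's actual schema.
import json

import redis


def fetch_envs(prefix: str = "env:") -> list[dict]:
    client = redis.Redis(host="localhost", port=6379, decode_responses=True)
    envs = []
    for key in client.scan_iter(match=f"{prefix}*"):
        raw = client.get(key)  # assumes each env is stored as a JSON string
        if raw:
            envs.append(json.loads(raw))
    return envs


if __name__ == "__main__":
    for env in fetch_envs():
        print(env.get("scenario", "<no scenario field>"))
```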
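Finally, a small self-contained example of the two analyses named in #151 and #155 (Pearson correlation and a paired t-test); all scores below are made up rather than taken from the study's data.

```python
# Pearson correlation between two sets of ratings, and a paired t-test
# between two models rated on the same episodes. All scores are fabricated
# for illustration.
from scipy import stats

# Hypothetical per-episode ratings from two raters of the same episodes.
rater_a = [7.0, 5.5, 8.0, 6.0, 9.0, 4.5]
rater_b = [6.5, 5.0, 8.5, 6.5, 8.0, 5.0]
r, r_p = stats.pearsonr(rater_a, rater_b)
print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")

# Hypothetical per-episode scores for two models on the same episodes.
model_a = [7.0, 5.5, 8.0, 6.0, 9.0, 4.5]
model_b = [6.0, 5.0, 7.5, 6.5, 8.0, 4.0]
t, t_p = stats.ttest_rel(model_a, model_b)
print(f"paired t = {t:.3f} (p = {t_p:.3f})")
```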
lwaekfjlk added a commit that referenced this pull request on Mar 13, 2024
* Organize data generation and update README for release (#157)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#159)
* Organize llm deploy and update README (#160)
* organize llm selftrain
* Merge main to modular-codebase (#162)
* Organize llm deploy and update README (#161)
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> ---------β¦
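A recurring item in the #144 log above is switching the scenario-generation entry point from reading raw sys.argv to an argument parser ("use args parses rather than sys"). The snippet below is only an illustrative sketch of that kind of change; the flag names and defaults are assumptions, not the repository's actual interface.

```python
import argparse


def parse_args() -> argparse.Namespace:
    # Hypothetical flags; the real script's argument names may differ.
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-scenarios", type=int, default=100,
                        help="number of scenarios to generate")
    parser.add_argument("--agent-combos", type=int, default=10,
                        help="agent combinations per scenario (the #144 log requires 10 for SFT)")
    parser.add_argument("--output-dir", type=str, default="./scenarios",
                        help="directory where generated scenarios are written")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args.num_scenarios, args.agent_combos, args.output_dir)
```

Compared with indexing into sys.argv, argparse gives typed values, defaults, and an auto-generated --help message, which matches the "add system args and rename" and "Update README.md update arguments" steps in the same log.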
lwaekfjlk added a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157): squashed body repeating the commit messages of #144, #126, #145, #147, #148, #149, #150, #152, #151, #153, #154, #155, and #156 listed above, with their co-author trailers
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159): squashed body repeating the same history, including Organize data process and update README (#158) and #157
* Organize llm deploy and update README (#160): squashed body repeating the same history
* organize llm selftrain
* Merge main to modular-codebase (#162): squashed body repeating the same history, including Organize llm deploy and update README (#161), #157, #158, #159, and #160
…
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> ---------β¦
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024:
Its squashed commit message lists the full modular-codebase history, with the commits of #126, #144-#145, and #147-#158 shown above re-nested under its merge entries:
* Organize data generation and update README for release (#157)
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159)
* Organize llm deploy and update README (#160)
* organize llm selftrain
* Merge main to modular-codebase (#162)
* Organize llm deploy and update README (#161)
---------
Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
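The squashed history above repeatedly mentions paired t-tests between every pair of models on the human-evaluation scores (#155). A minimal sketch of that kind of analysis, assuming one aligned human-eval score per scenario for each model; the model names, scores, and use of scipy below are illustrative, not the repository's actual code:

```python
from itertools import combinations

from scipy.stats import ttest_rel

# Hypothetical per-scenario human-eval scores, aligned by scenario order.
scores = {
    "gpt-4": [3.2, 2.8, 3.5, 3.0],
    "mistral-sft": [2.9, 2.6, 3.1, 2.7],
    "mistral-selftrain": [3.1, 2.9, 3.4, 2.9],
}

# Run a paired t-test for every model pair on the same scenarios.
for model_a, model_b in combinations(scores, 2):
    t_stat, p_value = ttest_rel(scores[model_a], scores[model_b])
    print(f"{model_a} vs {model_b}: t = {t_stat:.3f}, p = {p_value:.3f}")
```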
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157)
  * Modularize data generation (#144)
  * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)
  * Support Otree-based Human Eval (#145)
  * Feature/support scripts for env fetching from db (#147)
  * Pick qualified annotators for human evaluation & support official human eval test (#148)
  * Bug/fix-human-eval-official-study-bug (#149)
  * Feature/support official human eval analysis and delete sotopia_tmp file (#150)
  * Add readme img (#152)
  * Feature/support official human eval mean + correlation analysis (#151)
  * Feature/Finalize Human Evaluation (#153)
  * delete together-ai-ft part (#154)
  * Feature/support paired t test (#155)
  * add code for human eval plot in the paper (#156)
  * Update README.md
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#160)
* organize llm selftrain
* Merge main to modular-codebase (#162)
* Organize llm deploy and update README (#161)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
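The squashed history above repeatedly references the human-eval analysis pieces (Pearson correlation in #150/#151, paired t-tests between model pairs in #155). As a rough illustration only — the file name, column names, and model labels below are assumptions for the sketch, not the repository's actual interface — the core of that analysis could look like this with SciPy:

```python
# Hypothetical sketch of the human-eval analysis referenced in #151/#155.
# Assumes a CSV with one row per evaluated episode and placeholder columns
# "human_score", "model_score", "model_a_score", "model_b_score";
# these names are illustrative, not the repo's actual schema.
import pandas as pd
from scipy import stats

df = pd.read_csv("human_eval_scores.csv")  # placeholder path

# Pearson correlation between model-based and human ratings (#151).
r, r_pvalue = stats.pearsonr(df["model_score"], df["human_score"])
print(f"Pearson r = {r:.3f} (p = {r_pvalue:.3g})")

# Paired t-test between two models rated on the same episodes (#155).
t, t_pvalue = stats.ttest_rel(df["model_a_score"], df["model_b_score"])
print(f"paired t = {t:.3f} (p = {t_pvalue:.3g})")
```

The paired form (`ttest_rel`) fits because both models are scored on the same set of episodes; an unpaired test would overstate the variance.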
lwaekfjlk added a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157) * Modularize data generation (#144) * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * Support Otree-based Human Eval (#145) * Feature/support scripts for env fetching from db (#147) * Pick qualified annotators for human evaluation & support official human eval test (#148) * Bug/fix-human-eval-official-study-bug (#149) * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * Feature/Finalize Human Evaluation (#153) * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * add code for human eval plot in the paper (#156) * Update README.md * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Organize data process and update README (#158) * Organize llm deploy and update README (#160) * organize llm selftrain * Merge main to modular-codebase (#162) * Organize llm deploy and update README (#161) * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27) Signed-off-by: Haofei Yu <1125027232@qq.com>
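The human-eval commits above (#150, #151, #155) add Pearson-correlation and paired t-test analysis over the collected ratings. The snippet below is only a rough sketch of those two statistics, with made-up scores and variable names standing in for the repo's actual data loading:

```python
"""Illustrative sketch only: the real analysis lives in this repo's
human-eval scripts; the data layout here is an assumption."""
from scipy.stats import pearsonr, ttest_rel

# Hypothetical per-episode scores on one evaluation dimension,
# aligned so index i refers to the same episode for every rater/model.
human_scores = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0]
gpt4_scores = [2.5, 4.0, 2.5, 4.5, 3.0, 4.5]      # automatic evaluator
model_a_scores = [2.0, 3.5, 2.0, 4.0, 3.0, 3.5]   # e.g. the SFT model
model_b_scores = [3.0, 4.0, 2.5, 4.5, 3.5, 4.0]   # e.g. the self-trained model

# Pearson correlation between human and automatic ratings (#150/#151).
r, p = pearsonr(human_scores, gpt4_scores)
print(f"human vs. GPT-4 evaluator: r={r:.3f}, p={p:.3f}")

# Paired t-test between two models judged on the same episodes (#155).
t, p = ttest_rel(model_a_scores, model_b_scores)
print(f"model A vs. model B: t={t:.3f}, p={p:.3f}")
```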
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
The pushed commit carries the same squashed history as the merge above:

* Organize data generation and update README for release (#157)
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#160)
* organize llm selftrain
* Merge main to modular-codebase (#162)
* Organize llm deploy and update README (#161)
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
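For context on the human-eval analysis steps referenced above (the Pearson correlation in #151 and the paired t-tests in #155), here is a minimal, hypothetical sketch of how such analyses are often run with scipy. The data layout and names (`human_scores`, `auto_scores`, `scores_by_model`) and the example model labels are illustrative assumptions, not the repository's actual code.

```python
# Hypothetical sketch (not the repo's actual analysis scripts) of the analyses
# named in PRs #151 and #155: Pearson correlation between two sets of ratings
# of the same episodes, and paired t-tests between every pair of evaluated models.
from itertools import combinations
from scipy import stats

# Illustrative data: one aggregated score per episode, aligned across raters.
human_scores = [7.0, 6.5, 8.0, 7.5, 6.0]
auto_scores = [6.8, 6.9, 7.8, 7.2, 6.3]

r, p = stats.pearsonr(human_scores, auto_scores)
print(f"Pearson r between human and automatic ratings: r={r:.3f}, p={p:.3f}")

# Paired t-test between all model pairs on the same set of episodes.
scores_by_model = {
    "baseline": [6.1, 5.9, 7.0, 6.4, 5.8],
    "sft": [6.5, 6.2, 7.3, 6.8, 6.0],
    "selftrain": [6.9, 6.6, 7.6, 7.1, 6.4],
}
for model_a, model_b in combinations(scores_by_model, 2):
    t, p = stats.ttest_rel(scores_by_model[model_a], scores_by_model[model_b])
    print(f"{model_a} vs {model_b}: t={t:.3f}, p={p:.3f}")
```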
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157)
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#160)
* organize llm selftrain
* Merge main to modular-codebase (#162)
* Modularize data generation (#144)
* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)
* Support Otree-based Human Eval (#145) *
Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless 
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
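Several of the analysis commits folded into the squashed history above add statistical testing for the human evaluation: Pearson correlation analysis (#150, #151) and paired t-tests between all model pairs (#155, which also fixes episodes whose scenario is None). The sketch below is illustrative only, using hypothetical score arrays rather than the repository's actual human-eval data, and shows the general shape of such an analysis with scipy.

```python
# Illustrative sketch only: the scores and variable names are hypothetical,
# not taken from the repository's human-eval pipeline.
import numpy as np
from scipy.stats import pearsonr, ttest_rel

# Hypothetical per-episode scores on the same evaluation dimension.
human_scores = np.array([6.0, 7.5, 5.0, 8.0, 6.5])
gpt4_scores  = np.array([6.5, 7.0, 5.5, 8.5, 6.0])
sft_scores   = np.array([5.5, 6.0, 4.5, 7.0, 5.0])
# (#155 also filters out episodes whose scenario is None before pairing.)

# Pearson correlation between human and automatic ratings (as in #150/#151).
r, r_pval = pearsonr(human_scores, gpt4_scores)
print(f"Pearson r = {r:.3f} (p = {r_pval:.3f})")

# Paired t-test between two models rated on the same episodes (as in #155).
t, t_pval = ttest_rel(gpt4_scores, sft_scores)
print(f"paired t = {t:.3f} (p = {t_pval:.3f})")
```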
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157)
* Modularize data generation (#144)
* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)
* Support Otree-based Human Eval (#145)
* Feature/support scripts for env fetching from db (#147)
* Pick qualified annotators for human evaluation & support official human eval test (#148)
* Bug/fix-human-eval-official-study-bug (#149)
* Feature/support official human eval analysis and delete sotopia_tmp file (#150)
* Feature/support official human eval mean + correlation analysis (#151)
* Add readme img (#152)
* Feature/Finalize Human Evaluation (#153)
* delete together-ai-ft part (#154)
* Feature/support paired t test (#155)
* add code for human eval plot in the paper (#156)
* Update README.md
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159, #160)
* Organize data process and update README (#158)
--------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang
<ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * organize llm selftrain * Merge main to modular-codebase (#162) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete 
sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * 
Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless 
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
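The switch from raw sys.argv parsing to argparse in the data-generation scripts is easy to picture. A minimal sketch, assuming hypothetical flag names (`--num-scenarios`, `--output-dir`) rather than the repository's actual arguments:

```python
# Minimal sketch of the sys.argv -> argparse switch for scenario generation.
# Flag names and defaults are hypothetical, not the repo's actual CLI.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-scenarios", type=int, default=10,
                        help="number of scenarios to generate")
    parser.add_argument("--output-dir", type=str, default="data/scenarios",
                        help="where to write the generated scenarios")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Generating {args.num_scenarios} scenarios into {args.output_dir}")
```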
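For the Mistral finetune scripts, the LoRA variant amounts to wrapping the base model with a PEFT adapter config before training. A minimal sketch using Hugging Face `transformers` and `peft`; the rank, alpha, and target modules are illustrative, not the values used in the repo:

```python
# Minimal LoRA setup sketch for a Mistral base model (hyperparameters are
# illustrative only; the repo's scripts may differ).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                  # rank of the LoRA update matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```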
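Fetching environments from the DB amounts to scanning Redis for environment records. A minimal sketch with `redis-py`, assuming environments are stored as JSON strings under an `EnvironmentProfile:*` key prefix; the real scripts may go through redis-om models instead:

```python
# Minimal sketch of pulling environment records out of Redis.
# Assumes plain JSON strings under an "EnvironmentProfile:*" key prefix.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

environments = []
for key in r.scan_iter("EnvironmentProfile:*"):
    raw = r.get(key)
    if raw:
        environments.append(json.loads(raw))

print(f"fetched {len(environments)} environment profiles")
```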
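The Pearson-correlation and paired t-test analyses can both be expressed in a few lines of SciPy. A minimal sketch with placeholder scores; the real analysis reads per-episode human and model ratings from the eval data:

```python
# Minimal sketch of the correlation and paired t-test analyses.
# The score arrays below are placeholders, not real eval data.
from scipy import stats

human_scores   = [3.2, 4.0, 2.5, 3.8, 4.5, 3.0]  # human ratings per episode
model_a_scores = [3.0, 4.2, 2.8, 3.5, 4.4, 3.1]  # model A ratings, same episodes
model_b_scores = [2.7, 3.9, 2.4, 3.2, 4.0, 2.9]  # model B ratings, same episodes

# Correlation between human and automatic ratings.
r, p_corr = stats.pearsonr(human_scores, model_a_scores)
print(f"Pearson r = {r:.3f} (p = {p_corr:.3f})")

# Paired t-test between two models rated on the same episodes.
t, p_ttest = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"paired t = {t:.3f} (p = {p_ttest:.3f})")
```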
lwaekfjlk added a commit that referenced this pull request on Mar 14, 2024
sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * 
Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless 
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> ---------β¦
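The human-eval items above (#150, #151, #155) mention computing Pearson correlations for the human ratings and running paired t-tests between model pairs. The repository's actual analysis scripts are not reproduced on this page, so the following is only a minimal sketch of that kind of analysis with SciPy; the file name and column names (`human_eval_scores.csv`, `human_score`, `model_score`, `model_a_score`, `model_b_score`) are illustrative assumptions, not the project's real data format.

```python
# Hypothetical sketch (not the repository's actual analysis code) of the
# statistics named in the changelog: Pearson correlation between human and
# automatic ratings, and a paired t-test between two models scored on the
# same episodes. File and column names are made up for illustration.
import pandas as pd
from scipy.stats import pearsonr, ttest_rel

df = pd.read_csv("human_eval_scores.csv")  # assumed: one row per evaluated episode

# Agreement between human annotators and the automatic evaluator.
r, p_r = pearsonr(df["human_score"], df["model_score"])
print(f"Pearson r = {r:.3f} (p = {p_r:.3g})")

# Paired t-test comparing two models rated on the same scenarios.
t, p_t = ttest_rel(df["model_a_score"], df["model_b_score"])
print(f"paired t = {t:.3f} (p = {p_t:.3g})")
```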
lwaekfjlk added a commit that referenced this pull request on Mar 14, 2024
* support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang 
<ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * organize llm selftrain * Merge main to modular-codebase (#162) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete 
sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * 
Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless 
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27) Signed-off-by: Haofei Yu <1125027232@qq.com>
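Several commits in Modularize data generation (#144) above ("add system args and rename", "use args parses rather than sys") describe moving the scenario-generation entry point from raw `sys.argv` indexing to an argparse CLI. A minimal sketch of that pattern follows; the flag names, defaults, and script purpose are illustrative assumptions, not the repository's actual arguments.

```python
# Hypothetical sketch of the change described by "use args parses rather
# than sys": an argparse CLI instead of positional sys.argv access.
# Flag names and defaults are illustrative, not the repo's real arguments.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-scenarios", type=int, default=10,
                        help="number of scenarios to generate")
    parser.add_argument("--output-dir", type=str, default="./scenarios",
                        help="directory for the generated scenarios")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Generating {args.num_scenarios} scenarios into {args.output_dir}")
```

Run as, for example, `python generate_scenarios.py --num-scenarios 10 --output-dir ./scenarios` (hypothetical script name).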
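The "revamped cloud sync" and "finished cloud utils" commits in #144, along with the "added S3 script" commit in #126, point at syncing training artifacts to cloud storage and monitoring downloads. The following is a small boto3 sketch of that kind of utility, assuming a hypothetical bucket name and key layout.

```python
# Illustrative cloud-sync helpers with boto3: push a local checkpoint
# directory to S3 and pull a single object back down. The bucket name
# and key layout are placeholders, not the project's real ones.
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "selftrain-checkpoints"  # hypothetical bucket


def upload_dir(local_dir: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://BUCKET/prefix/..."""
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(path, local_dir)}"
            s3.upload_file(path, BUCKET, key)


def download_file(key: str, local_path: str) -> None:
    """Fetch one object back, e.g. to monitor a run from another machine."""
    s3.download_file(BUCKET, key, local_path)
```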
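Feature/support scripts for env fetching from db (#147) adds scripts that pull environment records out of Redis before scenario generation. A minimal sketch, assuming profiles are stored as JSON strings under an `env:` key prefix; the key layout and field names are assumptions, not the project's actual schema.

```python
# A minimal sketch of fetching environment profiles from Redis, assuming
# they are stored as JSON strings under keys like "env:<pk>". The key
# prefix and field names are assumptions, not the repo's actual schema.
import json

import redis

client = redis.Redis(host="localhost", port=6379, decode_responses=True)


def fetch_envs(prefix: str = "env:") -> list[dict]:
    """Scan Redis for environment records and parse them into dicts."""
    envs = []
    for key in client.scan_iter(match=f"{prefix}*"):
        raw = client.get(key)
        if raw:
            envs.append(json.loads(raw))
    return envs


if __name__ == "__main__":
    for env in fetch_envs():
        print(env.get("codename"), env.get("scenario"))
```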
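Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) adds three finetuning variants. The snippet below is an illustrative LoRA/QLoRA configuration with `transformers` and `peft`, not the repository's script; the model ID, hyperparameters, and target modules are assumptions.

```python
# Illustrative LoRA / QLoRA setup for a Mistral base model with
# transformers + peft + bitsandbytes. Model ID, hyperparameters, and
# target modules are assumptions, not the repository's actual script.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_NAME = "mistralai/Mistral-7B-v0.1"

# QLoRA-style 4-bit quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; only these are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

A full-parameter run would skip both the quantization config and the adapter wrapping and train the base model directly.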
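The analysis commits ("support pearson correlation analysis" in #150/#151 and the paired t-tests in #155) boil down to two SciPy calls: Pearson correlation between two sets of ratings of the same episodes, and a paired t-test between two models scored on the same scenarios. A minimal sketch with placeholder score arrays:

```python
# Sketch of the two statistics named in the commit list: Pearson
# correlation between two sets of ratings of the same episodes, and a
# paired t-test between two models scored on the same scenarios.
# The score arrays are placeholders, not real evaluation data.
import numpy as np
from scipy.stats import pearsonr, ttest_rel

human_scores = np.array([7.0, 5.5, 8.0, 6.0, 9.0])
auto_scores = np.array([6.5, 5.0, 8.5, 6.5, 8.0])

r, r_p = pearsonr(human_scores, auto_scores)
print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")

# Paired t-test: index i is the same scenario rated for model A and model B
model_a = np.array([7.0, 5.5, 8.0, 6.0, 9.0])
model_b = np.array([6.0, 5.0, 7.5, 6.5, 8.0])

t, t_p = ttest_rel(model_a, model_b)
print(f"paired t = {t:.3f} (p = {t_p:.3f})")
```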
lwaekfjlk pushed a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157): bundles Modularize data generation (#144), the full-parameter/LoRA/QLoRA finetune scripts for Mistral (#126), the oTree-based human eval (#145), env fetching from the DB (#147), annotator qualification and the official human eval study (#148, #149), the human eval analysis and its finalization (#150, #151, #153), the README image (#152), removal of the together-ai finetuning part (#154), the paired t-tests (#155), the human eval plot for the paper (#156), and README updates
* Organize data process
* Organize llm deploy
* Organize llm deploy and update README (#159)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#160)
* Support Otree-based Human Eval (#145): add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug
* support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang 
<ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * organize llm selftrain * Merge main to modular-codebase (#162) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete 
sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * 
Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#161) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless 
files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * 
Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * 
Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support 
multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang 
<123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize llm deploy and update README (#160) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add 
readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md * Organize data process and update README (#158) * Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for 
env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: 
Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> (cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
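Several of the merged items above are only one-line commit titles, so a few illustrative sketches follow. For the env-fetching scripts ("Feature/support scripts for env fetching from db", #147), the gist is reading stored environment/scenario profiles back out of the Redis database that backs the project. This is a minimal sketch, assuming redis_om-style JSON documents under keys containing `EnvironmentProfile`; the actual key layout and the script in the repo may differ.

```python
import json

import redis  # redis-py; assumes a Redis Stack / RedisJSON-enabled instance


def fetch_env_profiles(host: str = "localhost", port: int = 6379) -> list[dict]:
    """Fetch stored environment profiles from Redis (illustrative sketch only)."""
    r = redis.Redis(host=host, port=port, decode_responses=True)
    profiles = []
    # Assumed key pattern: redis_om-style keys such as "...EnvironmentProfile:<pk>"
    for key in r.scan_iter(match="*EnvironmentProfile:*"):
        doc = r.json().get(key)  # assumes profiles are stored as JSON documents
        if doc:
            profiles.append(doc)
    return profiles


if __name__ == "__main__":
    envs = fetch_env_profiles()
    print(f"fetched {len(envs)} environment profiles")
    if envs:
        print(json.dumps(envs[0], indent=2)[:500])
```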
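For the human-eval analysis items ("support pearson correlation analysis", #151; "Feature/support paired t test", #155), the statistics are standard: Pearson's r measures agreement between human and GPT-4-based ratings, and a paired t-test checks whether two models' ratings on the same episodes differ significantly. The snippet below is a sketch with hypothetical column names (`human_score`, `gpt4_score`, `model_a_score`, `model_b_score`), not the repository's actual analysis script.

```python
import pandas as pd
from scipy import stats

# Hypothetical layout: one row per evaluated episode, with a human rating,
# a GPT-4 rating, and one score column per evaluated model.
df = pd.read_csv("human_eval_results.csv")

# Pearson correlation between human and automatic (GPT-4) ratings.
r, p = stats.pearsonr(df["human_score"], df["gpt4_score"])
print(f"Pearson r = {r:.3f} (p = {p:.3g})")

# Paired t-test between two models rated on the same episodes.
t, p = stats.ttest_rel(df["model_a_score"], df["model_b_score"])
print(f"paired t = {t:.3f} (p = {p:.3g})")
```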
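The finetuning item ("Feature: Full-parameter, lora, and qlora finetune script for mistral", #126) covers three training modes; the sketch below outlines only the general QLoRA recipe (4-bit base model plus LoRA adapters). The checkpoint name, ranks, and target modules are typical values assumed for illustration, not the configuration used in the repo's scripts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed base checkpoint

# Load the base model in 4-bit (QLoRA); requires bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; ranks/targets here are common defaults, not the repo's.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train with transformers.Trainer or trl's SFTTrainer on the SFT conversations.
```

Full-parameter finetuning drops the quantization and LoRA steps and trains all weights; plain LoRA keeps the adapter config but loads the base model in fp16/bf16 instead of 4-bit.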
lwaekfjlk added a commit that referenced this pull request on Mar 14, 2024
* Organize data generation and update README for release (#157) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for reasoning * add timer and change time limit * chose the debug mode --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> * Feature/support scripts for env fetching from db (#147) * get env from redis * support scripts for env fetch from db * Pick qualified annotators for human evaluation & support official human eval test (#148) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * Bug/fix-human-eval-official-study-bug (#149) * add human eval analysis for qualification test * fix bug that in the random choice * fix bug for incomplete filling * clean useless elements for timer * make debug false * modify the pilot study payment info * ready to publish offcial study * delete debug code * change name * release the new official study with new data * fix official study bug * fix bugs * fix bugs * Feature/support official human eval analysis and delete sotopia_tmp file (#150) * support pearson correlation analysis * delete tmp file * Add readme img (#152) * Feature/support official human eval mean + correlation analysis (#151) * support pearson correlation analysis * 
delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * Feature/Finalize Human Evaluation (#153) * support pearson correlation analysis * delete tmp file * support official distribution of human eval * add official test analysis * support huamn eval analysis * delete useless things * support the full analysis code * finalize the final round of human eval and get all results * add all the code used for paper testing * add all the data and clean the code * clean the code * delete together-ai-ft part (#154) * Feature/support paired t test (#155) * support t-test and fix None scenario in the final human eval data * fully support all the paired-t-testing between all model pairs * delete paired t test * add code for human eval plot in the paper (#156) * Update README.md --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Organize data process * Organize llm deploy * Organize llm deploy and update README (#159) * Modularize data generation (#144) * revamped cloud sync * finished cloud utils * modify cloud utils and add monitor function for downloading * added requirements * need test * finished training on babel * Add selftrain scripts for deploy and eval on babel * modularize self-train improve step * modularize data generation * Update README.md adding run instruction for scenario generations * add system args and rename * use args parses rather than sys * Update README.md update arguments * delete test functions * move generate scenario code outside * move file and delete useless files * reset path * make sure SFT scenarios have 10 agent combos * Update README.md reorder README --------- Co-authored-by: Jasonqi146 <jasonqi146@gmail.com> Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com> Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu> Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com> Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com> Co-authored-by: Haofei Yu <1125027232@qq.com> * Feature: Full-parameter, lora, and qlora finetune script for mistral (#126) * added S3 script * working version of each script * minor changes * minor changes * Support Otree-based Human Eval (#145) * add the initial version of otree-based human eval * delete game * add payment info * modified instruction page * support user ID matching * fix deployment file code * support timer * add reasoning stringfield * support queue based data * support queue, add personal information, polish front-end * change name of directory * modified instruction page style * changed position of next button * move the next button to the middle * debugging for reward prompt regex matching * modify the frontend and fix queue popping time bug * support input prolific ID * polish frontend * support multiple timer in one page * polish front-end style for multi-choices * delete profile png * support pilot study data and format * split pilot study and official study * delete useless file * add two different thank you page * modify name in url * ready to release pilot study * fix same input bug * fix frontend bugs * add prompt for 
reasoning
* add timer and change time limit
* choose the debug mode

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147): get env from Redis and add scripts for env fetch from the db (a sketch of this kind of fetch follows this list)
* Pick qualified annotators for human evaluation & support official human eval test (#148)
* Bug/fix-human-eval-official-study-bug (#149)
* Feature/support official human eval analysis and delete sotopia_tmp file (#150)
* Add readme img (#152)
* Feature/support official human eval mean + correlation analysis (#151): add official test analysis and support the full human eval analysis code, including Pearson correlation analysis (see the correlation sketch after this list)
* Feature/Finalize Human Evaluation (#153): finalize the final round of human eval, get all results, and add all the code and data used for paper testing
* delete together-ai-ft part (#154)
* Feature/support paired t test (#155): support paired t-tests between all model pairs and fix None scenarios in the final human eval data (see the t-test sketch after this list)
* add code for human eval plot in the paper (#156)
* Update README.md
* Organize data generation and update README for release (#157)
* Organize data process and update README (#158)
* Organize llm deploy and update README (#159, #160, #161)
* organize llm selftrain
* Merge main to modular-codebase (#162)

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
…
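The env fetch referenced in #147 amounts to pulling stored environment profiles back out of the Redis database. The sketch below is a minimal, hypothetical version using the plain redis-py client; the key prefix, field names, and the assumption that profiles are stored as RedisJSON documents are illustrative choices, not taken from the repository's actual scripts or schema.

```python
# Minimal sketch: fetch environment profiles from Redis.
# Assumes a Redis server with the RedisJSON module and a hypothetical
# "EnvironmentProfile:" key prefix; adjust host/port/prefix to your setup.
import redis


def fetch_envs(host: str = "localhost", port: int = 6379,
               prefix: str = "EnvironmentProfile:") -> list[dict]:
    r = redis.Redis(host=host, port=port, decode_responses=True)
    envs = []
    for key in r.scan_iter(match=f"{prefix}*"):
        doc = r.json().get(key)  # each profile stored as a JSON document
        if doc:
            envs.append(doc)
    return envs


if __name__ == "__main__":
    for env in fetch_envs():
        # "codename" and "scenario" are illustrative field names.
        print(env.get("codename"), "-", str(env.get("scenario", ""))[:60])
```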
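The correlation analysis from #151 compares human ratings against model ratings on the same episodes. A minimal sketch of that kind of check is below; the CSV path, the paired `human_<dim>` / `model_<dim>` column layout, and the dimension list are assumptions for illustration rather than the repository's actual data format.

```python
# Minimal sketch: Pearson correlation between human and model eval scores.
# The CSV layout (human_<dim> / model_<dim> columns) is an assumption.
import pandas as pd
from scipy.stats import pearsonr


def correlate_human_vs_model(csv_path: str, dimensions: list[str]) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    rows = []
    for dim in dimensions:
        r, p = pearsonr(df[f"human_{dim}"], df[f"model_{dim}"])
        rows.append({"dimension": dim, "pearson_r": r, "p_value": p})
    return pd.DataFrame(rows)


if __name__ == "__main__":
    dims = ["believability", "relationship", "knowledge",
            "secret", "social_rules", "financial_benefits", "goal"]
    print(correlate_human_vs_model("human_eval_scores.csv", dims))
```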
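Similarly, the pairwise significance testing from #155 boils down to a paired t-test over per-scenario scores for every pair of models. The sketch below is a hypothetical version with made-up numbers; it assumes each model was rated on the same, aligned list of scenarios, which is what makes the paired test valid.

```python
# Minimal sketch: paired t-tests between every pair of models.
# Scores must be aligned per scenario; the example numbers are made up.
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel


def paired_t_tests(scores: dict[str, list[float]]) -> pd.DataFrame:
    rows = []
    for model_a, model_b in combinations(scores, 2):
        t, p = ttest_rel(scores[model_a], scores[model_b])
        rows.append({"model_a": model_a, "model_b": model_b,
                     "t_stat": t, "p_value": p})
    return pd.DataFrame(rows)


if __name__ == "__main__":
    example = {
        "gpt-4": [7.2, 6.8, 8.1, 7.5],
        "mistral-sft": [6.9, 6.5, 7.8, 7.1],
        "mistral-selftrain": [7.0, 6.9, 8.0, 7.4],
    }
    print(paired_t_tests(example))
```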
Closes #

Description

Checks
- `type/descript` (e.g. `feature/add-llm-agents`)

Additional Information