
Organize llm selftrain and update README #163

Merged

8 commits merged on Mar 10, 2024

Conversation

@ruiyiw (Collaborator) commented Mar 10, 2024

Closes #

📑 Description

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/description (e.g. feature/add-llm-agents)
  • Ready for code review
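The branch-naming convention in the checklist can be sketched as a simple pattern test; the set of allowed type prefixes below is an assumption for illustration, not taken from this repo's actual conventions:

```shell
# Hypothetical check that a branch name follows the type/description
# convention, e.g. feature/add-llm-agents. Allowed types are assumed.
branch="feature/add-llm-agents"
if echo "$branch" | grep -Eq '^(feature|bug|fix|docs|chore)/[a-z0-9-]+$'; then
  echo "valid"
else
  echo "invalid"
fi
```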

ℹ Additional Information

ruiyiw and others added 8 commits March 9, 2024 23:53
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use argparse rather than sys.argv

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
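The "use argparse rather than sys.argv" commit above presumably replaced positional `sys.argv` indexing with an argument parser. A minimal sketch of what such a change looks like — the script purpose and flag names here are illustrative assumptions, not taken from this repo:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical flags for a scenario-generation script; names are
    # illustrative. Passing argv=None makes parse_args read sys.argv.
    parser = argparse.ArgumentParser(description="Generate SFT scenarios")
    parser.add_argument("--num-scenarios", type=int, default=10)
    parser.add_argument("--output-dir", default="data/scenarios")
    return parser.parse_args(argv)

# Parse an explicit argument list (useful for testing without sys.argv).
args = parse_args(["--num-scenarios", "5"])
print(args.num_scenarios, args.output_dir)  # → 5 data/scenarios
```

Compared to `sys.argv` indexing, this gives type conversion, defaults, and a generated `--help` for free.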

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Organize data process

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
@ruiyiw ruiyiw merged commit 81d3ace into main Mar 10, 2024
3 checks passed
lwaekfjlk added a commit that referenced this pull request Mar 13, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize llm deploy and update README (#160)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* Support Otree-based Human Eval (#145)

* Feature/support scripts for env fetching from db (#147)

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* Bug/fix-human-eval-official-study-bug (#149)

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* Feature/Finalize Human Evaluation (#153)

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* add code for human eval plot in the paper (#156)

* Organize data process and update README (#158)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)


* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)

lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multiple choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

add run instructions for scenario generation

* add system args and rename

* use argparse rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy and update README (#160)

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
Signed-off-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)


* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>


* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


* Organize llm deploy and update README (#161)


* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)


* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


* Organize llm deploy and update README (#160)


* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)


* Organize data process


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* organize llm selftrain

* Merge main to modular-codebase (#162)


* Organize llm deploy and update README (#161)


* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md
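The paired t-test step from #155 ("paired-t-testing between all model pairs") compares two models on the same set of episodes by testing whether the mean per-episode score difference is zero. A minimal pure-Python sketch, with hypothetical model names and scores — not the repository's implementation:

```python
import math
import statistics
from itertools import combinations

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(n))

# Hypothetical per-episode goal scores for three models on the same episodes.
model_scores = {
    "gpt-4": [7.2, 6.8, 8.1, 7.5],
    "gpt-3.5": [6.1, 6.4, 7.0, 6.2],
    "mistral-sft": [6.5, 6.0, 7.4, 6.8],
}

# Every model pair, as in "paired-t-testing between all model pairs".
for m1, m2 in combinations(model_scores, 2):
    t = paired_t_statistic(model_scores[m1], model_scores[m2])
    print(f"{m1} vs {m2}: t = {t:.2f}")
```

`scipy.stats.ttest_rel` computes the same statistic along with a two-sided p-value (df = n - 1).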

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes
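For context on the #126 scripts: LoRA fine-tuning wraps the base model with a small low-rank adapter instead of updating all weights, and QLoRA does the same on top of a quantized base model. A minimal configuration sketch using Hugging Face `peft` — the hyperparameters and target modules are illustrative assumptions, not the repository's actual settings:

```python
# Illustrative LoRA adapter config (hypothetical hyperparameters).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    lora_dropout=0.05,                    # dropout on the adapter path
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    bias="none",
    task_type="CAUSAL_LM",
)
# QLoRA differs mainly in loading the base model 4-bit quantized
# (e.g. via bitsandbytes) before attaching the same kind of adapter.
```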

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning StringField

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank-you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>


* Organize llm deploy and update README (#160)


---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process


* Organize llm deploy and update README (#160)

* Organize llm selftrain

* Merge main to modular-codebase (#162)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#161)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
Signed-off-by: Haofei Yu <1125027232@qq.com>
lwaekfjlk pushed a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Organize data process

* Organize llm deploy

* Organize llm deploy and update README (#159)


* Organize llm deploy and update README (#160)

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timers in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you pages

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* choose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process


* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>
(cherry picked from commit 991f3983cf6b9355df247ba64beb5420047b6e27)
lwaekfjlk added a commit that referenced this pull request Mar 14, 2024
* Organize data generation and update README for release (#157)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* add human eval analysis for qualification test

* fix bug that in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish offcial study

* delete debug code

* change name

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support huamn eval analysis

* delete useless things

* support the full analysis code

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize data process

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Organize llm deploy and update README (#160)

* Modularize data generation (#144)

* revamped cloud sync

* finished cloud utils

* modify cloud utils and add monitor function for downloading

* added requirements

* need test

* finished training on babel

* Add selftrain scripts for deploy and eval on babel

* modularize self-train improve step

* modularize data generation

* Update README.md

adding run instruction for scenario generations

* add system args and rename

* use args parses rather than sys

* Update README.md

update arguments

* delete test functions

* move generate scenario code outside

* move file and delete useless files

* reset path

* make sure SFT scenarios have 10 agent combos

* Update README.md

reorder README

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

* Feature: Full-parameter, lora, and qlora finetune script for mistral (#126)

* added S3 script

* working version of each script

* minor changes

* minor changes

* Support Otree-based Human Eval (#145)

* add the initial version of otree-based human eval

* delete game

* add payment info

* modified instruction page

* support user ID matching

* fix deployment file code

* support timer

* add reasoning stringfield

* support queue based data

* support queue, add personal information, polish front-end

* change name of directory

* modified instruction page style

* changed position of next button

* move the next button to the middle

* debugging for reward prompt regex matching

* modify the frontend and fix queue popping time bug

* support input prolific ID

* polish frontend

* support multiple timer in one page

* polish front-end style for multi-choices

* delete profile png

* support pilot study data and format

* split pilot study and official study

* delete useless file

* add two different thank you page

* modify name in url

* ready to release pilot study

* fix same input bug

* fix frontend bugs

* add prompt for reasoning

* add timer and change time limit

* chose the debug mode

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>

* Feature/support scripts for env fetching from db (#147)

* get env from redis

* support scripts for env fetch from db

* Pick qualified annotators for human evaluation & support official human eval test (#148)

* add human eval analysis for qualification test

* fix bug in the random choice

* fix bug for incomplete filling

* clean useless elements for timer

* make debug false

* modify the pilot study payment info

* ready to publish official study

* delete debug code

* change name

* Bug/fix-human-eval-official-study-bug (#149)

* release the new official study with new data

* fix official study bug

* fix bugs

* fix bugs

* Feature/support official human eval analysis and delete sotopia_tmp file (#150)

* support pearson correlation analysis

* delete tmp file

* Add readme img (#152)

* Feature/support official human eval mean + correlation analysis (#151)

* support pearson correlation analysis

* delete tmp file

* support official distribution of human eval

* add official test analysis

* support human eval analysis

* delete useless things

* support the full analysis code

* Feature/Finalize Human Evaluation (#153)

* finalize the final round of human eval and get all results

* add all the code used for paper testing

* add all the data and clean the code

* clean the code

* delete together-ai-ft part (#154)

* Feature/support paired t test (#155)

* support t-test and fix None scenario in the final human eval data

* fully support all the paired-t-testing between all model pairs

* delete paired t test

* add code for human eval plot in the paper (#156)

* Update README.md

* Organize data process and update README (#158)

* Organize data generation and update README for release (#157)

* Organize data process

* organize llm selftrain

* Merge main to modular-codebase (#162)

* Organize llm deploy and update README (#161)

* Organize llm deploy

* Organize llm deploy and update README (#159)

* Organize llm deploy and update README (#160)

---------

Co-authored-by: Jasonqi146 <jasonqi146@gmail.com>
Co-authored-by: Wonderplex <50866817+Jasonqi146@users.noreply.github.com>
Co-authored-by: Ruiyi Wang <ruiyiwan@babel-login-2.lti.cs.cmu.edu>
Co-authored-by: Sharon Zhang <123585394+sharonwx54@users.noreply.github.com>
Co-authored-by: sharonwx54 <sharon.wx.zhang@gmail.com>
Co-authored-by: Haofei Yu <1125027232@qq.com>

---------…