# Sentence-Transformers Information Retrieval example on Chinese
Sentence Transformers is a multilingual sentence embedding framework (built on HuggingFace Transformers) that provides an easy way to compute dense vector representations for sentences and paragraphs.

This repository targets an MS MARCO-like retrieval task on a Chinese dataset: it trains a bi_encoder and a cross_encoder, and uses an edited Elasticsearch interface for pandas to build serializable results.
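Below is a minimal sketch of the two model types involved, using the sentence-transformers API: a bi_encoder encodes query and passage independently into dense vectors, while a cross_encoder scores the pair jointly. The checkpoint names are public placeholder models, not the ones trained in this repository.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder checkpoints -- substitute the Chinese models trained by this repository.
bi_encoder = SentenceTransformer("distiluse-base-multilingual-cased-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "苹果公司的创始人是谁"
passage = "苹果公司由史蒂夫·乔布斯等人创立"

# Bi-encoder: independent dense vectors, compared with cosine similarity.
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
p_emb = bi_encoder.encode(passage, convert_to_tensor=True)
print(float(util.cos_sim(q_emb, p_emb)))

# Cross-encoder: one joint forward pass over the (query, passage) pair.
print(cross_encoder.predict([(query, passage)]))
```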
- pip: `pip install -r requirements.txt`
- Install Elasticsearch and start the service (a quick connectivity check is sketched below)
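Before indexing anything, it can help to verify that the Elasticsearch service is actually reachable. The host and port below are assumptions for a default local installation.

```python
from elasticsearch import Elasticsearch

# Assumes a default local installation; adjust the URL to your setup.
es = Elasticsearch("http://localhost:9200")
print(es.ping())   # True if the service is up and reachable
print(es.info())   # cluster name, Elasticsearch version, etc.
```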
1. Download the data from Google Drive
4. Prepare cross_encoder training data (see the training-data sketch after this list)
5. Prepare cross_encoder validation data
7. Show bi_encoder and cross_encoder inference (see the retrieve-and-rerank sketch after this list)
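For steps 4 and 5, cross_encoder data boils down to (question, passage, label) triples. The sketch below shows the generic sentence-transformers pattern for feeding such triples to a CrossEncoder; the base model, example triples, and hyperparameters are placeholders rather than this repository's actual configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Hypothetical (question, passage, label) triples: 1 = relevant, 0 = not relevant.
train_samples = [
    InputExample(texts=["苹果公司的创始人是谁", "苹果公司由史蒂夫·乔布斯等人创立"], label=1),
    InputExample(texts=["苹果公司的创始人是谁", "北京是中国的首都"], label=0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

# Placeholder Chinese base model; num_labels=1 yields a single relevance score.
model = CrossEncoder("bert-base-chinese", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=1)
```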
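For step 7, inference typically follows the retrieve-and-rerank pattern: the bi_encoder retrieves top-k candidates by embedding similarity and the cross_encoder rescores them. The sketch below uses a toy in-memory corpus and placeholder checkpoint paths instead of this repository's scripts.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Placeholder paths -- point these at the trained bi_encoder / cross_encoder.
bi_encoder = SentenceTransformer("output/bi_encoder")
cross_encoder = CrossEncoder("output/cross_encoder")

corpus = ["苹果公司由史蒂夫·乔布斯等人创立", "北京是中国的首都", "长城是著名的古代建筑"]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "苹果公司的创始人是谁"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: bi_encoder retrieval of top-k candidates by cosine similarity.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Stage 2: cross_encoder reranking of the retrieved candidates.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for hit, score in sorted(zip(hits, scores), key=lambda x: x[1], reverse=True):
    print(corpus[hit["corpus_id"]], float(score))
```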
* 1 This repository uses an edited es-pandas interface (with support for serialized vectors) to manipulate Elasticsearch simply through pandas (see the first sketch after this list).
* 2 try_sbert_neg_sampler.py samples hard negatives using classes derived from https://guzpenha.github.io/transformer_rankers/; Elasticsearch can also be used to generate hard negatives, and the related functions are defined in valid_cross_encoder_on_bi_encoder.py (see the second sketch after this list).
* 3 Before training the cross_encoder on your dataset, take a look at the semantic similarity between different questions; combining samples with similar semantics may help (see the third sketch after this list).
* 4 Adds a toolkit to SBERT to support multi-class evaluation (results returned as a dictionary).
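For note 1, the sketch below shows the general shape of writing a pandas DataFrame that carries embedding vectors to Elasticsearch and reading it back. It uses the upstream es_pandas package and JSON-serializes the vector column by hand; the repository's edited interface handles serialization internally, and the host, index name, and exact es_pandas call signatures may differ across versions.

```python
import json
import numpy as np
import pandas as pd
from es_pandas import es_pandas

# Assumed local Elasticsearch; adjust host/port as needed.
ep = es_pandas("http://localhost:9200")

df = pd.DataFrame({
    "question": ["问题一", "问题二"],
    "embedding": [np.random.rand(8).tolist(), np.random.rand(8).tolist()],
})

# Serialize the vector column to a JSON string so it round-trips through Elasticsearch.
df["embedding"] = df["embedding"].apply(json.dumps)

ep.to_es(df, "demo_questions")         # write the DataFrame to an index
back = ep.to_pandas("demo_questions")  # read it back as a DataFrame
back["embedding"] = back["embedding"].apply(json.loads)
print(back.head())
```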
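For note 2, a common way to mine hard negatives with Elasticsearch is to run a BM25 query for each question and keep highly ranked passages that are not the labelled positive. This is a generic illustration, not the code in valid_cross_encoder_on_bi_encoder.py; the index name, field name, and host are assumptions, and the call style targets elasticsearch-py 8.x.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def mine_hard_negatives(question, positive_passage, index="passages", k=10):
    """Return BM25-ranked passages that are lexically close to the question
    but are not the labelled positive: candidates for hard negatives."""
    resp = es.search(index=index, query={"match": {"text": question}}, size=k)
    return [
        hit["_source"]["text"]
        for hit in resp["hits"]["hits"]
        if hit["_source"]["text"] != positive_passage
    ]

print(mine_hard_negatives("苹果公司的创始人是谁", "苹果公司由史蒂夫·乔布斯等人创立"))
```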
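For note 3, one quick way to look at the semantic similarity between questions is to embed them and inspect the pairwise cosine-similarity matrix; pairs above some threshold are candidates for merging. The model name and threshold below are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")  # placeholder checkpoint

questions = ["苹果公司是谁创立的", "谁创办了苹果公司", "北京的人口有多少"]
emb = model.encode(questions, convert_to_tensor=True)

# Pairwise cosine similarity between all questions.
sim = util.cos_sim(emb, emb)

# Report question pairs above an (assumed) merge threshold.
threshold = 0.8
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if sim[i][j] > threshold:
            print(questions[i], "<->", questions[j], float(sim[i][j]))
```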
## Contributing

## License

Distributed under the MIT License. See `LICENSE` for more information.
## Contact

svjack - svjackbt@gmail.com

Project Link: https://github.com/svjack/Sbert-ChineseExample