Sbert-ChineseExample

Sentence-Transformers Information Retrieval example on Chinese


中文介绍 (Chinese Introduction)

Table of Contents

About The Project

Sentence Transformers is a multilingual sentence-embedding framework that provides an easy way to compute dense vector representations for sentences and paragraphs (based on HuggingFace Transformers).

This repository targets an MS MARCO-style information retrieval task on a Chinese dataset: it trains a bi_encoder and a cross_encoder, and uses an easy Elasticsearch-on-pandas interface to build serializable results.

Built With

Getting Started

Installation

  • pip
pip install -r requirements.txt
  • Install Elasticsearch and start the service
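The installation steps above can be sketched as a setup script. The Elasticsearch version and download URL below are assumptions; pick whichever release matches your elasticsearch Python client.

```shell
# Install the Python dependencies
pip install -r requirements.txt

# Download, unpack, and start a local Elasticsearch service
# (7.10.2 is an assumed version; adjust to match your client library)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
./elasticsearch-7.10.2/bin/elasticsearch -d

# Verify the service is answering on the default port
curl http://localhost:9200
```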

Usage

1. Download the data from Google Drive

2. Prepare bi_encoder training data

3. Train the bi_encoder

4. Prepare cross_encoder training data

5. Prepare cross_encoder validation data

6. Train the cross_encoder

7. Run the bi_encoder and cross_encoder inference demo
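The inference step (7) follows the standard retrieve-then-rerank pattern: the bi_encoder cheaply retrieves candidate passages by embedding similarity, and the cross_encoder re-scores each (query, passage) pair. A minimal sketch of that flow is below; the `output/...` checkpoint paths are placeholders, not the repository's actual output locations.

```python
import numpy as np

def retrieve(bi_encoder, query, corpus, top_k=10):
    """Stage 1 (bi_encoder): embed query and corpus, return top-k by cosine similarity."""
    corpus_emb = np.asarray(bi_encoder.encode(corpus), dtype=float)
    query_emb = np.asarray(bi_encoder.encode([query]), dtype=float)[0]
    sims = corpus_emb @ query_emb / (
        np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    return [corpus[i] for i in np.argsort(-sims)[:top_k]]

def rerank(cross_encoder, query, candidates):
    """Stage 2 (cross_encoder): re-score each (query, passage) pair, best first."""
    scores = cross_encoder.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]

if __name__ == "__main__":
    # Placeholder checkpoint paths; substitute the ones your training runs produced.
    from sentence_transformers import SentenceTransformer, CrossEncoder
    bi = SentenceTransformer("output/bi_encoder")
    ce = CrossEncoder("output/cross_encoder")
    corpus = ["北京是中国的首都", "巴黎是法国的首都"]
    query = "中国的首都是哪里"
    print(rerank(ce, query, retrieve(bi, query, corpus, top_k=2)))
```

In a production setup the corpus embeddings would be computed once and stored (e.g. in Elasticsearch, as this repository does) rather than re-encoded per query.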

Roadmap


* 1 This repository uses an edited es-pandas interface (with support for serialized vectors) to allow simple manipulation of Elasticsearch through pandas.
* 2 try_sbert_neg_sampler.py samples hard negatives, derived from a class provided by https://guzpenha.github.io/transformer_rankers/. Elasticsearch can also be used to generate hard samples; the related functions are defined in valid_cross_encoder_on_bi_encoder.py.
* 3 Before training a cross_encoder on your dataset, take a look at the semantic similarity between different questions. Combining samples with similar semantics may help.
* 4 Add a toolkit to Sbert to support multi-class evaluation (as a dictionary).

Contributing

License

Distributed under the MIT License. See LICENSE for more information.

Contact

svjack - svjackbt@gmail.com

Project Link: https://github.com/svjack/Sbert-ChineseExample

Acknowledgements