Commit b978ff8

Add support for ViTMatte models (#448)
* Add support for `VitMatte` models
* Add `VitMatteImageProcessor`
* Add `VitMatteImageProcessor` unit test
* Fix typo
* Add example code for `VitMatteForImageMatting`
* Fix JSDoc
* Fix typo
1 parent 80d22da commit b978ff8
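
Taken together, the new pieces are used as in the JSDoc example this commit adds to `src/models.js` (shown in the diff below). Condensed, the flow looks like this, using the `Xenova/vitmatte-small-distinctions-646` checkpoint referenced in that example:

```javascript
import { AutoProcessor, VitMatteForImageMatting, RawImage } from '@xenova/transformers';

// VitMatteImageProcessor (loaded via AutoProcessor) packs the image and trimap into model inputs.
const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
const model = await VitMatteForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');

const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');

// The model returns an ImageMattingOutput; `alphas` holds the predicted alpha matte tensor.
const { alphas } = await model(await processor(image, trimap));
console.log(alphas.dims); // [ 1, 1, 640, 960 ] for this example image
```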

6 files changed: +202 −29 lines changed

README.md

+1

@@ -330,6 +330,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

docs/snippets/6_supported-models.snippet

+1

@@ -66,6 +66,7 @@
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

scripts/supported_models.py

+9

@@ -775,6 +775,15 @@
             'google/vit-base-patch16-224',
         ],
     },
+    'vitmatte': {
+        # Image matting
+        'image-matting': [
+            'hustvl/vitmatte-small-distinctions-646',
+            'hustvl/vitmatte-base-distinctions-646',
+            'hustvl/vitmatte-small-composition-1k',
+            'hustvl/vitmatte-base-composition-1k',
+        ],
+    },
     'wav2vec2': {
         # Feature extraction # NOTE: requires --task feature-extraction
         'feature-extraction': [

src/models.js

+87 −1
@@ -3441,6 +3441,74 @@ export class ViTForImageClassification extends ViTPreTrainedModel {
 }
 //////////////////////////////////////////////////

+//////////////////////////////////////////////////
+export class VitMattePreTrainedModel extends PreTrainedModel { }
+
+/**
+ * ViTMatte framework leveraging any vision backbone e.g. for ADE20k, CityScapes.
+ *
+ * **Example:** Perform image matting with a `VitMatteForImageMatting` model.
+ * ```javascript
+ * import { AutoProcessor, VitMatteForImageMatting, RawImage } from '@xenova/transformers';
+ *
+ * // Load processor and model
+ * const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
+ * const model = await VitMatteForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');
+ *
+ * // Load image and trimap
+ * const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
+ * const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');
+ *
+ * // Prepare image + trimap for the model
+ * const inputs = await processor(image, trimap);
+ *
+ * // Predict alpha matte
+ * const { alphas } = await model(inputs);
+ * // Tensor {
+ * //   dims: [ 1, 1, 640, 960 ],
+ * //   type: 'float32',
+ * //   size: 614400,
+ * //   data: Float32Array(614400) [ 0.9894027709960938, 0.9970508813858032, ... ]
+ * // }
+ * ```
+ *
+ * You can visualize the alpha matte as follows:
+ * ```javascript
+ * import { Tensor, cat } from '@xenova/transformers';
+ *
+ * // Visualize predicted alpha matte
+ * const imageTensor = new Tensor(
+ *     'uint8',
+ *     new Uint8Array(image.data),
+ *     [image.height, image.width, image.channels]
+ * ).transpose(2, 0, 1);
+ *
+ * // Convert float (0-1) alpha matte to uint8 (0-255)
+ * const alphaChannel = alphas
+ *     .squeeze(0)
+ *     .mul_(255)
+ *     .clamp_(0, 255)
+ *     .round_()
+ *     .to('uint8');
+ *
+ * // Concatenate original image with predicted alpha
+ * const imageData = cat([imageTensor, alphaChannel], 0);
+ *
+ * // Save output image
+ * const outputImage = RawImage.fromTensor(imageData);
+ * outputImage.save('output.png');
+ * ```
+ */
+export class VitMatteForImageMatting extends VitMattePreTrainedModel {
+    /**
+     * @param {any} model_inputs
+     */
+    async _call(model_inputs) {
+        return new ImageMattingOutput(await super._call(model_inputs));
+    }
+}
+//////////////////////////////////////////////////
+
 //////////////////////////////////////////////////
 export class MobileViTPreTrainedModel extends PreTrainedModel { }
 export class MobileViTModel extends MobileViTPreTrainedModel { }
@@ -4827,7 +4895,9 @@ const MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = new Map([
     ['audio-spectrogram-transformer', ['ASTForAudioClassification', ASTForAudioClassification]],
 ]);

-
+const MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES = new Map([
+    ['vitmatte', ['VitMatteForImageMatting', VitMatteForImageMatting]],
+]);

 const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([
     ['swin2sr', ['Swin2SRForImageSuperResolution', Swin2SRForImageSuperResolution]],
@@ -4853,6 +4923,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
     [MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, MODEL_TYPES.Vision2Seq],
     [MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
+    [MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
@@ -5058,6 +5129,10 @@ export class AutoModelForDocumentQuestionAnswering extends PretrainedMixin {
     static MODEL_CLASS_MAPPINGS = [MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES];
 }

+export class AutoModelForImageMatting extends PretrainedMixin {
+    static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES];
+}
+
 export class AutoModelForImageToImage extends PretrainedMixin {
     static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES];
 }
@@ -5177,3 +5252,14 @@ export class CausalLMOutputWithPast extends ModelOutput {
         this.past_key_values = past_key_values;
     }
 }
+
+export class ImageMattingOutput extends ModelOutput {
+    /**
+     * @param {Object} output The output of the model.
+     * @param {Tensor} output.alphas Estimated alpha values, of shape `(batch_size, num_channels, height, width)`.
+     */
+    constructor({ alphas }) {
+        super();
+        this.alphas = alphas;
+    }
+}
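
Because the commit also wires `vitmatte` into `MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES` and adds the `AutoModelForImageMatting` mixin, the same model can be loaded through the auto class. A minimal sketch (not part of the diff), assuming `AutoModelForImageMatting` is re-exported from `@xenova/transformers` like the other auto classes:

```javascript
import { AutoProcessor, AutoModelForImageMatting, RawImage } from '@xenova/transformers';

// The 'vitmatte' model type resolves to VitMatteForImageMatting via the new image-matting mapping.
const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
const model = await AutoModelForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');

// ViTMatte takes an RGB image plus a trimap marking known foreground/background/unknown regions.
const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');
const inputs = await processor(image, trimap);

// The result is an ImageMattingOutput whose `alphas` tensor has shape
// (batch_size, num_channels, height, width), as documented above.
const { alphas } = await model(inputs);
```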
