MUSE: Multilingual Unsupervised and Supervised Embeddings

Dec. 24, 2017, 11:04 a.m. By: Kirti Bakshi


MUSE is a Python library that is meant for multilingual word embeddings, and whose goal is to provide the community with:

  • state-of-the-art multilingual word embeddings that are based on fastText

  • large-scale high-quality bilingual dictionaries for the purpose of training and evaluation

Going in-depth, MUSE is a Python library used to align embedding spaces in either a supervised or an unsupervised way. The supervised method uses a bilingual dictionary or identical character strings as supervision, while the unsupervised approach does not use any parallel data. Instead, it builds a bilingual dictionary between two languages by aligning their monolingual word embedding spaces in an unsupervised way.

The library has now been open-sourced by Facebook and, as mentioned before, MUSE provides state-of-the-art multilingual word embeddings for over 30 languages, based on fastText.

fastText is a library for efficient learning of word representations and sentence classification. It can be used to train word embeddings with word2vec-style models, CBOW (Continuous Bag of Words) or skip-gram, and the resulting embeddings can then be used for text classification.
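
As an illustration, here is a minimal sketch of training skip-gram embeddings with the fastText Python bindings; the corpus path and hyperparameters are assumptions for illustration, not values prescribed by MUSE:

    # Minimal sketch: train skip-gram word vectors with the fastText
    # Python bindings (pip install fasttext). The corpus path and the
    # hyperparameters are illustrative assumptions.
    import fasttext

    # corpus.txt: plain text, one sentence per line.
    model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)

    vector = model.get_word_vector("hello")  # one 300-dimensional embedding
    model.save_model("my_embeddings.bin")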

The MUSE library provides you with the following:

1. Get evaluation datasets:

Get cross-lingual as well as monolingual evaluation datasets for word embeddings:

  • The 110 bilingual dictionaries

  • 28 monolingual word similarity tasks for 6 languages, plus the English word analogy task

  • Cross-lingual word similarity tasks from SemEval2017

  • Sentence translation retrieval with Europarl corpora

2. Get monolingual word embeddings:

For pre-trained monolingual word embeddings, it is highly recommended to use the fastText Wikipedia embeddings, or to use fastText to train your own word embeddings on your corpus.
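
The fastText Wikipedia embeddings are distributed as plain-text .vec files: a header line with the vocabulary size and dimension, then one word and its vector per line. A minimal NumPy loader, assuming that standard format, might look like this:

    # Minimal sketch: load a fastText .vec file (text format) into NumPy,
    # assuming the layout "vocab_size dim" on the first line, then
    # "word v1 v2 ... vd" on each following line.
    import numpy as np

    def load_vec(path, max_vocab=200000):
        words, vectors = [], []
        with open(path, encoding="utf-8") as f:
            vocab_size, dim = map(int, f.readline().split())
            for i, line in enumerate(f):
                if i >= max_vocab:
                    break
                word, values = line.rstrip().split(" ", 1)
                words.append(word)
                vectors.append(np.array(values.split(), dtype=np.float32))
        return words, np.vstack(vectors)

    words, embeddings = load_vec("wiki.en.vec")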

3. Alignment of monolingual word embeddings:

This project includes two ways to obtain cross-lingual word embeddings:

  • Supervised: using a training bilingual dictionary (or identical character strings) as anchor points, learn a mapping from the source to the target space with (iterative) Procrustes alignment; a minimal sketch of the Procrustes step follows this list.

  • Unsupervised: without any parallel data or anchor points, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.
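
To make the Procrustes step concrete, here is a minimal NumPy sketch: given matrices of source and target vectors for the anchor pairs, the best orthogonal mapping comes from a singular value decomposition. The array names and sizes are assumptions for illustration; in the iterative variant, this step alternates with inducing a new dictionary from the current alignment.

    # Minimal sketch: one orthogonal Procrustes step with NumPy.
    # X holds source-language vectors and Y the corresponding
    # target-language vectors for the anchor pairs (both n x d).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5000, 300
    X = rng.standard_normal((n, d))  # placeholder source anchors
    Y = rng.standard_normal((n, d))  # placeholder target anchors

    # W = U V^T, where U S V^T = SVD(X^T Y), minimises ||X W - Y||_F
    # over all orthogonal matrices W.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    # Map the whole source embedding space into the target space.
    X_aligned = X @ W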

4. Evaluate monolingual or cross-lingual embeddings (CPU|GPU):

The library also includes a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks.
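
As an illustration of one such task, word translation retrieval, here is a minimal NumPy sketch of precision@1 under cosine nearest-neighbour search (the variable names are assumptions; MUSE's own evaluation also offers other retrieval criteria):

    # Minimal sketch: precision@1 for word translation with cosine
    # nearest neighbours. src_emb / tgt_emb are aligned embedding
    # matrices; pairs maps a source row index to its gold target row.
    import numpy as np

    def precision_at_1(src_emb, tgt_emb, pairs):
        # L2-normalise so that a dot product equals cosine similarity.
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        correct = 0
        for src_idx, gold_tgt_idx in pairs:
            scores = tgt @ src[src_idx]  # cosine with every target word
            if int(np.argmax(scores)) == gold_tgt_idx:
                correct += 1
        return correct / len(pairs)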

5. Ground-truth bilingual dictionaries:

110 large-scale ground-truth bilingual dictionaries were created with an internal translation tool. The dictionaries handle the polysemy of words well. They come with a train/test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs.
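
These dictionaries are plain-text files with one translation pair per line, the source word followed by the target word. A minimal loader, assuming that format (the file name below is only an example):

    # Minimal sketch: read a bilingual dictionary file, assuming one
    # whitespace-separated "source target" pair per line. The file name
    # is illustrative.
    pairs = []
    with open("en-es.5000-6500.txt", encoding="utf-8") as f:
        for line in f:
            src_word, tgt_word = line.rstrip().split()
            pairs.append((src_word, tgt_word))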

The main goal is to ease the development as well as the evaluation of cross-lingual word embeddings and multilingual NLP (Natural Language Processing).

Dependencies:

  • Python 2/3 with NumPy/SciPy

  • PyTorch

  • Faiss (recommended) for fast nearest neighbour search (CPU or GPU).

The library is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users (though Faiss-GPU will greatly speed up the nearest neighbour search) and is highly recommended for CPU users. For more information, one can go through the link mentioned below.
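
To show where Faiss fits, here is a minimal sketch of an exact inner-product nearest-neighbour search; the shapes and variable names are assumed for illustration:

    # Minimal sketch: exact nearest-neighbour search with Faiss over
    # 32-bit float embeddings. Shapes and variable names are assumed.
    import faiss
    import numpy as np

    d = 300
    tgt = np.random.rand(200000, d).astype("float32")  # target embeddings
    src = np.random.rand(10, d).astype("float32")      # query embeddings

    index = faiss.IndexFlatIP(d)  # exact inner-product (dot) search
    index.add(tgt)                # index every target vector
    scores, neighbours = index.search(src, 10)  # top-10 per query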

More Information: GitHub