T-Vecs¶
Prerequisites¶
Python 2.7 setup and installed
Pip setup and installed
Ensure all dependencies of requirements.txt are satisfied
Download nltk_data using nltk.download() -> only tokenizers required
Download corpus and extract in specified directory
Setup Development Environment¶
git clone https://github.com/KshitijKarthick/tvecs.git
cd tvecs
pip install -r requirements.txt
# Only Model needs to be downloaded and extracted in the t-vex directory
Install as a Package¶
# Install package
pip install git+https://github.com/KshitijKarthick/tvecs.git
# Usage from cmd line without recommendations menu
tvecs -c ./config.json
# Usage from cmd line with recommendations menu
tvecs -c ./config.json -r
# Usage without config file, with models, without recommendations menu
tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models
# Usage without config file, with models, with recommendations menu
tvecs -r -l1 english -l2 hindi -m1 ./data/models/t-vex-english-models -m2 ./data/models/t-vex-hindi-models
# Usage from inside python as a library
import tvecs.vector_space_mapper.vector_space_mapper as vm
Data¶
Corpus Download details¶
We are focusing on [English, Hindi] other possible prospects we could look into Kannada, Tamil languages
- Sources
Emille Corpora http://www.emille.lancs.ac.uk/
Leipzig Corpora http://corpora.uni-leipzig.de/
Bilingual Dictionary details¶
Provided in the repository, data/bilingual_dictionary. Compiled using the following sources.
- Credits
Shabdakosh http://www.shabdkosh.com/content/category/downloads/
Dicts Corpora http://dicts.info/dictlist1.php?l=Hindi
Evaluation Dataset details¶
Human relatedness judgement score datasets provided in data/evaluate
- Credits
wordsim_relatedness_goldstandard
MEN_dataset_natural_form_full
Mturk_287
Mturk_771
Ensure Model is downloaded and extracted in the t-vex directory¶
data/corpus -> corpus
data/models -> models
Usage Details¶
T-Vecs Driver Module Cmd Line Args¶
$ python -m tvecs --help
usage: __main__.py [-h] [-v] [-s] [-i ITER] [-m1 MODEL1] [-m2 MODEL2]
[-l1 LANGUAGE1] [-l2 LANGUAGE2] [-c CONFIG]
[-b BILINGUAL_DICT] [-r]
Script used to generate models
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-s, --silent silence all logging
-i ITER, --iter ITER number of Word2Vec iterations
-m1 MODEL1, --model1 MODEL1
pre-computed model file path
-m2 MODEL2, --model2 MODEL2
pre-computed model file path
-l1 LANGUAGE1, --language1 LANGUAGE1
language name of model 1/ text 1
-l2 LANGUAGE2, --l2 LANGUAGE2
language name of model 2/ text 2
-c CONFIG, --config CONFIG
config file path
-b BILINGUAL_DICT, --bilingual BILINGUAL_DICT
bilingual dictionary path
-r, --recommendations
provide recommendations
Config File Format¶
See config.json in the repository for example.
Execution & Building¶
# Preprocessing, Model Generation, Bilingual Generation, Vector Space Mapping between two languages english hindi from the corpus using the config file
python -im tvecs -c config.json
# [ utilise the dictionary tvex_calls which contains results of every step performed ]
# Bilingual generation, Vector space mapping between two languages english hindi providing the models
python -im tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train
python -im tvecs -c config.json
# [ utilise the dictionary tvex_calls which contains results of every step performed ]
Obtain Recommendations¶
# Provide Recommendations using config file
python -m tvecs -c ./config.json -r
# Provide Recommendations using cmd line params
python2 -m tvecs -l1 english -l2 hindi -m1 ./data/models/t-vex-english-model -m2 ./data/models/t-vex-hindi-model -b ./data/bilingual_dictionary/english_hindi_train_bd -r
# Output for recommendations
Enter your Choice:
1> Recommendation
2> Exit
Choice: 1
Enter word in Language english: examination
Word => Score
जाँच => 0.643208742142
नियुक्ति => 0.640852451324
जांच => 0.638412773609
अध्ययन => 0.638307392597
विवेचना => 0.638229370117
मंत्रणा => 0.634038448334
पुनर्मूल्यांकन => 0.627283990383
अध्ययन => 0.624040842056
निरीक्षण => 0.623490035534
जाच => 0.619904220104
Visualisation of vector space¶
python -m tvecs.visualization.server
[ Open browser to localhost:5000 for visualization ]
[ Ensure model generation is completed before running visualization ]
Execution of Individual Modules¶
# bilingual dictionary generation -> clustering vectors from trained model
python -m tvecs.bilingual_generator.clustering
# model generation
python -m tvecs.model_generator.model_generation
# vector space mapping [ utilise the object vm to obtain recommendations
python -m tvecs.vector_space_mapper.vector_space_mapper
Execution of Unit Tests¶
# Run all unit tests
py.test
# Run individual module tests seperately
py.test tests/test_emille_preprocessor.py
py.test tests/test_leipzig_preprocessor.py
py.test tests/test_hccorpus_preprocessor.py
Generate Documentation¶
# Generate HTML Documentation
make html
cd documentation/html && python -m SimpleHTTPServer
# [ Open browser to localhost:8000 for visualization ]
# Generate Man Pages
make man
cd documentation/man && man -l tvecs.1
# Other Makefile options
make
Please use `make <target>' where <target> is one of
html to make standalone HTML files
dirhtml to make HTML files named index.html in directories
singlehtml to make a single large HTML file
pickle to make pickle files
json to make JSON files
htmlhelp to make HTML files and a HTML help project
qthelp to make HTML files and a qthelp project
applehelp to make an Apple Help Book
devhelp to make HTML files and a Devhelp project
epub to make an epub
epub3 to make an epub3
latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
latexpdf to make LaTeX files and run them through pdflatex
latexpdfja to make LaTeX files and run them through platex/dvipdfmx
text to make text files
man to make manual pages
texinfo to make Texinfo files
info to make Texinfo files and run them through makeinfo
gettext to make PO message catalogs
changes to make an overview of all changed/added/deprecated items
xml to make Docutils-native XML files
pseudoxml to make pseudoxml-XML files for display purposes
linkcheck to check all external links for integrity
doctest to run all doctests embedded in the documentation (if enabled)
coverage to run coverage check of the documentation (if enabled)