tvecs.preprocessor package

Submodules

tvecs.preprocessor.base_preprocessor module

Module used to specify the abstract Preprocessor class.

  • BasePreprocessor is an Abstract Base Class with basic abstract preprocessor functionality.

class tvecs.preprocessor.base_preprocessor.BasePreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, limit=None)[source]

Bases: object

Abstract Base Class with basic preprocessor functionality.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

Private Methods
abstract _extract_corpus_data(data)[source]

Extract valid content from the Corpus.

  • Executed only if need_preprocessing is set to True

abstract _clean_word(word)[source]

Called on each individual word after a sentence has been tokenized into words.

  • Called by __iter__(), which returns a list of words.

abstract _tokenize_sentences(data)[source]

Function to tokenize corpus data into sentences.

abstract _tokenize_words(sentence)[source]

Function to tokenize sentences into words.

get_preprocessed_text(limit=None)[source]

Generator that yields a preprocessed list of tokenized words on every call.

  • Read the sentence-tokenized intermediate preprocessed file.

  • Tokenize and preprocess words; return the list of words from a sentence.
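
To show how the abstract hooks fit together, here is a minimal subclassing sketch. The class name MyCorpusPreprocessor and its trivial method bodies are hypothetical; the shipped subclasses documented below implement these hooks with NLTK tokenizers and real cleaning rules.

    from tvecs.preprocessor.base_preprocessor import BasePreprocessor


    class MyCorpusPreprocessor(BasePreprocessor):
        """Illustrative subclass; not part of the package."""

        def _extract_corpus_data(self, data):
            # Keep the raw file content; a real corpus would strip markup here.
            return data

        def _tokenize_sentences(self, data):
            # Naive split on full stops; the shipped subclasses use NLTK.
            return data.split('.')

        def _tokenize_words(self, sentence):
            return sentence.split()

        def _clean_word(self, word):
            return word.strip().lower()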

tvecs.preprocessor.emille_preprocessor module

EMILLE Corpus Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.emille_preprocessor.EmilleCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', language='english', need_preprocessing=False, limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Emille Corpus Preprocessor which preprocesses the EMILLE Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Extract contents of the ‘p’ tags which contain the body.

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Sentence-tokenize the corpus.

  • Sentence-tokenize the corpus using NLTK.

  • Remove punctuation [ except spaces ] from each individual sentence.

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
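
A hedged usage sketch for this class follows; the corpus filename and directory are hypothetical, and need_preprocessing=True is assumed so that the ‘p’ tag contents are first extracted into an intermediate file.

    from tvecs.preprocessor.emille_preprocessor import EmilleCorpusPreprocessor

    # Hypothetical corpus file; need_preprocessing=True triggers extraction of
    # the 'p' tag contents into an intermediate preprocessed file.
    preprocessor = EmilleCorpusPreprocessor(
        corpus_fname='emille_sample.xml',
        corpus_dir_path='data/corpus',
        language='english',
        need_preprocessing=True
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)  # list of cleaned word tokens for one sentence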

tvecs.preprocessor.hccorpus_preprocessor module

HC Corpus Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.hccorpus_preprocessor.HcCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Hc-Corpus Preprocessor which preprocesses the Hc-Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Extract the 4th column of the corpus, which contains the body.

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove apostrophes [‘s, s’].

  • Convert to lowercase.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Sentence-tokenize the corpus.

  • Sentence-tokenize the corpus using NLTK.

  • Remove punctuation [ except spaces ] from each individual sentence.

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
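
A hedged usage sketch follows; file names are hypothetical, and limit is used to cap the number of tokenized words produced.

    from tvecs.preprocessor.hccorpus_preprocessor import HcCorpusPreprocessor

    # Hypothetical corpus file; need_preprocessing=True extracts the 4th
    # column (the article body) before sentence tokenization.
    preprocessor = HcCorpusPreprocessor(
        corpus_fname='hindi.corpus',
        corpus_dir_path='data/corpus',
        language='hindi',
        need_preprocessing=True,
        limit=100000
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)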

tvecs.preprocessor.leipzig_preprocessor module

Leipzig Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.leipzig_preprocessor.LeipzigPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Leipzig Preprocessor which preprocesses the Leipzig-Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Function not utilised for Leipzig Corpus.

  • Executed only if need_preprocessing is set to True

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove apostrophes [‘s, s’].

  • Convert to lowercase.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Function to tokenize corpus data into sentences.

  • Function not utilised for Leipzig Corpus

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
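
A hedged usage sketch follows; the filename is hypothetical, and the corpus is assumed to already contain one sentence per line, so need_preprocessing stays False and the unused hooks above are never invoked.

    from tvecs.preprocessor.leipzig_preprocessor import LeipzigPreprocessor

    # Hypothetical corpus file with one sentence per line, so no intermediate
    # extraction step is required.
    preprocessor = LeipzigPreprocessor(
        corpus_fname='leipzig_sentences.txt',
        corpus_dir_path='data/corpus',
        language='english',
        need_preprocessing=False
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)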

tvecs.preprocessor.yandex_api module

Utilise the Yandex Translation Service.

  • Obtain a bilingual semantic human score.

tvecs.preprocessor.yandex_api.get_translation(word, from_to)[source]

Obtain translation of specified word from Yandex.

API Documentation:

  • word (String): Word to be translated.

  • from_to (String): Language-code pair representing the source and target languages.

  • Returns (String): The translated word.

tvecs.preprocessor.yandex_api.get_valid_translation(word, from_to)[source]

Ensure the translation is valid.

Return only single-word translations. If the translation contains multiple words, return None.

API Documentation:

  • word (String): Word to be translated.

  • from_to (String): Language-code pair representing the source and target languages.

  • Returns (String): The translated word.
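
A hedged usage sketch for the two functions above follows; it assumes a Yandex API key is already configured for this module and that 'en-hi' is an accepted source-target language-pair code.

    from tvecs.preprocessor import yandex_api

    # Raw translation (may contain several words).
    raw = yandex_api.get_translation('river', 'en-hi')

    # Validated translation: None unless the result is a single word.
    valid = yandex_api.get_valid_translation('river', 'en-hi')
    if valid is not None:
        print(valid)
    else:
        print('Translation was not a single word; skipping.')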

tvecs.preprocessor.yandex_api.yandex_api(lang_translate, input_score_path, output_score_path)[source]

Utilise the Yandex Translation Service to obtain a bilingual semantic human score.

  • The WordSim score file is translated on one column using Yandex.

  • A Yandex API key and the language pair for translation need to be provided (see the example call below).
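
An example call is sketched below with hypothetical file paths; the input file is assumed to be a WordSim-style score file whose translated version is written to the output path, and a Yandex API key must be available.

    from tvecs.preprocessor import yandex_api

    # Hypothetical paths; 'en-hi' is the assumed language pair for translation.
    yandex_api.yandex_api(
        lang_translate='en-hi',
        input_score_path='wordsim_scores.txt',
        output_score_path='wordsim_scores_translated.txt'
    )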

Module contents