tvecs.preprocessor package

Submodules

tvecs.preprocessor.base_preprocessor module

Module used to specify the abstract Preprocessor class.

  • BasePreprocessor is an Abstract Base Class with basic abstract preprocessor functionality.

class tvecs.preprocessor.base_preprocessor.BasePreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, limit=None)[source]

Bases: object

Abstract Base Class with basic preprocessor functionality.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

Private Methods
abstract _extract_corpus_data(data)[source]

Extract valid content from the Corpus.

  • Executed only if need_preprocessing is set to True

abstract _clean_word(word)[source]

Called on each individual word after a sentence has been tokenized into words.

  • Called by __iter__(), which returns a list of words.

abstract _tokenize_sentences(data)[source]

Function to tokenize corpus data into sentences.

abstract _tokenize_words(sentence)[source]

Function to tokenize sentences into words.

get_preprocessed_text(limit=None)[source]

Generator that yields a preprocessed list of tokenized words on every call.

  • Read the sentence-tokenized intermediate preprocessed file.

  • Tokenize and preprocess words; return the list of words from a sentence.
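
To show how the abstract hooks fit together, here is a minimal subclassing sketch. The class name MyCorpusPreprocessor and its trivial method bodies are hypothetical; the shipped subclasses documented below implement these hooks with NLTK tokenizers and real cleaning rules.

    from tvecs.preprocessor.base_preprocessor import BasePreprocessor


    class MyCorpusPreprocessor(BasePreprocessor):
        """Illustrative subclass; not part of the package."""

        def _extract_corpus_data(self, data):
            # Keep the raw file content; a real corpus would strip markup here.
            return data

        def _tokenize_sentences(self, data):
            # Naive split on full stops; the shipped subclasses use NLTK.
            return data.split('.')

        def _tokenize_words(self, sentence):
            return sentence.split()

        def _clean_word(self, word):
            return word.strip().lower()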

tvecs.preprocessor.emille_preprocessor module

EMILLE Corpus Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.emille_preprocessor.EmilleCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', language='english', need_preprocessing=False, limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Emille Corpus Preprocessor which preprocesses the EMILLE Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Extract contents of the ‘p’ tags which contain the body.

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Sentence-tokenize the corpus.

  • Sentence-tokenize the corpus using NLTK.

  • Remove punctuation [ except spaces ] from each individual sentence.

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
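
A hedged usage sketch for this class follows; the corpus filename and directory are hypothetical, and need_preprocessing=True is assumed so that the ‘p’ tag contents are first extracted into an intermediate file.

    from tvecs.preprocessor.emille_preprocessor import EmilleCorpusPreprocessor

    # Hypothetical corpus file; need_preprocessing=True triggers extraction of
    # the 'p' tag contents into an intermediate preprocessed file.
    preprocessor = EmilleCorpusPreprocessor(
        corpus_fname='emille_sample.xml',
        corpus_dir_path='data/corpus',
        language='english',
        need_preprocessing=True
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)  # list of cleaned word tokens for one sentence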

tvecs.preprocessor.hccorpus_preprocessor module

HC Corpus Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.hccorpus_preprocessor.HcCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Hc-Corpus Preprocessor which preprocesses the Hc-Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Extract the 4th column of the corpus, which contains the body.

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove apostrophes [‘s, s’].

  • Convert to lowercase.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Sentence-tokenize the corpus.

  • Sentence-tokenize the corpus using NLTK.

  • Remove punctuation [ except spaces ] from each individual sentence.

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
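
A hedged usage sketch follows; file names are hypothetical, and limit is used to cap the number of tokenized words produced.

    from tvecs.preprocessor.hccorpus_preprocessor import HcCorpusPreprocessor

    # Hypothetical corpus file; need_preprocessing=True extracts the 4th
    # column (the article body) before sentence tokenization.
    preprocessor = HcCorpusPreprocessor(
        corpus_fname='hindi.corpus',
        corpus_dir_path='data/corpus',
        language='hindi',
        need_preprocessing=True,
        limit=100000
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)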

tvecs.preprocessor.leipzig_preprocessor module

Leipzig Preprocessor which inherits from BasePreprocessor.

class tvecs.preprocessor.leipzig_preprocessor.LeipzigPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)[source]

Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor

Leipzig Preprocessor which preprocesses the Leipzig-Corpus.

API Documentation:

  • corpus_fname (String): Corpus filename to be preprocessed.

  • corpus_dir_path (String): Corpus directory path [ Default: current directory ].

  • encoding (String): Encoding format of the corpus [ Default: utf-8 ].

  • language (String): Language of the model constructed [ Default: english ].

  • limit (Integer): Number of tokenized words to limit the corpus to [ Default: None ].

  • need_preprocessing (Boolean): Preprocess the corpus so that only the valid content from the file is written to an intermediate file [ False: the corpus already has each sentence on a separate line ].

Private Methods
_extract_corpus_data(data)[source]

Function not utilised for Leipzig Corpus.

  • Executed only if need_preprocessing is set to True

_clean_word(word)[source]

Preprocess words after tokenizing words from sentences.

  • Remove apostrophes [‘s, s’].

  • Convert to lowercase.

  • Remove punctuation.

  • Remove English words from non-English corpus data.

_tokenize_sentences(data)[source]

Function to tokenize corpus data into sentences.

  • Function not utilised for Leipzig Corpus

See also

  • nltk.tokenizers

_tokenize_words(sentence)[source]

Tokenize Words from sentences.
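
A hedged usage sketch follows; the filename is hypothetical, and the corpus is assumed to already contain one sentence per line, so need_preprocessing stays False and the unused hooks above are never invoked.

    from tvecs.preprocessor.leipzig_preprocessor import LeipzigPreprocessor

    # Hypothetical corpus file with one sentence per line, so no intermediate
    # extraction step is required.
    preprocessor = LeipzigPreprocessor(
        corpus_fname='leipzig_sentences.txt',
        corpus_dir_path='data/corpus',
        language='english',
        need_preprocessing=False
    )
    for words in preprocessor.get_preprocessed_text():
        print(words)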

tvecs.preprocessor.yandex_api module

Utilise the Yandex Translation Service.

  • Obtain a bilingual semantic human score.

tvecs.preprocessor.yandex_api.get_translation(word, from_to)[source]

Obtain translation of specified word from Yandex.

API Documentation:

  • word (String): Word to be translated.

  • from_to (String): Language-code pair representing the source and target languages.

  • Returns (String): The translated word.

tvecs.preprocessor.yandex_api.get_valid_translation(word, from_to)[source]

Ensure the translation is valid.

Return only single-word translations. If the translation contains multiple words, return None.

API Documentation:

  • word (String): Word to be translated.

  • from_to (String): Language-code pair representing the source and target languages.

  • Returns (String): The translated word.
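
A hedged usage sketch for the two functions above follows; it assumes a Yandex API key is already configured for this module and that 'en-hi' is an accepted source-target language-pair code.

    from tvecs.preprocessor import yandex_api

    # Raw translation (may contain several words).
    raw = yandex_api.get_translation('river', 'en-hi')

    # Validated translation: None unless the result is a single word.
    valid = yandex_api.get_valid_translation('river', 'en-hi')
    if valid is not None:
        print(valid)
    else:
        print('Translation was not a single word; skipping.')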

tvecs.preprocessor.yandex_api.yandex_api(lang_translate, input_score_path, output_score_path)[source]

Utilise the Yandex Translation Service to obtain a bilingual semantic human score.

  • The WordSim score file is translated on one column using Yandex.

  • A Yandex API key and the language pair for translation need to be provided (see the example call below).
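
An example call is sketched below with hypothetical file paths; the input file is assumed to be a WordSim-style score file whose translated version is written to the output path, and a Yandex API key must be available.

    from tvecs.preprocessor import yandex_api

    # Hypothetical paths; 'en-hi' is the assumed language pair for translation.
    yandex_api.yandex_api(
        lang_translate='en-hi',
        input_score_path='wordsim_scores.txt',
        output_score_path='wordsim_scores_translated.txt'
    )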

Module contents