tvecs.preprocessor package
Submodules
tvecs.preprocessor.base_preprocessor module
Module used to specify abstract Preprocessor Class.
- BasePreprocessor is an Abstract Base Class
with basic abstract preprocessor functionality.
- class tvecs.preprocessor.base_preprocessor.BasePreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, limit=None)
Bases: object
Abstract Base Class with basic preprocessor functionality.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ Default False - corpus already has each sentence on a separate line ]
- param limit
Number of tokenized words to be limited to [ Default None ]
- type limit
Integer
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type need_preprocessing
Boolean
- Private Methods
- abstract _extract_corpus_data(data)
Extract valid content from the corpus.
Executed only if need_preprocessing is set to True.
-
abstract
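
The abstract-base-class arrangement described above can be sketched without the tvecs source. The class names, the in-memory `lines` argument and the `tokens` method below are hypothetical and exist only for illustration; the sketch shows an abstract `_extract_corpus_data` hook that runs only when `need_preprocessing` is True, and a `limit` that caps the number of tokens, with None meaning no cap.

```python
from abc import ABC, abstractmethod
from itertools import islice


class SketchBasePreprocessor(ABC):
    """Hypothetical stand-in for an abstract preprocessor; not the tvecs source."""

    def __init__(self, lines, need_preprocessing=False, limit=None):
        # Assumption: the corpus is passed as in-memory lines instead of a file.
        self.lines = list(lines)
        self.need_preprocessing = need_preprocessing
        self.limit = limit

    @abstractmethod
    def _extract_corpus_data(self, data):
        """Return only the valid content of one raw line."""

    def tokens(self):
        """Return an iterator over whitespace-separated tokens, capped at `limit`."""
        def generate():
            for line in self.lines:
                # The extraction hook runs only when need_preprocessing is True.
                if self.need_preprocessing:
                    line = self._extract_corpus_data(line)
                yield from line.split()
        # islice(..., None) imposes no cap, which matches limit=None.
        return islice(generate(), self.limit)


class TagStrippingPreprocessor(SketchBasePreprocessor):
    """Hypothetical subclass that drops everything up to the first '>'."""

    def _extract_corpus_data(self, data):
        return data.split(">", 1)[-1]


raw = ["<s>hello world", "<s>foo bar baz"]
preprocessor = TagStrippingPreprocessor(raw, need_preprocessing=True, limit=4)
print(list(preprocessor.tokens()))
```

Because `_extract_corpus_data` is abstract, instantiating `SketchBasePreprocessor` directly raises a TypeError; only subclasses that implement the hook can be created.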
tvecs.preprocessor.emille_preprocessor module
EMILLE Corpus Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.emille_preprocessor.EmilleCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', language='english', need_preprocessing=False, limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Emille Corpus Preprocessor which preprocesses the EMILLE Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Number of tokenized words to be limited to [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ Default False - corpus already has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove punctuation.
Remove English words from non-English corpus data.
tvecs.preprocessor.hccorpus_preprocessor module
HC Corpus Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.hccorpus_preprocessor.HcCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Hc-Corpus Preprocessor which preprocesses the Hc-Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Number of tokenized words to be limited to [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ Default False - corpus already has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove apostrophes ['s, s'].
Convert to lowercase.
Remove punctuation.
Remove English words from non-English corpus data.
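
The four cleaning steps listed for _clean_word can be illustrated with a small standalone function. This is a hedged sketch, not the tvecs implementation: the function name `clean_word_sketch`, the punctuation set, the choice to return None for dropped words, and the heuristic that treats a purely ASCII alphabetic token as an English word are all assumptions made for the example.

```python
import re
import string

# Assumption: ASCII punctuation plus typographic quotes; the real set may differ.
PUNCTUATION = string.punctuation + "\u2018\u2019\u201c\u201d"
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)

# Matches a trailing 's, or an apostrophe that follows a final s (as in dogs').
POSSESSIVE = re.compile(r"['\u2019]s$|(?<=s)['\u2019]$", re.IGNORECASE)


def clean_word_sketch(word, language="english"):
    """Hypothetical sketch of the documented cleaning steps, in the listed order."""
    # 1. Remove possessive apostrophes.
    word = POSSESSIVE.sub("", word)
    # 2. Convert to lowercase.
    word = word.lower()
    # 3. Remove punctuation.
    word = word.translate(STRIP_TABLE)
    # 4. In a non-English corpus, drop tokens made only of ASCII letters
    #    (assumed heuristic for detecting English words).
    if language != "english" and word.isascii() and word.isalpha():
        return None
    # Assumption: a word that is empty after cleaning is reported as None.
    return word or None


print(clean_word_sketch("Dog's"))
print(clean_word_sketch("hello", language="hindi"))
```

Returning None lets a caller filter dropped words with a simple `is not None` check; a real preprocessor might instead return an empty string.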
tvecs.preprocessor.leipzig_preprocessor module
Leipzig Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.leipzig_preprocessor.LeipzigPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Leipzig Preprocessor which preprocesses the Leipzig Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Number of tokenized words to be limited to [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ Default False - corpus already has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
- _extract_corpus_data(data)
Function not utilised for Leipzig Corpus.
Executed only if need_preprocessing is set to True.
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove apostrophes ['s, s'].
Convert to lowercase.
Remove punctuation.
Remove English words from non-English corpus data.
tvecs.preprocessor.yandex_api module
Utilise Yandex Translation Service.
Obtain bilingual semantic human score.
- tvecs.preprocessor.yandex_api.get_translation(word, from_to)
Obtain translation of specified word from Yandex.
- API Documentation
- param word
word to be translated
- param from_to
language codes pair representing the src/target lang
- type from_to
String
- type word
String
- return
translated word
- rtype
String
- tvecs.preprocessor.yandex_api.get_valid_translation(word, from_to)
Ensure the translation is valid.
Return only single-word translations. If the translation contains multiple words, return None.
- API Documentation
- param word
word to be translated
- param from_to
language codes pair representing the src/target lang
- type from_to
String
- type word
String
- return
translated word, or None if the translation contains multiple words
- rtype
String or None
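
The single-word rule of get_valid_translation can be sketched offline. The example below does not call Yandex: `valid_translation_sketch` and `fake_translate` are hypothetical names, the translator is injected as a callable so the filter can be exercised without network access, and the "en-fr" pair format is an assumption about how from_to is written.

```python
def valid_translation_sketch(word, from_to, translate):
    """Hypothetical sketch: keep a translation only if it is a single word.

    `translate` is an injected callable standing in for a Yandex-backed
    lookup; it takes (word, from_to) and returns a string or None.
    """
    translation = translate(word, from_to)
    if not translation:
        return None
    # Zero or several whitespace-separated parts means the result is rejected.
    if len(translation.split()) != 1:
        return None
    return translation.strip()


def fake_translate(word, from_to):
    """Offline stub with a tiny lookup table, used only to exercise the filter."""
    table = {
        ("hello", "en-fr"): "bonjour",
        ("goodbye", "en-fr"): "au revoir",
    }
    return table.get((word, from_to))


print(valid_translation_sketch("hello", "en-fr", fake_translate))
print(valid_translation_sketch("goodbye", "en-fr", fake_translate))
```

Injecting the translator keeps the filtering logic testable on its own; in practice the callable would wrap whatever service call the module actually makes.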