tvecs.preprocessor package
Submodules
tvecs.preprocessor.base_preprocessor module
Module that defines the abstract preprocessor class.
- BasePreprocessor is an abstract base class that provides basic preprocessor functionality.
- class tvecs.preprocessor.base_preprocessor.BasePreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, limit=None)
Bases: object
Abstract Base Class with basic preprocessor functionality.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ False - Corpus has each sentence on a separate line ]
- param limit
Limit on the number of tokenized words [ Default None ]
- type limit
Integer
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type need_preprocessing
Boolean
- Private Methods
- abstract _extract_corpus_data(data)
Extract valid content from the corpus.
Executed only if need_preprocessing is set to True.
-
abstract
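The abstract-hook pattern described above can be sketched as follows. This is a minimal illustration, not the tvecs implementation: `SketchBasePreprocessor`, `TaggedCorpusPreprocessor`, and the `prepare` helper are hypothetical names, and only the constructor arguments and the `_extract_corpus_data` hook come from the documentation.

```python
from abc import ABC, abstractmethod


class SketchBasePreprocessor(ABC):
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    def __init__(self, corpus_fname, corpus_dir_path='.', encoding='utf-8',
                 need_preprocessing=False, limit=None):
        self.corpus_fname = corpus_fname
        self.corpus_dir_path = corpus_dir_path
        self.encoding = encoding
        self.need_preprocessing = need_preprocessing
        self.limit = limit

    @abstractmethod
    def _extract_corpus_data(self, data):
        """Return only the valid content of the raw corpus text."""

    def prepare(self, data):
        # Hypothetical helper: the extraction hook runs only when
        # need_preprocessing is True, as the documentation states.
        if self.need_preprocessing:
            return self._extract_corpus_data(data)
        return data


class TaggedCorpusPreprocessor(SketchBasePreprocessor):
    """Hypothetical subclass: keeps lines that do not start with '<'."""

    def _extract_corpus_data(self, data):
        kept = [line for line in data.splitlines() if not line.startswith('<')]
        return '\n'.join(kept)
```

Because the hook is abstract, the base class cannot be instantiated directly; each corpus-specific subclass has to supply its own extraction rule.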
tvecs.preprocessor.emille_preprocessor module
EMILLE Corpus Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.emille_preprocessor.EmilleCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', language='english', need_preprocessing=False, limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Emille Corpus Preprocessor which preprocesses the EMILLE Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Limit on the number of tokenized words [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ False - Corpus has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
-
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove punctuation.
Remove English words from non-English corpus data.
-
tvecs.preprocessor.hccorpus_preprocessor module
HC Corpus Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.hccorpus_preprocessor.HcCorpusPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Hc-Corpus Preprocessor which preprocesses the Hc-Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Limit on the number of tokenized words [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ False - Corpus has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
-
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove apostrophes [‘s, s’].
Convert to lowercase.
Remove punctuation.
Remove English words from non-English corpus data.
-
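The four cleaning steps listed for `_clean_word` can be sketched as below. This is an illustration under stated assumptions, not the library's code: `clean_word_sketch`, the tiny `ENGLISH_WORDS` set, and the choice to return an empty string for a dropped word are all hypothetical, since the documentation does not say how English words are detected or what is returned for them.

```python
import string

# Assumption: a tiny stand-in vocabulary; the real English-word check
# is not described in the documentation.
ENGLISH_WORDS = {"the", "school", "and"}

# Straight and curly apostrophes.
APOSTROPHES = "'\u2018\u2019"


def clean_word_sketch(word, language="english"):
    """Apply the documented cleaning steps to a single token."""
    # 1. Remove apostrophes: "dog's" -> "dog", "dogs'" -> "dogs".
    if len(word) > 2 and word[-2] in APOSTROPHES and word[-1] in "sS":
        word = word[:-2]
    word = word.rstrip(APOSTROPHES)
    # 2. Convert to lowercase.
    word = word.lower()
    # 3. Remove punctuation (ASCII punctuation plus curly apostrophes).
    word = "".join(
        ch for ch in word
        if ch not in string.punctuation and ch not in APOSTROPHES
    )
    # 4. Drop English words when the corpus language is not English.
    #    Returning "" for a dropped word is an assumption of this sketch.
    if language != "english" and word in ENGLISH_WORDS:
        return ""
    return word
```

The order matters in this sketch: the possessive ending is stripped before the general punctuation pass, otherwise "dog's" would collapse to "dogs".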
tvecs.preprocessor.leipzig_preprocessor module
Leipzig Preprocessor which inherits from BasePreprocessor.
- class tvecs.preprocessor.leipzig_preprocessor.LeipzigPreprocessor(corpus_fname, corpus_dir_path='.', encoding='utf-8', need_preprocessing=False, language='english', limit=None)
Bases: tvecs.preprocessor.base_preprocessor.BasePreprocessor
Leipzig Preprocessor which preprocesses the Leipzig-Corpus.
- API Documentation:
- param corpus_fname
Corpus Filename to be preprocessed
- param corpus_dir_path
Corpus Directory Path [ Default Current Directory ]
- param encoding
Encoding format of the corpus [ Default utf-8 ]
- param language
Language of the model constructed [ Default English ]
- param limit
Limit on the number of tokenized words [ Default None ]
- param need_preprocessing
Preprocess the corpus to extract only the valid content from the file into an intermediate file [ False - Corpus has each sentence on a separate line ]
- type corpus_fname
String
- type corpus_dir_path
String
- type encoding
String
- type language
String
- type limit
Integer
- type need_preprocessing
Boolean
- Private Methods
- _extract_corpus_data(data)
Not used for the Leipzig Corpus.
Executed only if need_preprocessing is set to True.
- _clean_word(word)
Preprocess words after tokenizing words from sentences.
Remove apostrophes [‘s, s’].
Convert to lowercase.
Remove punctuation.
Remove English words from non-English corpus data.
-
tvecs.preprocessor.yandex_api module
Utilise Yandex Translation Service.
Obtain bilingual semantic human score.
- tvecs.preprocessor.yandex_api.get_translation(word, from_to)
Obtain the translation of the specified word from Yandex.
- API Documentation
- param word
Word to be translated
- param from_to
Language-code pair representing the source and target languages
- type from_to
String
- type word
String
- return
translated word
- rtype
String
- tvecs.preprocessor.yandex_api.get_valid_translation(word, from_to)
Ensure the translation is valid.
Return only single-word translations. If the translation contains multiple words, return None.
- API Documentation
- param word
Word to be translated
- param from_to
Language-code pair representing the source and target languages
- type from_to
String
- type word
String
- return
Translated word, or None if the translation contains multiple words
- rtype
String or None
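The single-word validity rule described for `get_valid_translation` can be sketched offline as below. The names `valid_translation_sketch` and `fake_translate` are hypothetical, the `translate` callable stands in for the real translation request, and the `"en-fr"` form of the language-code pair is an assumption made for the example.

```python
def valid_translation_sketch(word, from_to, translate):
    """Return a single-word translation, or None otherwise.

    `translate` is a hypothetical callable taking (word, from_to) and
    returning a string or None; it replaces the real network request.
    """
    translation = translate(word, from_to)
    if not translation:
        return None
    parts = translation.split()
    # Accept the result only when it is exactly one word.
    if len(parts) != 1:
        return None
    return parts[0]


def fake_translate(word, from_to):
    # Offline stand-in with a made-up lookup table, for illustration only.
    table = {
        ("hello", "en-fr"): "bonjour",
        ("thanks", "en-fr"): "merci beaucoup",
    }
    return table.get((word, from_to))
```

Passing the translation function as an argument keeps the validity check testable without network access or an API key.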