ktrain: A Low-Code Library for Augmented Machine Learning

Authors: Arun S. Maiya

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We present ktrain, a low-code Python library that makes machine learning more accessible and easier to apply. To illustrate ease of use, we provide a complete example for text classification. More specifically, we train a Chinese-language sentiment analyzer on a dataset of hotel reviews. Fine-Tuning a BERT Text Classifier for Chinese:

import ktrain
from ktrain import text as txt

# STEP 1: load and preprocess data
trn, val, preproc = txt.texts_from_folder('ChnSentiCorp', maxlen=75, preprocess_mode='bert')

# STEP 2: load model and wrap in Learner
model = txt.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val)

# STEP 3: estimate learning rate
learner.lr_find(show_plot=True)

# STEP 4: train model
learner.fit_onecycle(2e-5, 4)

Table 1 compares ktrain to popular low-code and AutoML libraries in their out-of-the-box support for a variety of machine learning tasks.
Researcher Affiliation Industry Arun S. Maiya EMAIL Institute for Defense Analyses Alexandria, VA, USA
Pseudocode No The paper includes Python code examples demonstrating the library's use, such as 'Fine-Tuning a BERT Text Classifier for Chinese:' and 'Building an End-to-End Open-Domain QA System in ktrain'. These are actual code blocks, not pseudocode or algorithm blocks. The step descriptions (e.g., 'STEP 1: Load and Preprocess Data') are in natural-language prose.
Open Source Code Yes ktrain is open-source, free to use under a permissive Apache license, and available on GitHub at: https://github.com/amaiya/ktrain.
Open Datasets Yes More specifically, we train a Chinese-language sentiment-analyzer on a dataset of hotel reviews.2 (Footnote 2: https://github.com/Tony607/Chinese_sentiment_analysis) using the well-studied 20 Newsgroups dataset.3 (Footnote 3: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups)
Dataset Splits No The paper mentions loading training and validation data (e.g., 'trn, val, preproc = txt.texts_from_folder(...)') for the Chinese sentiment analysis example and loading documents into a list ('docs') for the 20 Newsgroups QA system. However, it does not specify the exact percentages, sample counts, or a detailed methodology for how these datasets were split into training, validation, or test sets.
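For reference, the kind of hold-out split methodology the paper leaves unspecified can be sketched in a few lines of plain Python. This is an illustrative sketch only; the 90/10 ratio, the seed, and the helper name are assumptions, not values or code from the paper or from ktrain:

```python
import random

def holdout_split(items, val_fraction=0.1, seed=42):
    """Shuffle items reproducibly, then carve off a validation slice.

    Illustrative only: ratio and seed are arbitrary choices, not the
    paper's; a reproducible report would state both explicitly.
    """
    rng = random.Random(seed)           # fixed seed -> deterministic split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, val)

train, val = holdout_split(range(1000))
```

Stating the fraction and seed like this is what would let readers reconstruct the exact train/validation partition.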
Hardware Specification No The paper generally states that 'fast models such as fastText... and NBSVM... are amenable to being trained on a standard laptop CPU.' This is a general statement about capability, not a specification of the hardware used for the experiments described in the paper. No specific CPU models, GPU models, or other detailed hardware configurations are provided.
Software Dependencies No The paper mentions several software components like 'Python library', 'TensorFlow', 'transformers', 'scikit-learn', and 'stellargraph', and provides Python code examples that import 'ktrain'. However, it does not provide specific version numbers for any of these software dependencies, which are necessary for reproducible descriptions.
Experiment Setup Yes The example 'Fine-Tuning a BERT Text Classifier for Chinese:' includes the line 'learner.fit_onecycle(2e-5, 4)', which explicitly provides the maximum learning rate (2e-5) and the number of epochs (4) for the training process.
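The fit_onecycle call follows the 1cycle policy, which ramps the learning rate up to the specified maximum and back down over the course of training. A minimal sketch of a triangular 1cycle-style schedule in pure Python (the function name and exact cycle shape are illustrative assumptions; ktrain/Keras implement their own variant internally):

```python
def one_cycle_lr(max_lr, total_steps):
    """Illustrative triangular schedule: linearly ramp the learning rate
    up to max_lr over the first half of training, then back down toward
    zero over the second half. A sketch only, not ktrain's internals."""
    half = total_steps / 2.0
    lrs = []
    for step in range(total_steps):
        if step < half:
            lrs.append(max_lr * (step / half))          # warm-up phase
        else:
            lrs.append(max_lr * (1 - (step - half) / half))  # cool-down phase
    return lrs

# e.g., a peak learning rate of 2e-5, as in the paper's example call
schedule = one_cycle_lr(2e-5, 100)
```

The single peak value (here 2e-5) is what fit_onecycle's first argument controls; the second argument sets how many epochs the cycle spans.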