Regression vs. Classification
Regression
Regression is a type of supervised learning task used to predict continuous numerical values based on input features. In regression, the goal is to find a mathematical relationship between the independent variables (features) and the dependent variable (target). This relationship is represented by a regression function, which can be linear or nonlinear, depending on the complexity of the data. Regression models aim to minimize the difference between predicted and actual values, and their quality is typically measured using metrics such as Mean Squared Error (MSE) and R-squared (R²).
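As an illustration of the two metrics named above, the following minimal sketch computes MSE and R-squared in plain Python. The function names and the sample values are made up for this example and are not part of Xdeep.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """One minus the ratio of the residual sum of squares to the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]

mse = mean_squared_error(actual, predicted)  # 0.375
r2 = r_squared(actual, predicted)            # 0.925
print(mse, r2)
```

A lower MSE indicates predictions that are closer to the actual values, while an R-squared closer to 1 indicates that the model explains more of the variance in the target.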
Classification
Classification is another type of supervised learning task, used to categorize input data into predefined categories (classes). Unlike regression, where the output is a continuous value, classification assigns discrete labels to data points based on their features. The goal of classification is to build a model that can accurately predict the class label of unseen data points. Classification problems are further distinguished by the number of classes involved. The two major cases are binary and multi-class classification, in which the model aims, respectively, at:
The assignment of a data point to one of two possible classes (binary classification)
The assignment of a data point to one of three or more possible classes (multi-class classification)
Evaluation of classification models involves metrics such as accuracy, precision, recall, and F1 score.
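The metrics listed above can be sketched for the binary case as follows. This is a minimal plain-Python illustration; the function name, the choice of 1 as the positive class, and the sample labels are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: 1 is the positive class, -1 the negative class.
labels = [1, 1, 1, -1, -1, -1, 1, -1]
preds = [1, 1, -1, -1, -1, 1, 1, -1]
metrics = binary_metrics(labels, preds)
print(metrics)
```

Accuracy counts all correct predictions, while precision and recall focus on the positive class: precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found.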
How training data are structured in Xdeep
Xdeep can be used to produce models for both regression and classification problems. There are, however, some differences in the ways you have to prepare your data before submitting them for training a regression or a classification model.
Regression
To submit your data for training a regression model, make sure that:
Data are provided as a CSV file, with each row corresponding to a data point.
All features have numeric values.
The columns corresponding to features are contiguous (adjacent to each other) within the CSV file. You can select the start and end index of the feature columns.
There must be a single target variable in the CSV file. You can again define the target column index in your dataset within the Xdeep app.
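Before uploading, the requirements above can be checked locally. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the zero-based inclusive column indices, the header row, and the requirement that the target is numeric are assumptions of this example; the indexing convention used by the Xdeep app may differ.

```python
import csv
import io


def check_regression_rows(csv_text, feature_start, feature_end, target_index, has_header=True):
    """Return a list of problems found; an empty list means all checks passed.

    Assumes zero-based, inclusive column indices (an assumption of this sketch).
    """
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]

    # The single target column must not fall inside the contiguous feature range.
    if feature_start <= target_index <= feature_end:
        problems.append("target column lies inside the feature column range")

    needed = list(range(feature_start, feature_end + 1)) + [target_index]
    first_line = 2 if has_header else 1
    for line_no, row in enumerate(rows, start=first_line):
        for col in needed:
            if col >= len(row):
                problems.append(f"line {line_no}: column {col} is missing")
                continue
            try:
                float(row[col])  # every feature and the target must parse as a number
            except ValueError:
                problems.append(f"line {line_no}: column {col} is not numeric: {row[col]!r}")
    return problems


# Illustrative data: three feature columns (0 to 2) and the target in column 3.
sample = "x1,x2,x3,y\n0.1,1.5,3,10.2\n0.4,abc,2,11.0\n"
issues = check_regression_rows(sample, feature_start=0, feature_end=2, target_index=3)
print(issues)
```

In this sample, the second data row contains a non-numeric feature value, so the check reports one problem for line 3 of the file.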
Binary Classification
To submit your data for training a binary classification model, you should follow the same data structuring guidelines as the regression case. Additionally, you have to associate your class labels with two numeric values with different signs. For example, if you want to define a problem for classifying between classes A and B, you can represent class A as -1 in the target variable column, and class B as 1.
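The label encoding described above can be sketched as follows. The function name and the sample labels are hypothetical; only the mapping of the two classes to -1 and 1 follows the example in the text.

```python
def encode_binary_labels(labels, negative_class, positive_class):
    """Map two class labels to -1 and 1 for the target variable column."""
    mapping = {negative_class: -1, positive_class: 1}
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        # A binary problem must contain exactly the two declared classes.
        raise ValueError(f"unexpected class labels: {unknown}")
    return [mapping[label] for label in labels]


# Illustrative labels: class A becomes -1, class B becomes 1.
encoded = encode_binary_labels(["A", "B", "B", "A"], negative_class="A", positive_class="B")
print(encoded)  # [-1, 1, 1, -1]
```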
Multi-class Classification
Multi-class classification problems in Xdeep are treated as a transformation of a regression problem, where each class of the target classification scheme is associated with a range of values. In simple terms, you should represent your classes as ranges in a value space. For example, if the classification set contains classes A, B and C, you can represent class A with the value -2, class B with the value 0, and class C with the value 2. Following the guidelines for regression problems, you can then train your model for a regression problem with target values ranging in [-2, 2]. When subsequently running the trained model, it is only a matter of assigning the returned value to the class with the closest corresponding value. For example, if the model returns 1.23 for an input data point, the data point is assigned to class C. Similarly, if the returned value is -0.18, the data point is assigned to class B.
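The nearest-value assignment can be sketched in a few lines of Python. The class values are the ones from the example in the text; the names CLASS_VALUES and decode_output are hypothetical, and the tie-breaking behavior is a choice of this sketch rather than documented Xdeep behavior.

```python
# Class-to-value mapping taken from the A, B, C example.
CLASS_VALUES = {"A": -2.0, "B": 0.0, "C": 2.0}


def decode_output(value, class_values=CLASS_VALUES):
    """Return the class whose assigned value is closest to the model output.

    If the output is exactly halfway between two class values, the class
    listed first in the mapping is returned (an arbitrary choice of this sketch).
    """
    return min(class_values, key=lambda label: abs(class_values[label] - value))


print(decode_output(1.23))   # C
print(decode_output(-0.18))  # B
```

Spacing the class values evenly, as in this example, gives each class a value range of equal width around its assigned value.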
Single vs Multiple output models
Choosing to train a model to predict multiple target variables or to train separate models for each target variable depends on various factors such as the relationships between the target variables, the complexity of the problem, computational resources, and the specific goals of the analysis or application.
In broad terms, multiple output models are apt in cases where computational resources are a major consideration and when the target variables are demonstrably highly correlated or interdependent. However, training separate models for each target variable bears significant benefits in most cases. Individual models can outperform a single model trained on all variables when the relationships between predictors and targets are different for each variable, or when some variables are more predictable than others. Training a single model for multiple target variables can increase the complexity of the model and the dimensionality of the problem. Finally, separate models for individual target variables can be easier to interpret and understand, especially if each target variable corresponds to a distinct aspect of the problem domain.
Xdeep adopts the single output approach for its training models to effectively support most use cases and take full advantage of the underlying computational infrastructure, as individual models can be trained in parallel.
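If your dataset contains several target variables, one way to follow the single output approach is to split it into one dataset per target before training. The following is a minimal sketch; the function name, the column layout (features first, then targets), and the sample values are assumptions made for this example.

```python
def split_by_target(rows, feature_count, target_names):
    """Split rows of [features..., targets...] into one dataset per target.

    Each resulting dataset keeps all feature columns and exactly one target column.
    """
    datasets = {}
    for offset, name in enumerate(target_names):
        datasets[name] = [
            row[:feature_count] + [row[feature_count + offset]] for row in rows
        ]
    return datasets


# Illustrative rows: two features followed by two targets (y1, y2).
rows = [
    [0.1, 1.5, 10.2, 0.7],
    [0.4, 2.0, 11.0, 0.9],
]
per_target = split_by_target(rows, feature_count=2, target_names=["y1", "y2"])
print(per_target["y1"])  # [[0.1, 1.5, 10.2], [0.4, 2.0, 11.0]]
print(per_target["y2"])  # [[0.1, 1.5, 0.7], [0.4, 2.0, 0.9]]
```

Each of the resulting datasets has a single target column and can be saved as its own CSV file and trained as an independent model.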