Logistic Regression in R


By Nishank Biswas                                                                                                      

Logistic Regression is a supervised classification algorithm that combines the machinery of the classical linear regression model with the Bernoulli probability distribution. The essence of classical regression is retained as the means of extracting the raw information in the data and mapping it towards the classes. The classification is made between two classes, positive and negative, represented numerically as 1 and 0 respectively. It is customary to denote the class of interest as the positive class, and that convention is followed throughout this section, although the opposite choice is equally valid. Logistic regression departs from classical regression in that it passes the regression output through an activation function. The activation function generally used for this purpose is the sigmoid, or logistic, function, which can be written as

g(z) = \frac{1}{1 + e^{-z}}

A useful property of this function is that it converges asymptotically to 1 as z moves towards the positive axis and to 0 towards the negative axis, and at the origin it outputs a value of 0.5. If we consider a logistic regression model

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

where \theta is the parameter vector and x is the feature vector of an observation, then, under the convention adopted above, h_\theta(x) denotes the probability that the observation belongs to the positive class, parameterised by \theta. If y represents the class as 1 or 0, then mathematically

P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)
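These two building blocks are easy to express directly in R. The sketch below is purely illustrative (the names sigmoid and hypothesis are not from any package, and X is assumed to be a numeric matrix that already contains a column of 1s for the intercept):

# Sigmoid (logistic) activation function
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

# Hypothesis h_theta(x): probability of the positive class for each row of X
hypothesis <- function(theta, X) {
  sigmoid(X %*% theta)
}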

In order to make the classification, a threshold value of h_\theta(x) has to be selected, above which the observation is assigned to the positive class and below which to the negative class. It is important to note that the selection of the threshold is completely domain specific; in other words, it depends on the objective of the classification and can be varied to increase the robustness of the decision. For a particular value of the threshold there exists a unique value of z, obtained by inverting the activation function, and from this value of z together with the estimated parameter vector \theta we can define the decision boundary between the two classes, whose equation is given by

\theta^T x = z

An observation is then classified as belonging to the positive class if

\theta^T x \geq z

and to the negative class otherwise.
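Since the inverse of the sigmoid is the logit, z = \ln(p / (1 - p)), a chosen probability threshold maps directly to a value of z. A quick illustrative check in R:

# The logit maps a probability threshold back to the corresponding z
logit <- function(p) log(p / (1 - p))

logit(0.5)   # 0     -> with threshold 0.5 the boundary is theta' x = 0
logit(0.8)   # 1.386 -> a stricter threshold shifts the boundary outwards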

Although the shape of the boundary depends on the functional form used, i.e. linear, quadratic or higher-order polynomial, the properties of the decision boundary are controlled solely by the estimated parameters \theta. It therefore becomes crucial to obtain the best values of the vector \theta. One could inherit the approach of the classical linear regression model and use the squared error of the model as the indicator of performance, generally known as the cost function. Inheriting it unchanged, however, induces the problem of non-convexity, as a consequence of which convergence of the final result to the global minimum cannot be assured. To counter this problem the functional form of the cost function is modified: a more robust cost function is derived from the principle of maximum likelihood estimation, which also solves the problem of non-convexity. The cost function is given by

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
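Expressed in R, on top of the illustrative hypothesis sketched earlier, the cost could look as follows (again a sketch, with X a design matrix containing an intercept column and y a vector of 0/1 labels):

# Cross-entropy cost J(theta) for logistic regression
cost <- function(theta, X, y) {
  h <- 1 / (1 + exp(-(X %*% theta)))
  -mean(y * log(h) + (1 - y) * log(1 - h))
}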

 

The heart of the algorithm lies in finding the set of values of \theta that minimises the cost function obtained above for a given set of data. The estimated vector \theta then represents the framework for classifying new observations into the predefined classes. Although many algorithms exist for obtaining a convincing set of \theta, one of the most elementary is the Gradient Descent algorithm. Its working principle is to learn from the result of its previous step, scaled by a factor known as the learning rate, which controls the trade-off between learning time and precision of convergence towards the global minimum. After random initialisation of the vector \theta, the update can be represented mathematically as

repeat until convergence {

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

}

where j iterates over all the parameters and the update is applied simultaneously to all values of \theta until convergence is observed. The term multiplied by the learning rate \alpha is the gradient of the cost function with respect to the corresponding parameter, and can be derived as

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
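Putting these pieces together, a bare-bones batch gradient descent loop might look like the following in R. This is only a sketch under the same assumptions as before (X with an intercept column, y a 0/1 vector); alpha, max_iter and tol are arbitrary illustrative choices:

# Batch gradient descent for the logistic regression cost (illustrative)
gradient_descent <- function(X, y, alpha = 0.1, max_iter = 10000, tol = 1e-8) {
  theta <- rep(0, ncol(X))                      # initialise the parameter vector
  prev_cost <- Inf
  for (i in 1:max_iter) {
    h     <- 1 / (1 + exp(-(X %*% theta)))      # current predictions h_theta(x)
    grad  <- t(X) %*% (h - y) / nrow(X)         # gradient of J(theta)
    theta <- theta - alpha * as.vector(grad)    # simultaneous update of all theta_j
    cur_cost <- -mean(y * log(h) + (1 - y) * log(1 - h))
    if (abs(prev_cost - cur_cost) < tol) break  # stop once J(theta) stops decreasing
    prev_cost <- cur_cost
  }
  theta
}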

One method of ensuring convergence is to track the value of J(\theta) for each set of \theta obtained after every iteration; the stopping criterion is then the point beyond which no significant decrease in J(\theta) is observed. Being an elementary method, gradient descent has some drawbacks. The first concerns the selection of the learning rate \alpha: a learning rate that is too high can make the values oscillate around the global minimum without converging to it, while one that is too low ensures convergence but may take a very long time to do so, and maintaining this trade-off is a rather tedious job. Secondly, the number of iterations needed for \theta to converge to the global minimum can in general be very high. To take care of these problems, more sophisticated algorithms have been developed, such as Conjugate Gradient, BFGS and L-BFGS, which have built-in procedures to detect a good value of \alpha and to adapt it as the optimisation proceeds. These algorithms are also known for their relatively rapid convergence towards the global minimum. As with gradient descent, they require the cost function and the gradients for the individual parameters to be defined beforehand. One demerit of these algorithms is that they are considerably more complex to understand and visualise, so more often than not we rely on predefined packages in order to use them.
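In R, the built-in optim() function offers BFGS (among other methods) and only needs the cost and gradient defined above. A rough sketch, reusing the illustrative cost function and the same assumed X and y:

# Gradient of J(theta), in the form optim() expects
gradient <- function(theta, X, y) {
  h <- 1 / (1 + exp(-(X %*% theta)))
  as.vector(t(X) %*% (h - y)) / nrow(X)
}

# Minimise the cost with BFGS instead of a hand-rolled descent loop
fit <- optim(par = rep(0, ncol(X)), fn = cost, gr = gradient,
             X = X, y = y, method = "BFGS")
theta_hat <- fit$par    # estimated parameter vector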

Several times in real-life situations we encounter problems where the classification has to be made among more than two groups, and one very useful property of logistic regression is that it can also perform multiclass classification, apart from binary classification. The idea behind multiclass classification is the one-versus-all technique: at a time only one class is considered the class of interest, i.e. the positive class, and the rest of the classes fall into the negative class. A logistic regression classifier is trained individually for each class using the available learning data, and for a new observation the probability of belonging to each of the different classes is estimated; the class with the highest estimated probability is then selected.
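A rough sketch of one-versus-all in R using glm is shown below; the data frame df, its factor column label and the one-row data frame new_obs are illustrative assumptions, not part of the data set used later:

# Fit one binary logistic model per class of df$label (one-versus-all)
classes <- levels(df$label)
models  <- lapply(classes, function(cl) {
  df$target <- as.numeric(df$label == cl)   # 1 for the class of interest, 0 otherwise
  glm(target ~ . - label, family = binomial(link = "logit"), data = df)
})

# For a new observation, pick the class whose model gives the highest probability
probs <- sapply(models, function(m) predict(m, newdata = new_obs, type = "response"))
predicted_class <- classes[which.max(probs)]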

R Code with explanation

To demonstrate the implementation of logistic regression in R, let us consider a data set that records the type of breast cancer as a function of various parameters, obtained from http://archive.ics.uci.edu/ml. To be precise, it contains measurements from digitised images of fine-needle aspirates of breast masses, together with the corresponding report of whether the mass belongs to the benign or the malignant class. We begin by importing the CSV data file into a 'wbcd' data frame:

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

The structure of the data, including the first few values of every variable, can be inspected with

str(wbcd)

Since many R machine learning classifiers require the target feature to be coded as a factor, we recode the diagnosis variable from its levels B and M using

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))

Another important aspect that needs attention is the range of values of the different features: because the parameter updates are adaptive and depend on the magnitude of the samples, features whose scales differ significantly can noticeably affect the convergence of the estimated parameters towards the global minimum. Normalisation across the variables eliminates the possibility of such an adverse effect. The first column of this particular file is a patient id, which carries no predictive information and is dropped; the remaining 30 numeric features are then rescaled with a normalize function:

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

wbcd <- wbcd[-1]                                        # drop the id column
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))  # normalise the 30 numeric features
wbcd_n$diagnosis <- wbcd$diagnosis                      # reattach the class label

Having made the data set smoothly usable, the next step is to divide it into training and test sets, in order to get a broader picture of how the model behaves on unseen data:

wbcd_train <- wbcd_n[1:469, ]

wbcd_test <- wbcd_n[470:569, ]

With the training set in hand, the logistic regression model can be fitted using the command

model <- glm(diagnosis ~ ., family = binomial(link = "logit"), data = wbcd_train)

The result of the model can be seen using the command

summary(model)

The output shows the statistical significance of each estimated parameter at different levels of significance (1%, 5%, 10%). With its help we can judge which parameters significantly influence the dependent variable and which do not: a significance level below 10% supports the former conclusion.
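The coefficient table, including the p-values, can also be extracted programmatically if further processing is required:

coefs <- coef(summary(model))                               # Estimate, Std. Error, z value, Pr(>|z|)
significant <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]  # coefficients significant at the 5% level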

In order to utilise the model parameters estimated above, we can apply the model directly to the unseen data set with the help of the following code

results <- predict(model, newdata = wbcd_test, type = "response")

With type = "response", as in the code above, the result gives the probability of each observation belonging to the positive class, and the threshold can be chosen as discussed in the earlier segment. Let us choose a threshold value of 0.5 for this purpose; it is applied with the following code

results <- ifelse(results > 0.5, 1, 0)

The results vector will then give the classification as 1 for the Malignant class of tumour and 0 for the Benign class. Since the original labels for the test data set are available, we can exploit them to check the performance of the model using a simple performance metric known as accuracy, which can be calculated as

actual <- ifelse(wbcd_test$diagnosis == "Malignant", 1, 0)

error <- mean(results != actual)

accuracy <- 1 - error

Generally, a higher value of accuracy indicates better performance. It is also worth noting that other, more sophisticated metrics exist, such as precision, recall and the F1 score, which are used especially for skewed classes, as well as other methods of validating the performance of the model, such as the ROC curve.
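As a rough sketch of how those metrics could be computed from the same predictions (the variable names follow the code above, with Malignant treated as the positive class):

tp <- sum(results == 1 & actual == 1)    # true positives
fp <- sum(results == 1 & actual == 0)    # false positives
fn <- sum(results == 0 & actual == 1)    # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)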

Applications in Industry

Some of the industrial applications of supervised classification using logistic regression, along with possible combinations of features, are listed below.

Assessment of the riskiness of bank loan issuance

One of the biggest concerns of banks is the likelihood that a loan they issue will default. To address this, past bank records can be exploited to train a model that indicates in advance the probability of default; the data may contain features of the applicant such as checking balance, loan duration in months, credit history, purpose and amount.

Assessment of the potential of new venture

Many angel investors and venture capitalists seek promising new ventures to finance. In order to classify ventures by their potential, a data set of past records can be used, containing features such as pre-money valuation, market acceptance, scalability, competitors, industry background, pricing power, technological edge and entry barriers.

Medical Discipline – Diagnosis of tumour

A logistic regression algorithm can be used to automate the identification of tumours, thereby reducing the need for fine-needle aspiration biopsies. Based on past records and digitised imaging parameters, an observed tumour can be classified as cancerous or non-cancerous; features of the tumour that can be included in the data set are radius, texture, perimeter, concavity, compactness, symmetry and fractal dimension.

Surveillance of Online transaction

A large number of fraudulent activities have been observed on the internet over the years, typically involving stolen passwords or credit/debit card numbers. A surveillance system for online transactions has therefore become necessary; it can be built on past data with features such as the amount and frequency of transactions and anomalies in the path through which a transaction is made.
