The Relationship Between Varying Environmental Conditions & The Usage Of Bike Rentals
A bike sharing system is a new mode of transport in which bicycles are rented to the public for travel from one service station to another, with payment made for a specified, usually short, period of time. Studies have shown that the number of bike rental users changes over the course of the year because of environmental factors (Vogel, Greiser, & Mattfeld, 2011, p. 515). Given that, the aim of this report is to establish how environmental factors affect the usage count of bike rentals, looking at the predictor variables temperature, humidity, season, feeling temperature and wind speed.
For the given data covering 2011 to 2012, a multiple logistic regression model was created to predict whether environmental factors have an impact on the usage count of rentals, that is, whether they cause the usage count to be high or low. Previous papers have not shown that seasonal change has a strong impact on the usage count of bike rentals, which this report examines in more depth through the predictor variable season and the other variables present.
The data were modelled using the stepwise method, which is a combination of the forward selection method and the backward selection method. Variables were entered into the model at a significance level of 0.05, since this level was considered to give a well-fitting model, and variables entered must maintain a p-value below 0.05 to remain in the model. The method helped in finding the best predictor variables amongst the 11 predictor variables.
The model shows that the usage count is higher during summer and spring than during autumn and winter, suggesting a relationship between the usage count and the predictor variable season. Does this change cause other environmental factors to also affect the usage count? Studies have shown that bike usage counts differ across the seasons, depending on the type of day and the time of day (morning, afternoon or evening).
Introduction
Bike rentals have become increasingly popular since they “provide the missing link between existing points of public transportation and desired destinations” (Vogel, Greiser, & Mattfeld, 2011, p. 514). These systems are used worldwide as an alternative means of transport to and from work, as a recreational activity during the holiday season, and in various other ways. Bike rentals are an increasingly attractive option due to factors such as traffic, pollution and affordability (El-Assi, Mahmoud, & Habib, 2015, p. 590). The operation of a bike-sharing system (BSS) is largely dependent on factors such as the built environment, weather conditions, location and distribution of bikes.
The aim of this project is to analyse the relationship between varying environmental conditions and the usage of bike rentals. More specifically, the aim is to look at the usage of bike rentals on working days. Our dataset records registered and casual users, together with the environmental factors temperature, humidity, wind speed, weather and feeling temperature. Other variables such as time of year in months and seasons are also considered.
The dataset has 14 variables and 468 observations, which allows for testing through SAS procedures. This paper focuses on analysing the data in order to establish whether environmental factors have an impact on the usage of bike rentals. This is done by first producing a descriptive analysis of the dataset to determine the nature of the data. A model is then created to test the relationship between the number of bike rental users and environmental changes by running a multiple logistic regression. Graphs are used to diagnose the model and to find suitable remedies for making the model more effective.
Challenges in bike rentals
In terms of operations, bike rentals may encounter challenges relating to location, considering the kinds of people living, working and attending school in those locations. Bike rentals operating in locations close to campuses and business hubs may experience higher rental usage than their counterparts elsewhere (El-Assi, Mahmoud, & Habib, 2015, p. 606). Also, considering the built environment when setting up bike rental shops is very important. Users, whether casual or registered, are more likely to use bike rentals when appropriate infrastructure, such as cyclists’ lanes, has been put in place. Where appropriate infrastructure is lacking, potential and current users of the bike rentals may be discouraged.
Influence of variables on rental usage
- Influence of type of day
- Influence of environmental factors
- Interaction between environmental factors and type of day and its influence
The type of day has been divided into two, namely weekday and weekend. Statutory holidays have been included under the latter type. On weekdays, usage by registered users was at a consistent level, which is attributed to the daily work commute; weekday usage peaks occur in the morning, at noon and in the late afternoon. On weekends, by contrast, usage peaks were found only in the afternoon.
Usage generally increased with the higher temperatures in summer and was limited by the lower temperatures in spring. Other authors have also found that usage decreases not only when temperatures fall, as the data showed for usage trends during winter, but also when temperatures are too high.
Other authors have analysed the effect of weather conditions and have found that temperature and time of year have the most significant influence on usage of the BSS. Furthermore, consistent usage was found on weekdays for registered users, while there was a slight decrease in winter for casual users due to the lower temperatures and higher likelihood of injury.
Methodology
This section describes the methods used to analyse the data in order to build a multiple logistic regression model that accurately predicts the effects of environmental factors on the usage of the bike-sharing system on a working day, based on usage count. Various procedures were performed in SAS; we conducted the analysis in the following order:
- De-normalising the data
- Test for normality
- Correlation analysis
- Regression model building
- Diagnostics
Some of the variables in the dataset needed to be de-normalised, namely temperature, feeling temperature, humidity and wind speed. A new dataset was created from the old one with each of these variables de-normalised, so that the model being created will fit the data and be able to make future predictions.
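The de-normalisation step can be illustrated with a minimal sketch. It assumes that the variables were scaled to the interval [0, 1] by min-max scaling; the report does not state the scaling constants, so the ranges and the column names (`temp`, `atemp`, `hum`, `windspeed`, `cnt`) below are hypothetical and would need to be replaced with the values documented for the actual dataset.

```python
# Hypothetical (min, max) ranges for each scaled variable. The report does not
# state the constants used, so these numbers are illustrative assumptions only.
ASSUMED_RANGES = {
    "temp": (-8.0, 39.0),      # temperature
    "atemp": (-16.0, 50.0),    # feeling temperature
    "hum": (0.0, 100.0),       # humidity
    "windspeed": (0.0, 67.0),  # wind speed
}


def denormalise(value, lo, hi):
    """Invert min-max scaling: map a value in [0, 1] back to [lo, hi]."""
    return value * (hi - lo) + lo


def denormalise_row(row, ranges=ASSUMED_RANGES):
    """Return a copy of `row` with each scaled variable restored to its original units."""
    restored = dict(row)
    for name, (lo, hi) in ranges.items():
        if name in restored:
            restored[name] = denormalise(restored[name], lo, hi)
    return restored


# One made-up observation; "cnt" (usage count) is not scaled and is left untouched.
example = {"temp": 0.5, "atemp": 0.5, "hum": 0.8, "windspeed": 0.25, "cnt": 1200}
print(denormalise_row(example))
```

Writing the result to a new dictionary rather than modifying the input mirrors the approach of creating a new dataset from the old one.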
The SAS procedure proc univariate is useful for testing the normality of variables: it produces detailed descriptive statistics as well as graphs such as histograms, box-and-whisker plots and normal probability plots to support the values produced. Examples of the descriptive statistics include means, variances, skewness and kurtosis. This list is not exhaustive.
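To make these descriptive statistics concrete, the sketch below computes the mean, variance, skewness and kurtosis for a single variable. It is not a reproduction of proc univariate: it uses simple moment-based formulas, so the skewness and kurtosis may differ slightly from the values SAS reports, and the function name `describe` is our own.

```python
def describe(values):
    """Return the mean, sample variance, skewness and excess kurtosis of `values`."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Central moments of order 2, 3 and 4 (dividing by n).
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    return {
        "mean": mean,
        "variance": m2 * n / (n - 1),           # sample variance (divides by n - 1)
        "skewness": m3 / m2 ** 1.5,             # 0 for a symmetric distribution
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


# Made-up values, used only to show the call.
print(describe([1, 2, 3, 4, 5]))
```

Skewness and excess kurtosis close to zero are consistent with normality, although graphs such as the normal probability plot remain the more informative check.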
The SAS procedure proc corr allows us to check for a linear statistical relationship between variables. The target variable in the dataset, usage count, is the total number of rental bikes used by both registered and casual users, and is the variable that will explain how environmental factors affect bike rentals. The correlation of the target variable with all other variables in the dataset was checked. From this, we were able to extract the variables that have a high correlation with usage count and would therefore be most useful in our modelling.
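The correlation screening can be sketched as follows. This is an illustration of the idea rather than the SAS output: it computes the Pearson correlation of each predictor with the target and ranks the predictors by the absolute value of that correlation. The column names and the numbers in the example are invented for the illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_predictors(data, target):
    """Rank predictors by absolute correlation with the target, strongest first."""
    scores = {
        name: pearson(column, data[target])
        for name, column in data.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy data; "cnt" stands for the usage count.
toy = {
    "temp": [10, 15, 20, 25, 30],
    "hum": [80, 60, 70, 50, 65],
    "cnt": [100, 150, 210, 240, 310],
}
for name, r in rank_predictors(toy, "cnt"):
    print(name, round(r, 3))
```

Ranking by absolute value keeps strongly negative correlations in view, since a predictor that lowers usage is as informative as one that raises it.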
During this analysis, plotting histograms, boxplots and scatter diagrams was very useful, particularly in gaining a visual understanding of how the data behave when the variables are compared. A widely scattered plot does not indicate a linear relationship between the variables, and such a predictor variable would be regarded as not having a linear relationship with the target variable. However, a scatter plot with clustered points forming a straight line would be indicative of a strong linear relationship (Kutner, Nachtsheim, Neter, & Li, 2005, p. 100). Such predictor variables would be regarded as significant and thus useful in explaining whether the usage count will be high or low. From these variables a multiple logistic regression model could be developed.
Multiple logistic regression. Multiple logistic regression is a statistical procedure, run in SAS with proc logistic, in which the response takes only two possible values, 0 or 1, and which can therefore be used to model the occurrence of an event. The effect of each predictor variable is assessed through its odds ratio, which indicates whether that variable makes high usage of bike rentals more or less likely. Since count is not binary but a count variable, being the sum of casual users and registered users, a new variable was created with two possible outcomes: one event being high usage count and the other being low usage count.
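The creation of the binary response can be sketched as below. The report does not state the cut-off used to separate high from low usage, so the median is used here purely as an assumption; the helper names are our own. The second function shows how an estimated coefficient is converted to an odds ratio.

```python
import math
import statistics


def binarise_counts(counts, threshold=None):
    """Code each usage count as 1 (high usage) or 0 (low usage).

    If no threshold is supplied, the median of the counts is used.
    This cut-off is an assumption, not the report's documented rule.
    """
    if threshold is None:
        threshold = statistics.median(counts)
    return [1 if count > threshold else 0 for count in counts]


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor: exp(b)."""
    return math.exp(coefficient)


# Made-up counts, used only to show the coding.
print(binarise_counts([100, 200, 300, 400]))  # median is 250
print(odds_ratio(0.0))                        # a zero coefficient leaves the odds unchanged
```

An odds ratio above 1 means the predictor raises the odds of high usage, and an odds ratio below 1 means it lowers them.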
A multiple logistic regression model was created to predict whether the usage count of bike rentals is high or low, using a stepwise variable selection criterion. Under this method, a variable enters the model only if its p-value is below 0.05, and once in the model it must keep a p-value below 0.05 to remain. The significance level of 0.05 was considered to give a good model, since it accumulates more observations from the end points.
The parameters of the model were estimated by maximum likelihood. The log-likelihood function is given by:

$$\log_e L(\beta) = \sum_{i=1}^{n} Y_i (X_i'\beta) - \sum_{i=1}^{n} \log_e\left[1 + \exp(X_i'\beta)\right]$$

where $X_i'\beta = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1}$. The logistic procedure in SAS maximises this function to obtain the parameter estimates, which give the fitted logistic response function:

$$\hat{\pi} = \left[1 + \exp(-X'b)\right]^{-1}$$

where $X'b = b_0 + b_1 X_1 + \dots + b_{p-1} X_{p-1}$. Based on the fitted logistic function, a test for lack of fit was carried out on the training set using the Hosmer and Lemeshow chi-square test, which gives a chi-square value and a p-value. A p-value greater than the 0.05 significance level indicates no evidence of lack of fit, in which case the model can be used on the test set.
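The log-likelihood and the fitted logistic response function described above can be written out directly. The sketch below only evaluates these two expressions for given coefficients; it does not perform the maximisation that SAS carries out, and the function names are our own.

```python
import math


def linear_predictor(coefficients, x):
    """Compute b0 + b1*x1 + ... for one observation; coefficients[0] is the intercept."""
    return coefficients[0] + sum(b * xi for b, xi in zip(coefficients[1:], x))


def log_likelihood(beta, X, y):
    """Sum over observations of Y_i * (X_i'beta) - log(1 + exp(X_i'beta))."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = linear_predictor(beta, xi)
        total += yi * eta - math.log(1.0 + math.exp(eta))
    return total


def fitted_probability(b, x):
    """Fitted logistic response: 1 / (1 + exp(-X'b))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(b, x)))


# Made-up data: one predictor, two observations.
X = [[1.0], [2.0]]
y = [0, 1]
print(log_likelihood([0.0, 0.0], X, y))      # equals 2 * -log(2) when all coefficients are zero
print(fitted_probability([0.0, 0.0], [5.0]))  # 0.5 when all coefficients are zero
```

The maximum likelihood estimates are the coefficient values for which `log_likelihood` is largest; with all coefficients at zero, every fitted probability is 0.5.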
Often during predictive modelling, issues may arise that make the model unsuitable for use. However, such issues can be remedied when accurately diagnosed. These issues are called departures from a linear model and include:
- Nonnormality of error terms
- Nonconstancy of error variances
- Nonindependence of error terms
- Nonlinearity of regression function
- Existence of some outliers
- Omission of predictor variables
To diagnose a departure of the multiple logistic regression model, we study the residuals, a useful output that can be requested from SAS during regression modelling. Once an issue has been identified, an appropriate remedy is applied to the model and the model is rerun to check its aptness once more. This process is repeated until the model is appropriate. From the output, the model does not seem to have issues. However, when regressing the target variable against each predictor variable, temperature and feeling temperature seem to produce a curvilinear scatter plot. This means that we may need to transform those predictor variables or drop them from the model.
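As an illustration of residual-based diagnosis, the sketch below computes Pearson residuals for a binary response. The report does not state which type of residual was requested from SAS, so the choice of Pearson residuals is an assumption, as is the cut-off of 2 used to flag observations; the function names are our own.

```python
import math


def pearson_residuals(y, fitted):
    """Pearson residual for each observation: (y - p) / sqrt(p * (1 - p))."""
    return [
        (yi - pi) / math.sqrt(pi * (1.0 - pi))
        for yi, pi in zip(y, fitted)
    ]


def flag_large(residuals, cutoff=2.0):
    """Return the indices of residuals whose absolute value exceeds `cutoff`.

    The default cut-off of 2 is an assumed rule of thumb, not taken from the report.
    """
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff]


# Made-up outcomes and fitted probabilities.
y = [1, 0, 1, 0]
fitted = [0.5, 0.8, 0.9, 0.05]
residuals = pearson_residuals(y, fitted)
print([round(r, 3) for r in residuals])
print(flag_large(residuals))
```

Observations with large residuals are candidates for closer inspection as possible outliers before any remedy is applied.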
Conclusion
Since previous authors have researched the challenges of infrastructure and distribution, we aim to follow up on those authors who have researched environmental factors. Specifically, we consider the influence of environmental factors on a working day, and whether this includes weekends and statutory holidays. Since other authors have identified temperature as the strongest influence on bike rental usage, we follow from that and isolate it according to the type of day and not only the season in which the day is grouped (El-Assi, Mahmoud, & Habib, 2015).
Furthermore, the number of bike rentals that occur on a given day has been categorised according to the type of day, seasonal changes and temperature. This accounts for those interactions, but not for the type of user in relation to those variables separately and collectively. In this report, we aim to predict the effect of environmental factors on a type of day, specifically a working day. This is done by creating a model that predicts the effects of various environmental factors on the usage count of bike rentals.