Analyzing Bike Data with R

I moved to Columbus, Ohio in the fall of 2018 for work. One of the things that I really like about Columbus is the CoGo bike sharing network. CoGo currently has 597 bikes and 72 stations spread across the city. Recently it was announced that the network will be expanding in 2020 with additional stations and new e-bikes.

I bought an annual subscription shortly after I moved here. There are stations close to both my apartment and my work, which makes commuting easy. It’s also great because I hate driving, and it lets me get around the city without having to worry about parking.

If you have a membership, CoGo makes it easy to get all your trip data from their website. (They also share real-time data and historical trip data for all riders. Data can be found here.) I copied and pasted my personal data from their website to do some basic analysis of my riding habits.

Bike Riding Analysis

# load the necessary libraries
library(tidyverse)
library(janitor)

Reading in the data

# read in my riding data
cogo_data <- read_csv("https://raw.githubusercontent.com/trevinflick/blog/master/my_cogo_data.csv")

I also used Google Maps to get the coordinates of all the bike stations that I used.

# read in the station lookup table
cogo_lu <- read_csv("https://raw.githubusercontent.com/trevinflick/blog/master/station_lookup.csv")
# take a look at the data
glimpse(cogo_data)
## Observations: 214
## Variables: 6
## $ `Trip ID`       <dbl> 411343, 411344, 411358, 411374, 411382, 411418, 41144…
## $ `Start Station` <chr> "Easton Square Pl & Townsfair Way", "Seward St & Wort…
## $ `Start Time`    <chr> "1/5/20 11:11", "1/5/20 12:07", "1/6/20 8:06", "1/6/2…
## $ `End Station`   <chr> "Seward St & Worth Ave", "Easton Square Pl & Townsfai…
## $ `End Time`      <chr> "1/5/20 11:19", "1/5/20 12:13", "1/6/20 8:12", "1/6/2…
## $ Duration        <chr> "7m 28s", "5m 34s", "6m 30s", "4m 59s", "5m 52s", "5m…
glimpse(cogo_lu)
## Observations: 28
## Variables: 3
## $ STATION <chr> "3rd St & Sycamore St", "4th St & Rich St", "Bicentennial Par…
## $ LAT     <dbl> 39.94889, 39.95773, 39.95596, 39.95754, 39.97135, 40.04882, 3…
## $ LON     <dbl> -82.99519, -82.99536, -83.00313, -82.99853, -83.00217, -82.91…

Data cleaning

Now we’ll clean up the riding data so it’s easier to analyze.

# clean up the column names with the janitor package
cogo_data <- clean_names(cogo_data)
cogo_lu <- clean_names(cogo_lu)
# separate date and time into two columns
cogo_data <- cogo_data %>%
  separate(col = start_time, into = c("start_date","start_time"), sep = " ")

cogo_data <- cogo_data %>%
  separate(col = end_time, into = c("end_date","end_time"), sep = " ")
# pull the minutes and seconds out of the duration string,
# then create a column for trip time in seconds
cogo_data <- cogo_data %>%
  mutate(minutes = as.numeric(str_extract_all(duration, "[0-9]+", simplify = TRUE)[,1]),
         seconds = as.numeric(str_extract_all(duration, "[0-9]+", simplify = TRUE)[,2]),
         trip_seconds = minutes * 60 + seconds)
head(cogo_data)
## # A tibble: 6 x 11
##   trip_id start_station start_date start_time end_station end_date end_time
##     <dbl> <chr>         <chr>      <chr>      <chr>       <chr>    <chr>   
## 1  411343 Easton Squar… 1/5/20     11:11      Seward St … 1/5/20   11:19   
## 2  411344 Seward St & … 1/5/20     12:07      Easton Squ… 1/5/20   12:13   
## 3  411358 Lucas St & T… 1/6/20     8:06       Front St &… 1/6/20   8:12    
## 4  411374 Front St & B… 1/6/20     11:52      Lucas St &… 1/6/20   11:57   
## 5  411382 Lucas St & T… 1/6/20     12:57      Front St &… 1/6/20   13:03   
## 6  411418 Front St & B… 1/6/20     16:37      Lucas St &… 1/6/20   16:43   
## # … with 4 more variables: duration <chr>, minutes <dbl>, seconds <dbl>,
## #   trip_seconds <dbl>
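The duration parsing above relies on every value having exactly a minutes part followed by a seconds part. As a minimal sketch of a more defensive approach, the helper below looks for each unit by its letter instead of by position. The name `duration_to_seconds` is hypothetical, and the `h` unit and the seconds-only format are assumptions about what the export might contain, not something seen in this data.

```r
library(stringr)

# Hypothetical helper: convert strings like "7m 28s" to total seconds.
# Assumes units are written as h, m and s; a missing unit counts as zero.
duration_to_seconds <- function(x) {
  part <- function(unit) {
    # grab the digits that sit directly in front of the unit letter
    value <- as.numeric(str_extract(x, paste0("[0-9]+(?=", unit, ")")))
    ifelse(is.na(value), 0, value)
  }
  part("h") * 3600 + part("m") * 60 + part("s")
}

# "7m 28s" is from the data above; the other two formats are assumed
example_seconds <- duration_to_seconds(c("7m 28s", "45s", "1h 2m 3s"))
example_seconds
```

Inside the pipeline this could be used as `mutate(trip_seconds = duration_to_seconds(duration))`.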

Initial Analysis

Okay, now that we’ve cleaned up some of the columns we can dive into the data. Let’s take a look at my total number of trips.

# each row is one trip, so the row count is the trip count
nrow(cogo_data)
## [1] 214

I took 214 trips as a CoGo member, but I want to know what time frame we’re looking at. In order to work with dates, we’ll load the lubridate package.

library(lubridate)

cogo_data$start_date <- mdy(cogo_data$start_date)
cogo_data$end_date <- mdy(cogo_data$end_date)
cogo_data %>%
  summarise(first_day = min(start_date),
            last_day = max(start_date))
## # A tibble: 1 x 2
##   first_day  last_day  
##   <date>     <date>    
## 1 2018-11-29 2020-01-07

So it looks like I began riding on November 29th, 2018 and my last ride was on January 7th, 2020. Now I want to count the number of trips by year.

cogo_data %>%
  count(year(start_date))
## # A tibble: 3 x 2
##   `year(start_date)`     n
##                <dbl> <int>
## 1               2018    11
## 2               2019   193
## 3               2020    10
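The cleaning step split each timestamp into separate date and time columns. As a rough sketch of an alternative, lubridate's `mdy_hm()` can parse the original combined string directly, which makes it possible to cross-check the reported duration against the timestamps. The two values below are the start and end of the first trip in the data; the variable names are just for illustration.

```r
library(lubridate)

# Timestamps in the same "m/d/yy h:mm" shape as the raw export
trip_start <- mdy_hm("1/5/20 11:11")
trip_end   <- mdy_hm("1/5/20 11:19")

# Elapsed time in minutes, computed from the timestamps themselves
elapsed_minutes <- as.numeric(difftime(trip_end, trip_start, units = "mins"))
elapsed_minutes
```

The timestamps only have minute precision, so this will not match the reported duration of 7m 28s exactly, but it is a quick sanity check that the two agree.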

2019 Analysis

I took the bulk of my trips in 2019, so for the rest of the analysis we’ll focus on that year.

cogo_2019 <- cogo_data %>%
  filter(year(start_date) == 2019)

Plotting!

Let’s take a look at what days of the week and what months I tend to use CoGo.

cogo_2019$day <- weekdays(cogo_2019$start_date)
library(ggthemes) # for a pretty graph
days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")

cogo_2019 %>%
  ggplot(aes(x = day)) +
  geom_bar() +
  scale_x_discrete(limits = days) +
  labs(x = "",
       y = "",
       title = "Most rides in 2019 occurred on Saturday") +
  theme_fivethirtyeight()

start_month <- c("January","February","March","April","May","June","July","August","September",
                 "October","November","December")

cogo_2019 %>%
  ggplot(aes(x = month(start_date))) +
  geom_bar() +
  scale_x_discrete(limits = start_month) +
  labs(x = "",
       y = "",
       title = "May was the most popular month for 2019") +
  theme_fivethirtyeight(base_size = 9)

It makes sense that most of my rides occurred in the spring and summer during the warmer months. It’s a little surprising that I rode 10 times in January (more than February and March combined).
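The weekday plot above orders its bars with a hand-written vector of day names. A possible alternative, sketched here on a few made-up dates, is lubridate's `wday()` with `label = TRUE`, which returns an ordered factor so the calendar order comes for free. The day names it produces depend on the session's locale.

```r
library(lubridate)

# Hypothetical ride dates: two on a Saturday, one Sunday, one Monday (May 2019)
ride_dates <- as.Date(c("2019-05-04", "2019-05-04", "2019-05-05", "2019-05-06"))

# label = TRUE returns an ordered factor; abbr = FALSE spells out the day names
ride_days <- wday(ride_dates, label = TRUE, abbr = FALSE)

# Counts come back in calendar order, including days with zero rides
day_counts <- table(ride_days)
day_counts
```

When a factor like this is mapped to the x axis in ggplot2, adding `scale_x_discrete(drop = FALSE)` keeps the days that have no rides.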

Next I want to see how long I typically ride during each trip. For this exercise I’ll take a look at the minutes column. (Since I’m not including the seconds column we’ll have to think of the minutes as bins: 0-59 seconds for 0 minutes, 1:00-1:59 for 1 minute, 2:00-2:59 for 2 minutes, etc.)

cogo_2019 %>%
  ggplot(aes(x = minutes)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0,30,by=5)) +
  labs(x = "Trip duration in minutes",
       y = "",
       title = "Most trips are around 6 to 7 minutes") +
  theme_light()
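The binning described above (everything from 1:00 to 1:59 counts as 1 minute, and so on) is the same as integer division of the trip time by 60. A small sketch with made-up values:

```r
# Hypothetical trip times in seconds
trip_seconds <- c(59, 60, 119, 448)

# Integer division drops the leftover seconds, giving the minute bin
minute_bin <- trip_seconds %/% 60
minute_bin
```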

A majority of my trips are under 10 minutes. A CoGo membership includes unlimited trips under 30 minutes, and any ride over 30 minutes costs extra. In 2019 I had one trip over 30 minutes. Let’s take a closer look at that one.

cogo_2019 %>%
  filter(minutes > 30)
## # A tibble: 1 x 12
##   trip_id start_station start_date start_time end_station end_date   end_time
##     <dbl> <chr>         <date>     <chr>      <chr>       <date>     <chr>   
## 1  383502 Lane Ave at … 2019-06-29 10:13      North Bank… 2019-06-29 10:51   
## # … with 5 more variables: duration <chr>, minutes <dbl>, seconds <dbl>,
## #   trip_seconds <dbl>, day <chr>
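One caveat with filtering on the minutes column: a ride of 30 minutes and 45 seconds is over the limit, but its minutes value is exactly 30, so `minutes > 30` would miss it. A sketch of a stricter check compares `trip_seconds` against the limit instead. The rides below are made up for illustration, and `free_limit` is a hypothetical name.

```r
library(dplyr)

# Hypothetical rides; trip_seconds mirrors the column built during data cleaning
rides <- tibble(
  trip_id      = c(1, 2, 3),
  minutes      = c(29, 30, 37),
  trip_seconds = c(29 * 60 + 59, 30 * 60 + 45, 37 * 60 + 10)
)

# 30-minute limit expressed in seconds
free_limit <- 30 * 60

# Ride 2 is caught here, while filter(minutes > 30) would skip it
over_limit <- rides %>%
  filter(trip_seconds > free_limit)
over_limit
```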

My one trip over 30 minutes lasted 37 minutes. I remember it because I ran 5 miles with my friend to Ohio State’s campus and then I biked back on a CoGo while he ran the rest of the way back (I’m a great friend).

Station Location Analysis

So far we’ve looked at when I typically ride and how long the trips usually take. Now I want to look into where I ride.

# top 10 starting stations
cogo_2019 %>%
  group_by(start_station) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
## # Groups:   start_station [10]
##    start_station                   n
##    <chr>                       <int>
##  1 Lucas St & Town St             81
##  2 Front St & Beck St             51
##  3 Neil Ave & Nationwide Blvd     17
##  4 Summit St & 17th Ave            8
##  5 3rd St & Sycamore St            7
##  6 High St & Warren                3
##  7 Schiller Park - Stewart Ave     3
##  8 4th St & Rich St                2
##  9 Bicentennial Park               2
## 10 Columbus Commons - Rich St      2
# top 10 ending stations
cogo_2019 %>%
  group_by(end_station) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
## # Groups:   end_station [10]
##    end_station                     n
##    <chr>                       <int>
##  1 Lucas St & Town St             92
##  2 Front St & Beck St             40
##  3 Neil Ave & Nationwide Blvd     14
##  4 3rd St & Sycamore St            8
##  5 Summit St & 17th Ave            8
##  6 High St & Warren                5
##  7 Bicentennial Park               4
##  8 Nationwide Arena - Front St     3
##  9 Schiller Park - Stewart Ave     3
## 10 Topiary Park - Town St          3
# top 10 A to B trips
cogo_2019 %>%
  group_by(start_station, end_station) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 3
## # Groups:   start_station, end_station [10]
##    start_station              end_station                     n
##    <chr>                      <chr>                       <int>
##  1 Front St & Beck St         Lucas St & Town St             46
##  2 Lucas St & Town St         Front St & Beck St             38
##  3 Neil Ave & Nationwide Blvd Lucas St & Town St             15
##  4 Lucas St & Town St         Neil Ave & Nationwide Blvd     11
##  5 Lucas St & Town St         3rd St & Sycamore St            7
##  6 Lucas St & Town St         Summit St & 17th Ave            7
##  7 3rd St & Sycamore St       Lucas St & Town St              6
##  8 Summit St & 17th Ave       Lucas St & Town St              5
##  9 Lucas St & Town St         High St & Warren                3
## 10 Lucas St & Town St         Nationwide Arena - Front St     3

Here are my most-used starting and ending stations, as well as my most frequent routes. My top two stations are Front St & Beck St and Lucas St & Town St. This makes sense since these stations are closest to my apartment and to my work. The two directions between them are also my two most frequent routes. The Neil Ave & Nationwide Blvd station is my third most common destination. I like using this station for going to baseball games or concerts.
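The route table counts A to B and B to A separately. To treat both directions as one route, one option (a sketch, not what the analysis above does) is to sort each pair of station names before counting. The station names below are placeholders.

```r
library(dplyr)

# Hypothetical trips between placeholder stations
trips <- tibble(
  start_station = c("A", "B", "A", "C", "B"),
  end_station   = c("B", "A", "B", "A", "A")
)

# pmin/pmax put each pair in alphabetical order, so A -> B and B -> A match
route_counts <- trips %>%
  mutate(
    station_a = pmin(start_station, end_station),
    station_b = pmax(start_station, end_station)
  ) %>%
  count(station_a, station_b, sort = TRUE)
route_counts
```

`count(..., sort = TRUE)` is also a shorter way to write the `group_by()`, `count()`, `arrange(desc(n))` chain used in the tables above.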

Mapping!

library(ggmap)

cogo_start_19 <- left_join(cogo_2019, cogo_lu, by = c("start_station" = "station"))
cogo_end_19 <- left_join(cogo_2019, cogo_lu, by = c("end_station" = "station"))
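Because `left_join()` keeps every trip, a station name that is missing from the lookup table (or spelled differently) ends up with `NA` coordinates and silently drops off the map. A quick check, sketched here with a deliberately made-up station name, is `anti_join()`, which returns only the rows that found no match. The two coordinates are taken from the lookup output shown earlier.

```r
library(dplyr)

# Hypothetical trips; "Made Up Station" is not in the lookup on purpose
trips <- tibble(
  start_station = c("3rd St & Sycamore St", "4th St & Rich St", "Made Up Station")
)

# Two rows in the same shape as the cleaned station lookup table
lookup <- tibble(
  station = c("3rd St & Sycamore St", "4th St & Rich St"),
  lat     = c(39.94889, 39.95773),
  lon     = c(-82.99519, -82.99536)
)

# Rows of trips with no matching station in the lookup
unmatched <- anti_join(trips, lookup, by = c("start_station" = "station"))
unmatched
```

An empty result means every station has coordinates and the join is safe to map.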

A map of my starting locations in 2019.

qmplot(lon, lat, data = cogo_start_19, maptype = "toner-background", geom = c("point","density2d"),
       color = I("red"))

A map of all my ending stations in 2019.

qmplot(lon, lat, data = cogo_end_19, maptype = "toner-background", geom = c("point","density2d"),
       color = I("red"))

Fin

Okay, that’s it for now. If you’re new to R, hopefully this was useful and you learned something along the way. Here are some awesome additional R resources to check out:

Another useful resource is Tidy Tuesday. It’s a great community of R learners who analyze a new data set every week. Check out the link and #TidyTuesday on Twitter.

Keep your eye out for more content on this site!