Our data was hand-collected off of the Instagram page ocean.destinations. The data set contains 183 observations and 8 different variables including the money earned by the post, the number of accounts reached, the number of shares, the number of saves, the creator they worked with, the animal in the post (if any), the post type (photo or video), and finally the carousel count. The data contains a mix of quantitative and categorical variables, which gives us a mix of factors to work with.
Our main research questions are: 1) Is the amount earned just determined by reach? 2) Can we predict earnings solely based on variables known prior to posting? 3) Does reach really depend on the number of shares and media type?
First, we wanted to learn about whether the amount of money earned by a post is solely determined by its reach, which suggests we should examine the variables Reach Count, Money Earned, and how earnings vary across different reach categories.
library(readr)
library(ggplot2)
oceandata <- read_csv("~/Desktop/ocean_data.csv")
## Rows: 182 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): post_creator, animal_type, post_type
## dbl (4): money_earned, share_count, save_count, carosel_count
## num (1): reach_count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oceandata$animal_type[182] <- "stingray"
ggplot(oceandata, aes(x = reach_count, y = money_earned)) +
geom_point(alpha = 0.6, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "firebrick") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
labs(
title = "Earnings vs. Reach",
x = "Reach Count",
y = "Money Earned"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The first graph is a scatterplot of money made vs. reach, with a line of best fit. This shows a positive correlation between reach and money made. In general, as a post’s reach grows, money made also grows. The linear trendline shows a clear upward trend, so this suggests that reach is a strong variable to utilize in order to determine earnings. However, the spread of the points along the trendline suggests that there is a considerable amount of scatter in the data. There are low-reach posts that continue to make lots of money and high-reach posts that don’t necessarily equal the highest income. So reach is a decent indicator of income, but not the only factor. Other aspects, such as the quality of content, rates of engagement, or audience demographics, may also play a large role in earnings determination.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.4 âś” stringr 1.5.1
## âś” forcats 1.0.0 âś” tibble 3.2.1
## âś” lubridate 1.9.3 âś” tidyr 1.3.1
## âś” purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
oceandata <- read_csv("~/Desktop/ocean_data.csv",
col_types = cols(reach_count = col_character())) %>%
mutate(
reach_num = parse_number(reach_count),
reach_cat = ntile(reach_num, 3) %>%
factor(labels = c("Low reach
250k","Mid reach
250k-450k"
,"High reach
450k")),
earn_cat = ntile(money_earned, 3) %>%
factor(labels = c("Low earn < 25$","Mid earn $25-50$","High earn>50$"))
)
df_counts <- oceandata %>%
count(reach_cat, earn_cat) %>%
group_by(reach_cat) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
ggplot(df_counts, aes(x = reach_cat, y = prop, fill = earn_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 1)),
position = position_stack(vjust = 0.5),
color = "white", size = 4) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_fill_brewer(palette = "Set2", name = "Earnings Band") +
labs(
title = " Earnings by Reach Category",
x = "Reach Category",
y = "Percent of Posts"
) +
theme_minimal()
The second graph delves deeper into this question by dividing posts into three categories according to their reach, Low Reach (up to 250k), Mid Reach (250k–450k), and High Reach (more than 450k). It then displays the percentage of posts in each category that fall into three earnings brackets. The findings indicate that in the Low Reach category, most posts are in the Low Earn category, which earns less than $25. Among the Mid Reach group, earnings are more varied, yet still have a bias toward lower earnings. Among the High Reach group, an overwhelmingly large proportion of posts are in the High Earn category and earn over $50. This trend confirms the hypothesis that greater reach significantly enhances the chance of high earnings but also reveals that not every low- or mid-reach post is underperforming. There are still some rare pieces of content in these lower categories that make a lot of money.
In both of these charts, we can see that reach is a strong contributor to earnings, but there has to be other variables at work making success. Reach alone doesn’t really dictate how much earned. There are some outliers where posts make more or less than might be predicted by reach, but overall, higher reach massively increases the possibility of making more.
As part of our report, we aimed to predict the amount earned by a post using variables available before posting. While metrics like reach and shares are strong predictors, they can only be measured after a post goes live. Therefore, we focused on variables such as post_creator, animal_type, post_type, and carosel_count. We modeled these against log_money - the logorithm of money earned per post - to stabilize variance and reduce the impact of outliers, making the relationships more linear and reliable.
library(readr)
insta_data <- read_csv("~/Desktop/ocean_data.csv")
## Rows: 182 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): post_creator, animal_type, post_type
## dbl (4): money_earned, share_count, save_count, carosel_count
## num (1): reach_count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
insta_data$animal_type[182] <- "stingray"
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
insta_data$post_numeric <- ifelse(insta_data$post_type == "video", 1, 0)
insta_data$log_money <- log(insta_data$money_earned)
insta_data |>
dplyr::select(post_numeric, carosel_count, log_money) |>
ggpairs(
mapping = aes(color = factor(post_numeric), fill = factor(post_numeric), alpha = .5),
columns = c("post_numeric", "carosel_count", "log_money"),
upper = list(continuous = "cor"),
diag = list(continuous = "densityDiag"),
lower = list(continuous = "points")
) +
scale_colour_manual(values = c("darkorange", "cyan4")) +
scale_fill_manual(values = c("darkorange", "cyan4")) +
theme_bw() +
theme(strip.text = element_text(size = 7))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
For the numeric variables — post_numeric (a factorized version of
post_type) and carosel_count — we used a pairs plot to explore their
correlation with log_money. The plot showed that both variables have a
positive and significant relationship with log_money. Notably, video
posts have a much stronger signficiant correlation
(.351***
) with earnings compared to photo posts
(.166*
). Given their strong linear relationships and
significance (determined by *
= p <.05, ***
= p < .001), both variables were kept for the nonlinear regression
model.
library(lsr)
count <- table(insta_data$animal_type)
insta_data$animals <- ifelse(count[insta_data$animal_type] >= 5, insta_data$animal_type, "Other")
anova_animal <- aov(log_money ~ animals, data = insta_data)
summary(anova_animal)
## Df Sum Sq Mean Sq F value Pr(>F)
## animals 13 11.35 0.8735 1.98 0.0252 *
## Residuals 168 74.13 0.4412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
etaSquared(anova_animal)
## eta.sq eta.sq.part
## animals 0.1328358 0.1328358
anova_creator <- aov(log_money ~ post_creator, data = insta_data)
summary(anova_creator)
## Df Sum Sq Mean Sq F value Pr(>F)
## post_creator 88 48.14 0.5471 1.363 0.071 .
## Residuals 93 37.34 0.4015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
etaSquared(anova_creator)
## eta.sq eta.sq.part
## post_creator 0.5632194 0.5632194
Before running the ANOVA and calculating eta squared, we first addressed some issues from high cardinality by pruning the variables. This was done by combining infrequent categories into a new “Other” value when a category had very few occurrences. For example, animal_type[“shrimp”] only appeared once, which could hurt the significance of animal_type because regression models waste degrees of freedom fitting to rare, uninformative categories. After consolidating rare categories into “Other” to create the new variable animals, the ANOVA test revealed a statistically significant relationship with log_money (.0252). For post_creator, however, creating an “Other” category wasn’t necessary, as further EDA showed it was already well-structured. While the p-value for post_creator was slightly above the usual 0.05 threshold (at 0.071), its high eta squared value indicated a large effect size, justifying its inclusion in the final model.
After determining that all the predictor variables were significant in explaining meaningful variability, we constructed the nonlinear regression model using a span of 0.6 to account for the smaller dataset. Since all selected variables were included, building the model simply involved adding them into the nonlinear fit.
insta_data$animals <- as.factor(insta_data$animals)
insta_data$post_creator <- as.factor(insta_data$post_creator)
X <- model.matrix(~ animals + post_numeric + post_creator + carosel_count, data=insta_data)[,-1]
y <- insta_data$log_money
model <- lm(y ~ X)
insta_data$predicted <- predict(model)
library(ggplot2)
insta_data$point_type <- "Actual Earnings"
insta_data|>
ggplot(aes(x = predicted, y = log_money)) +
geom_point(aes(color = point_type), alpha = 0.6) +
geom_smooth(aes(color = "Nonlinear Trend"), method = "loess", linewidth = 1.2, span = 0.5) +
scale_color_manual(
name = "Legend",
values = c("Actual Earnings" = "cyan4", "Nonlinear Trend" = "darkorange")
) +
labs(
title = "Nonlinear Regression vs Actual Earnings",
x = "Log Predicted Earnings",
y = "Log Actual Earnings"
) + theme(
legend.title = element_text(size = 7),
legend.text = element_text(size = 7),
legend.key.size = unit(0.6, "lines"),
legend.position = "right",
legend.box.margin = margin(3, 3, 3, 3)
)
## `geom_smooth()` using formula = 'y ~ x'
Model Summary Statistics
-Residual standard error: 0.5693 on 78 degrees of freedom
-Multiple R-squared: 0.7042
-Adjusted R-squared: 0.3137
-F-statistic: 1.803 on 103 and 78 DF
-p-value: 0.003419
summary(model)
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0972 -0.1831 0.0000 0.1392 1.1748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.690971 0.743523 3.619 0.000523
## Xanimalscrab -0.555548 0.477259 -1.164 0.247956
## Xanimalseel -0.028940 0.617952 -0.047 0.962767
## Xanimalsfish -0.250737 0.390443 -0.642 0.522635
## Xanimalsjellyfish 1.241657 0.678778 1.829 0.071183
## Xanimalsnone 0.160623 0.411923 0.390 0.697647
## XanimalsOther 0.324926 0.348385 0.933 0.353872
## Xanimalspenguin -0.556185 0.492362 -1.130 0.262096
## Xanimalspolar_bear 0.113621 0.470918 0.241 0.809975
## Xanimalsseal -0.339488 0.356016 -0.954 0.343244
## Xanimalsshark -0.017927 0.418487 -0.043 0.965940
## Xanimalsstingray 0.405258 0.455091 0.890 0.375937
## Xanimalsturtle 0.666067 0.475284 1.401 0.165059
## Xanimalswhale 0.202776 0.436273 0.465 0.643375
## Xpost_numeric 0.741014 0.235550 3.146 0.002344
## Xpost_creatorabyssdivecenteramed 0.496394 0.923018 0.538 0.592250
## Xpost_creatoralbertdrosphotography -0.022474 0.698749 -0.032 0.974425
## Xpost_creatoralex_dawson_photography 0.657473 0.677250 0.971 0.334650
## Xpost_creatorally.photog 0.220894 0.870164 0.254 0.800277
## Xpost_creatorash.aboulfadl 0.099989 0.896120 0.112 0.911444
## Xpost_creatorbenjhicks -0.384841 0.927610 -0.415 0.679373
## Xpost_creatorbertiegregory 0.134112 0.851943 0.157 0.875321
## Xpost_creatorbeyer_photo -0.478955 0.955074 -0.501 0.617443
## Xpost_creatorbreeesea 0.630344 0.898821 0.701 0.485202
## Xpost_creatorbrianskerry 0.277879 0.903952 0.307 0.759354
## Xpost_creatorcainedelacy -0.906558 0.889250 -1.019 0.311135
## Xpost_creatorcaptaindannyfrank -0.537179 0.896120 -0.599 0.550610
## Xpost_creatorcooperlost 1.218350 0.958744 1.271 0.207585
## Xpost_creatordanielkordan 0.043162 0.601708 0.072 0.942998
## Xpost_creatordave_dives -0.074324 0.737173 -0.101 0.919949
## Xpost_creatordavey_roger -1.118538 0.863241 -1.296 0.198886
## Xpost_creatordaviddiez -0.753979 0.934195 -0.807 0.422069
## Xpost_creatordavidochoapt -0.887956 0.751814 -1.181 0.241158
## Xpost_creatordylan.dehass -0.146883 0.743203 -0.198 0.843845
## Xpost_creatoredgar_pacific_photography 0.344163 0.900859 0.382 0.703471
## Xpost_creatorflorian 0.273644 0.801713 0.341 0.733777
## Xpost_creatorgeorgekarbus -0.375923 0.775041 -0.485 0.629010
## Xpost_creatorhenelyspiers 0.795833 0.908721 0.876 0.383843
## Xpost_creatoribrahim.elharirys -0.144703 0.887493 -0.163 0.870903
## Xpost_creatorjackmcnn -1.479181 0.880017 -1.681 0.096792
## Xpost_creatorjacobguy.media -0.014536 0.913210 -0.016 0.987341
## Xpost_creatorjakemasondiving 0.135287 0.784461 0.172 0.863524
## Xpost_creatorjasonthejaw 0.734611 0.910814 0.807 0.422382
## Xpost_creatorjchauser 0.654473 0.905642 0.723 0.472049
## Xpost_creatorjferraragallery 0.009755 0.872753 0.011 0.991111
## Xpost_creatorjordan_robins -0.778199 0.770305 -1.010 0.315500
## Xpost_creatorjoshg_photos 0.448185 0.838884 0.534 0.594679
## Xpost_creatorjuansharks 0.232578 0.880017 0.264 0.792255
## Xpost_creatork_ane_ -0.540368 0.896120 -0.603 0.548252
## Xpost_creatorkristianlainephotography -0.108557 0.885563 -0.123 0.902751
## Xpost_creatorkristimakusha 1.006517 0.841324 1.196 0.235185
## Xpost_creatorkylesoto -0.494333 0.868143 -0.569 0.570711
## Xpost_creatorloriannah 0.322351 0.795041 0.405 0.686256
## Xpost_creatormai_enjooy -0.130292 0.713259 -0.183 0.855530
## Xpost_creatormarkian.b 0.439044 0.717722 0.612 0.542503
## Xpost_creatormarynadb -0.815315 0.766083 -1.064 0.290492
## Xpost_creatormaster.blaster 0.593467 0.751536 0.790 0.432113
## Xpost_creatormathis_delbono 0.225352 0.805135 0.280 0.780300
## Xpost_creatormaxijonas -0.712155 0.806538 -0.883 0.379961
## Xpost_creatormaxwelhohn -0.475729 0.756455 -0.629 0.531257
## Xpost_creatormichaelboyyd 0.071092 0.834516 0.085 0.932329
## Xpost_creatormikecoots -0.010275 0.864335 -0.012 0.990545
## Xpost_creatormitchellpettigrew -1.137343 0.838603 -1.356 0.178937
## Xpost_creatormoodhumeehaa -0.293115 0.893558 -0.328 0.743767
## Xpost_creatormoore_rachel -0.336438 0.916960 -0.367 0.714683
## Xpost_creatormr.seaoctopus -0.938452 0.889250 -1.055 0.294533
## Xpost_creatormultiple 1.007119 0.929583 1.083 0.281965
## Xpost_creatormyphilbaraequestrianlife 0.414623 0.888429 0.467 0.642021
## Xpost_creatornatsoffthegrid 0.372774 0.800607 0.466 0.642789
## Xpost_creatornohumanz_ -1.564035 1.071788 -1.459 0.148505
## Xpost_creatoroceanraysphotography -0.265260 0.870017 -0.305 0.761261
## Xpost_creatorpaulnicklen -0.050157 0.716235 -0.070 0.944350
## Xpost_creatorpelayosalinas 0.688191 0.771588 0.892 0.375182
## Xpost_creatorpietvandenbemd 0.480712 0.934930 0.514 0.608587
## Xpost_creatorquarterlifecrisis.travel -0.305041 0.745049 -0.409 0.683351
## Xpost_creatorrafa.mesquita -0.776244 0.778587 -0.997 0.321852
## Xpost_creatorrafafdezjr 0.066453 0.617513 0.108 0.914579
## Xpost_creatorreedplummerimages 0.578441 0.896045 0.646 0.520466
## Xpost_creatorrobertsduran 0.073252 0.750152 0.098 0.922461
## Xpost_creatorroiegalitz -0.129396 0.742663 -0.174 0.862134
## Xpost_creatorromanhurghada 0.911583 1.009942 0.903 0.369513
## Xpost_creatorroneydives -0.016089 0.812321 -0.020 0.984249
## Xpost_creatorrosslongphoto 0.596785 0.684021 0.872 0.385632
## Xpost_creatorsdmdiving 0.337047 0.770862 0.437 0.663150
## Xpost_creatorseanscottphotography -0.669393 0.896120 -0.747 0.457316
## Xpost_creatorshanegrossphoto 0.191102 0.862625 0.222 0.825254
## Xpost_creatorsideytheshark -0.214951 0.710930 -0.302 0.763188
## Xpost_creatortammy_ascher -0.108096 0.723135 -0.149 0.881559
## Xpost_creatorthe_story_of_a_biologist -1.755577 1.077266 -1.630 0.107206
## Xpost_creatorthemilkywaychaser 1.380539 0.889250 1.552 0.124597
## Xpost_creatorthurstonphoto -0.386716 0.666643 -0.580 0.563522
## Xpost_creatortimmboslice_ -0.800182 0.898283 -0.891 0.375781
## Xpost_creatortobiasvisuals 0.257365 0.838494 0.307 0.759709
## Xpost_creatortomcannonphotography -0.366320 0.730909 -0.501 0.617653
## Xpost_creatorundersea_gameqmi 0.213170 0.805605 0.265 0.792009
## Xpost_creatorunderwater_hadrien -0.285902 0.781792 -0.366 0.715578
## Xpost_creatorunderwater_images 0.258685 0.784989 0.330 0.742631
## Xpost_creatoruw.ocean.visuals -0.031145 0.727355 -0.043 0.965954
## Xpost_creatorwandering.westerner 0.651782 0.896120 0.727 0.469197
## Xpost_creatorwildvisuals131 -0.241644 0.896045 -0.270 0.788119
## Xpost_creatorwilliamdrumm 0.395082 0.682608 0.579 0.564404
## Xpost_creatoryuki.ando.37 -0.473495 0.896045 -0.528 0.598703
## Xpost_creatoryves_adams 2.068139 0.934084 2.214 0.029744
## Xcarosel_count 0.049935 0.030317 1.647 0.103567
##
## (Intercept) ***
## Xanimalscrab
## Xanimalseel
## Xanimalsfish
## Xanimalsjellyfish .
## Xanimalsnone
## XanimalsOther
## Xanimalspenguin
## Xanimalspolar_bear
## Xanimalsseal
## Xanimalsshark
## Xanimalsstingray
## Xanimalsturtle
## Xanimalswhale
## Xpost_numeric **
## Xpost_creatorabyssdivecenteramed
## Xpost_creatoralbertdrosphotography
## Xpost_creatoralex_dawson_photography
## Xpost_creatorally.photog
## Xpost_creatorash.aboulfadl
## Xpost_creatorbenjhicks
## Xpost_creatorbertiegregory
## Xpost_creatorbeyer_photo
## Xpost_creatorbreeesea
## Xpost_creatorbrianskerry
## Xpost_creatorcainedelacy
## Xpost_creatorcaptaindannyfrank
## Xpost_creatorcooperlost
## Xpost_creatordanielkordan
## Xpost_creatordave_dives
## Xpost_creatordavey_roger
## Xpost_creatordaviddiez
## Xpost_creatordavidochoapt
## Xpost_creatordylan.dehass
## Xpost_creatoredgar_pacific_photography
## Xpost_creatorflorian
## Xpost_creatorgeorgekarbus
## Xpost_creatorhenelyspiers
## Xpost_creatoribrahim.elharirys
## Xpost_creatorjackmcnn .
## Xpost_creatorjacobguy.media
## Xpost_creatorjakemasondiving
## Xpost_creatorjasonthejaw
## Xpost_creatorjchauser
## Xpost_creatorjferraragallery
## Xpost_creatorjordan_robins
## Xpost_creatorjoshg_photos
## Xpost_creatorjuansharks
## Xpost_creatork_ane_
## Xpost_creatorkristianlainephotography
## Xpost_creatorkristimakusha
## Xpost_creatorkylesoto
## Xpost_creatorloriannah
## Xpost_creatormai_enjooy
## Xpost_creatormarkian.b
## Xpost_creatormarynadb
## Xpost_creatormaster.blaster
## Xpost_creatormathis_delbono
## Xpost_creatormaxijonas
## Xpost_creatormaxwelhohn
## Xpost_creatormichaelboyyd
## Xpost_creatormikecoots
## Xpost_creatormitchellpettigrew
## Xpost_creatormoodhumeehaa
## Xpost_creatormoore_rachel
## Xpost_creatormr.seaoctopus
## Xpost_creatormultiple
## Xpost_creatormyphilbaraequestrianlife
## Xpost_creatornatsoffthegrid
## Xpost_creatornohumanz_
## Xpost_creatoroceanraysphotography
## Xpost_creatorpaulnicklen
## Xpost_creatorpelayosalinas
## Xpost_creatorpietvandenbemd
## Xpost_creatorquarterlifecrisis.travel
## Xpost_creatorrafa.mesquita
## Xpost_creatorrafafdezjr
## Xpost_creatorreedplummerimages
## Xpost_creatorrobertsduran
## Xpost_creatorroiegalitz
## Xpost_creatorromanhurghada
## Xpost_creatorroneydives
## Xpost_creatorrosslongphoto
## Xpost_creatorsdmdiving
## Xpost_creatorseanscottphotography
## Xpost_creatorshanegrossphoto
## Xpost_creatorsideytheshark
## Xpost_creatortammy_ascher
## Xpost_creatorthe_story_of_a_biologist
## Xpost_creatorthemilkywaychaser
## Xpost_creatorthurstonphoto
## Xpost_creatortimmboslice_
## Xpost_creatortobiasvisuals
## Xpost_creatortomcannonphotography
## Xpost_creatorundersea_gameqmi
## Xpost_creatorunderwater_hadrien
## Xpost_creatorunderwater_images
## Xpost_creatoruw.ocean.visuals
## Xpost_creatorwandering.westerner
## Xpost_creatorwildvisuals131
## Xpost_creatorwilliamdrumm
## Xpost_creatoryuki.ando.37
## Xpost_creatoryves_adams *
## Xcarosel_count
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5693 on 78 degrees of freedom
## Multiple R-squared: 0.7042, Adjusted R-squared: 0.3137
## F-statistic: 1.803 on 103 and 78 DF, p-value: 0.003419
Using the model’s predictions, we created a plot where blue dots represent the actual log earnings of each post (Y-axis) compared to the predicted log earnings (X-axis), with a nonlinear trend line (LOESS with a span of 0.6) running through the data. The orange trend line captures the overall relationship, while the shaded gray area shows the confidence interval. The plot clearly shows that there is a meaningful relationship between the variables and post earnings, which suggests some predictive power. However, there are a few big caveats. Although the model’s multiple \(R^2\) is 0.7042, the adjusted \(R^2\) drops sharply to 0.3137, indicating substantial overfitting which is consistent with the 103 predictors used to fit the model. The residual standard error of 0.5693 suggests that, on average, predictions are off by about 14% of the total range of log_money (4.0543), which is not ideal in practice. Nonetheless, the F-statistic is significant (p = 0.003), and the adjusted \(R^2\) is meaningfully better than simply predicting the mean, which offers some value for post optimization. While it is possible to predict the amount earned by a post using pre-posting variables with better accuracy than the average, the model is primarily useful as a rough heuristic rather than a precise forecasting tool.
Our third and final research question aims to explore the relationship between reach, post type, and number of shares. As we found through our earlier questions, reach is one the most important factors when it comes to earnings. Exploring the kind of media we post as well as the way accounts are reached through shares can give us more insight.
library(tidyverse)
posts <- read_csv("~/Desktop/ocean_data.csv")
## Rows: 182 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): post_creator, animal_type, post_type
## dbl (4): money_earned, share_count, save_count, carosel_count
## num (1): reach_count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(ggplot2)
library(dplyr)
posts <- posts %>%
mutate(share_level = cut(share_count,
breaks = quantile(share_count, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE))
ggplot(posts, aes(x = share_level, y = reach_count, fill = post_type)) +
geom_violin(trim = FALSE, position = position_dodge(width = 0.8)) +
stat_summary(fun = "median", geom = "point",
position = position_dodge(width = 0.8),
size = 2, color = "black") +
scale_y_log10(labels = scales::comma) +
labs(title = "Distribution of Reach by Share Level and Post Type",
x = "Share Level (Equal-Size Bins by Share Count)",
y = "Reach (log scale)",
fill = "Post Type")
Since reach count has really large values, it was difficult to see the
shape of the violin plot. Using the log scale helped us visualize this
better. Second, we created bins for share level based on share count, so
for example low contains the first 1/3 of the range of shares. Based off
of this plot, we can see that video posts consistently show higher
median reach than photo posts across all share levels. The spread of
reach is wider at higher share levels, especially for videos. Notably,
Low share photo posts exhibit a greater risk of extremely low reach.
posts |>
ggplot(aes(x = share_count, y = reach_count, color = post_type)) +
geom_point(alpha = 0.7) +
scale_y_log10(labels = scales::comma) +
scale_x_log10(labels = scales::comma) +
geom_density2d() +
labs(title = "Relationship Between Share Count and Reach (log scale x and y axis)",
x = "Share Count",
y = "Reach",
color = "Post Type")
The contour plot shows that video posts generally achieve higher reach
for a given share count compared to photo posts, as their contour lines
are positioned higher on the y-axis. This suggests videos have a
stronger viral potential, likely due to higher engagement or algorithmic
favorability. However, the significant overlap between the two post
types indicates that photos can still perform comparably to videos in
certain cases—particularly at lower share counts. This overlap implies
that while videos may have an edge, photo content remains a viable
option for high reach. The results highlight the importance of testing
both formats to optimize campaign strategies.
In terms of future work, we could apply the findings of this project to see if future posts earn more than the ones in this data set. The questions answered by this project allow for applications toward future growth of the instagram page, and so the nature of this project and our questions is to leave area for growth.