Sophia.org Exploring and Analyzing Data


Sophia.org 

Project Part 3: Exploring and Analyzing Data (40 points)

December 12, 2011

  • How did you come up with the idea?  (2 points)

Computer-enhanced learning has grown steadily over the years. Online college courses currently account for close to 30% of student enrollment and have experienced average annual growth of over 10% since 2002 (http://sloanconsortium.org). While traditional higher education institutions have grown on average by 2% per year, including growth of 1.2% from the fall of 2008 to the fall of 2009, online education grew by 21.1% over that same period. Not only is formal online education growing, but informal online education has also grown at a rapid pace. YouTube was created in 2005 by three former PayPal employees and rapidly became the second largest search engine, with roughly 65,000 new videos uploaded every 24 hours. While many of these videos are seen by only a handful of people, others are seen by hundreds of thousands or even millions of users. Independent educators have decided to create educational videos to share freely online so that anyone can use them and benefit from their teaching methods and experience. One of the best known of these independent teaching initiatives is the Khan Academy, created by Salman Khan. Over the course of a few years, Salman has produced over two thousand videos covering hundreds of subjects, with thousands of viewers per video. The Khan Academy has since partnered with the Gates Foundation and was one of the inspirations for Don Smithmier, the CEO of Sophia, to leave his position as a VP of Capella University and found Sophia.org: an online educational packet sharing site where any individual can develop a packet covering an accepted educational topic, which can then be used by anyone over the internet and rated for quality by users and other instructors. Sophia.org's potential is to decentralize and personalize learning by creating a site where hundreds of inspired individuals can openly share their knowledge and create educational packets.

 

Sophia.org, located in Minneapolis, Minnesota, offers students around the state and the world the possibility of improving both their learning and their teaching skills by using and creating educational packets. A site such as Sophia, where developers can create educational packets that include videos, PowerPoints, essays, and other resources, is now possible partly because of recent technological changes: the continuously decreasing cost of processing power and computer size per dollar, and the expanding possibilities of information and communication technologies (ICT). As the price per calculation continues to fall, and programmers build on the experience of other programmers, particularly when building on openly available Open Source Software projects, it becomes difficult to predict how these initiatives will impact education. There are many projects such as Sophia that are increasingly being used by learners and educators. So far, it is too early to tell which of these projects or companies will become the dominant players in this rapidly expanding market. Perhaps even more difficult than predicting which of these projects will become the next Dot Com success or Dot Com bust is measuring what impact these initiatives are having on the educational achievement of their users. Because their impact is so poorly understood, it is important to further study and analyze these projects. This paper takes on part of that challenge by gathering additional information about Sophia.org. It is important to better understand the appeal of this site and learn more about how it is being used by teachers and learners. While many questions would be best answered through qualitative analysis and methods such as surveys, interviews, and focus groups, this paper looks at the quantitative data about the site and its resources that is openly available on the internet and asks how these data can also improve our understanding of Sophia. Sophia was chosen because of my interest in the general area of study as well as its geographical proximity; I had the opportunity to visit Sophia.org earlier this semester to learn more about the project.

 

  • What are your hypotheses (e.g., how you think the two samples will compare)?  Please write these hypotheses out in words AND using appropriate symbols.  You should include both a null and alternative hypothesis.      (2 points)

 

Using primarily data that is available to anyone who visits the site, this paper compares 60 educational packets from two categories (Humanities and Sciences) found on Sophia.org. These categories were created to guide both users and creators to where they can best locate the packets they are searching for. Sophia.org employs a taxonomic organizational system that includes, among other things, 10 major categories, two of which are Humanities and Sciences. While the project originally planned to compare Sophia.org with a different organization such as UDemy.com, Wikieducator.org, or AcademicEarth.com, for the purposes of this assignment the data gathered and compared is all from Sophia.org. It was obtained by selecting two of the main categories at random and then comparing the popularity, measured as the number of views in a month, of Science packets against Humanities packets. The null hypothesis is that both kinds of packets have, on average, the same number of views per month, or H0: µS = µH, where µS and µH are the mean monthly views of Science and Humanities packets respectively; that is, any difference in monthly views between the two samples is not statistically significant. The alternative hypothesis is that, on average, Science packets receive more views per month than Humanities packets, or Ha: µS > µH (a one-sided alternative). Since no packet belongs to both categories, the samples are independent. I therefore used SPSS to calculate the p-value from a one-sided independent-samples t-test.
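The same test can be reproduced outside SPSS. Below is a minimal sketch in Python using SciPy; the two lists of monthly view counts are placeholders standing in for the 30 Science and 30 Humanities packets actually collected, not the real data.

```python
# Sketch of the one-sided independent-samples t-test (H0: mu_S = mu_H vs. Ha: mu_S > mu_H).
# The view counts below are placeholders, not the actual Sophia.org data.
from scipy import stats

science_views = [2198, 950, 435, 362, 362, 120, 20]    # hypothetical monthly views, Science packets
humanities_views = [3737, 104, 98, 30, 30, 12, 4]      # hypothetical monthly views, Humanities packets

# Welch's t-test (equal variances not assumed), one-sided alternative.
# The 'alternative' keyword needs SciPy >= 1.6; on older versions, run the
# two-sided test and halve the p-value when the observed difference is in
# the hypothesized direction.
t_stat, p_one_sided = stats.ttest_ind(science_views, humanities_views,
                                      equal_var=False, alternative='greater')
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.3f}")
```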

 

  • What is the reason for your hypotheses (e.g., why do you think the samples will differ in the way you predict)?  (2 points)

 

After randomly deciding on two categories to compare, I accessed the list of resources classified under each of the two categories. A basic overview of the data suggested that the Science packets were reviewed more often and visited more than the Humanities packets. While there is no way of knowing whether the Science packets began to be developed first, and therefore had a higher total number of views per packet, or whether only a few very popular packets with large numbers of viewers were skewing the results, it seemed likely that, in addition to Science packets having on average a greater total number of views per packet (a statistic the site records for each packet from the moment it is posted), they also had a significantly higher number of views per month. While this project could not make a comparison using historical data or the average growth rate of the resources in each category, it could collect information from a sample of resources in both categories and conduct a statistical test to determine whether there is a statistically significant difference in the degree of use of these two packet categories during the month the statistics were collected. If the alternative hypothesis is supported by this comparison, it could be argued that there is a greater demand or need for educational packets in the Sciences than in the Humanities. After collecting the data, this difference appeared to be present, but a statistical t-test was needed to determine whether it was significant.

 

  • How did you gather your data?  (2 points)

 

The data was gathered over a period of a month. The project originally planned to collect data from two different websites or projects and then compare the usage of a similar type of educational packet across them (for example, the use of history packets in UDemy.com vs. Sophia.org vs. AcademicEarth.com). After attempting to collect comparable data from multiple sites for a number of weeks, however, the data shared publicly by the different programs proved too different for a statistical comparison to be possible. Whereas UDemy.com focuses on the number of registered users for a course, and some of its courses are only accessible for a price, Sophia.org does not require users to register to use a resource and instead reports the number of people who visit the webpage of the educational packet. On Sophia.org a visitor may not intend to use the packet and may simply be reading to find out whether it is useful, whereas the registration required for certain resources on UDemy.com makes it more likely that a registered user actually uses them. Because of these differences, and similar differences with other programs, this study focuses only on Sophia.

 

Once that decision was made, obtaining the data required accessing the site, selecting the two categories, and then adding to a spreadsheet information about the top 30 packets that appeared when clicking on each category. It is possible that the website has an algorithm that determines which resources are displayed first; I was therefore unable to select the samples at random. However, since the same selection criterion was used for the Sciences and the Humanities packets, the samples are comparable. They also represent only a sample of what is available in each category. Once the 30 resources in each category were selected, I visited each resource and recorded a series of variables: type of packet, packet name, packet URL, number of packet views, number of packet shares, number of packet followers, packet rating, number of packet ratings, copyright license used for the resource, Sophia user score of the packet creator, number of followers of the packet creator, and the number of people the packet creator is following.

 

Several of these variables were collected on two occasions, yielding a value for September and a value for October. To improve the quality and reliability of the data, each month's values were collected over a few hours in a single day. The variables collected twice, for the purposes of a future comparison, were: number of packet views (Sept, Oct), number of packet shares (Sept, Oct), packet rating (Sept, Oct), and number of packet ratings (Sept, Oct). As mentioned before, once the packets were selected, gathering the data was simple, as the data is openly available.
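As a rough illustration of how these records could be organized, the sketch below loads such a spreadsheet into pandas and computes the September-to-October change in views; the file name and column labels are hypothetical, but the fields match the variables listed above.

```python
# Hypothetical layout of the collected spreadsheet; column names are illustrative only.
import pandas as pd

columns = [
    "packet_type", "packet_name", "packet_url",
    "views_sept", "views_oct", "shares_sept", "shares_oct",
    "followers", "rating_sept", "rating_oct",
    "n_ratings_sept", "n_ratings_oct", "license",
    "creator_score", "creator_followers", "creator_following",
]

# Load the 60 collected packets (30 Science, 30 Humanities) from a CSV export.
packets = pd.read_csv("sophia_packets.csv", usecols=columns)

# Change in views over the month, used later as the "Sept - Oct Views Comp" variable.
packets["views_change"] = packets["views_oct"] - packets["views_sept"]
print(packets.groupby("packet_type")["views_change"].mean())
```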

Graphs and Descriptive Statistics (10 points)

Statistics for Science Packets:

 

 

OVERALL SHAPE: All three curves are unimodal, yet none resembles a normal distribution; all three are skewed to the right. It seems, however, that the Science packet curves have a broader range than the Humanities curves. The shape can be seen in greater detail in the graphs below and is closer to a reverse J-curve than to a normal distribution. This is probably because of the low entry cost of producing an educational packet: there is a large quantity of packets, but many of them have only a small number of viewers.
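The summary statistics and shape measures reported below could be computed directly from the collected view counts, as in the sketch that follows; the series of counts is a placeholder standing in for one of the recorded view-count columns.

```python
# Descriptive statistics and a quick check of the right-skewed, reverse-J shape.
# 'views_sept' is a placeholder Series standing in for the 30 September view counts of one category.
import pandas as pd
from scipy.stats import skew

views_sept = pd.Series([2198, 1100, 840, 612, 436, 435, 362, 362, 150, 20])  # hypothetical counts

print(views_sept.describe())          # n, mean, std, min, quartiles, max
print("median:", views_sept.median())
print("range:", views_sept.max() - views_sept.min())
print("skewness:", skew(views_sept))  # > 0 indicates a right-skewed (reverse-J) shape
```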

 

 

Sept Packet Views
  N (Valid): 30              Missing: 0
  Mean: 615.40               Std. Error of Mean: 86.711
  Median: 435.50             Mode: 362
  Std. Deviation: 474.937    Variance: 225565.076
  Range: 2178                Minimum: 20          Maximum: 2198
  Sum: 18462

 

 

Shape: The curve is unimodal and skewed to the right. Over time it could come to look more like a normal distribution, yet based on current trends it will more likely increasingly resemble a reverse J-curve. If packets that receive no visits were deleted over time, the distribution would likely be different.

Center / Location: The curve has a mean value of 615, which is to say that, according to the statistics collected in September, the average number of views across the top 30 packets was 615. The median, the value with half of the observations above it and half below, was 435.50, which indicates that outliers to the right are contributing to the skewness of the curve.

Spread / Variability: The curve has a range of 2178 and a standard deviation of 474.937. While the range spans a difference of over 2,000 views, most values lie close to the mean of 615.

 

 

Oct Packet Views
  N (Valid): 30              Missing: 0
  Mean: 803.33               Std. Error of Mean: 102.495
  Median: 607.00             Mode: 97 (multiple modes; smallest shown)
  Std. Deviation: 561.387    Variance: 315155.678
  Range: 2306                Minimum: 97          Maximum: 2403
  Sum: 24100

 

 

Shape: This curve is also unimodal and skewed to the right, yet over the past month the range and the mean have both increased; interestingly, the standard deviation has increased as well.

Center / Location: Unlike what I originally predicted, this curve more closely resembles a normal distribution and may be an indication of future trends, yet the mean (803.33) and the median (607) remain far from the maximum value.

Spread / Variability: The range is greater than in the previous month, now spanning over 2,300 views. The standard deviation also increased, from 474.94 to 561.39. Along with this change, the variance has, as expected, increased.

 

 

 

Sept – Oct Views Comp
  N (Valid): 30                Missing: 0
  Mean: 187.9333               Std. Error of Mean: 24.56719
  Median: 142.5000             Mode: 25.00 (multiple modes; smallest shown)
  Std. Deviation: 134.56006    Variance: 18106.409
  Range: 508.00                Minimum: 25.00       Maximum: 533.00
  Sum: 5638.00

 

Shape: When comparing the growth in views of Science packets from September to October, the curve is still skewed to the right, but the values are much smaller. It looks closer to a normal curve and less like a reverse J-curve, a shift already noticeable when moving from the September views curve to the October views curve. The curve is unimodal.

Center / Location: The center of the curve can be described by the mean or the median. As with the other two graphs, the mode is not a relevant value, since in two of the graphs no values were repeated. The mean is 187.93 and the median is 142.5.

Spread / Variability: Unlike the other two graphs, the range of this graph is substantially smaller; since it only represents the views gained over a single month, this is to be expected. The range is barely over 500 views. The variance is much smaller than in the other two Science graphs, and the standard deviation is also much smaller, at 134.56.

 

Statistics for Humanities Packets:

 

Sept Packet Views
  N (Valid): 30              Missing: 0
  Mean: 383.50               Std. Error of Mean: 132.556
  Median: 104.00             Mode: 30 (multiple modes; smallest shown)
  Std. Deviation: 726.039    Variance: 527132.672
  Range: 3733                Minimum: 4           Maximum: 3737
  Sum: 11505

 

 

Shape: While there are also 30 packets for the humanities graph, the graph is substantially more skewed (to the right). One outlier is having a major impact on the spread of the graph. This greatly affects the shape as the curve also resembles a reverse j-curve. The curve is unimodal. It does not resemble a normal distribution.

Center / Location: For this graph the median is a better indicator of the center than the mean. The use of the median as the most accurate center is due to the degree to which this curve is skewed. This curve has a mean of 383.5 and a median value of 104.

Spread / Variability: The variance of this curve is greater than the variance of any of the Science packet curves. This curve has a range of 3733 with a maximum value of 3737 and a large standard deviation of 726 despite half of the values being below 104.

 

 

 

 

Oct Packet Views
  N (Valid): 30              Missing: 0
  Mean: 481.30               Std. Error of Mean: 163.569
  Median: 127.00             Mode: 112
  Std. Deviation: 895.902    Variance: 802639.803
  Range: 4616                Minimum: 19          Maximum: 4635
  Sum: 14439

 

 

Shape: This graph is the most skewed of all the graphs. It is also skewed to the right and resembles a reverse J-curve more than a normal distribution. The curve is unimodal. There is again a very clear outlier, which may or may not be indicative of a trend; it is possible that very high-quality or highly valued Humanities packets obtain much higher visibility than the other Humanities packets. Collecting data on future packets as the population grows may help answer this question.

Center / Location: As with the previous graph, the most adequate measure of center for this graph is the median (127 views). While for a normal curve the mean might be the more adequate statistic, with such a distant outlier influencing the curve the median is more indicative of the center than the mean (481 views).

Spread / Variability: The variability of this curve is the largest of all, with a range of 4616 views. The variance and the standard deviation are likewise the largest. While half of the values are below 127, the standard deviation is 895.90 views.

 

 

Sept – Oct Views Comp
  N (Valid): 30                Missing: 0
  Mean: 97.8000                Std. Error of Mean: 31.46052
  Median: 32.5000              Mode: 15.00 (multiple modes; smallest shown)
  Std. Deviation: 172.31635    Variance: 29692.924
  Range: 894.00                Minimum: 4.00        Maximum: 898.00
  Sum: 2934.00

 

 

Shape: This curve is also heavily skewed to the right and also resembles a reverse J-curve, although to a lesser extent than the two other Humanities curves. Since this curve represents a difference, it has a smaller center and range of values. As with the rest of the curves, this curve is unimodal. It does not resemble a normal distribution.

Center / Location: For this curve the median is again more indicative of the center than the mean. The mean value is 97.8 views, while the median is 32.5 views. As with the other curves, the mode is not practically meaningful, since few or no values are repeated in any of the six curves.

Spread / Variability: Being based on the change in views of Humanities packets from September to October, this curve's range, standard deviation, and variance are smaller than those of the other two Humanities curves. It has a range of 894 views and a standard deviation of 172.32 views.

 

Verifying Necessary Data Conditions (4 points)

 

The data analyzed in this assignment are grouped into two independent samples. When conducting a t-test it is important to have a large sample size: the larger the samples, the more representative they are of the population distribution and the smaller the sampling error. Some sample distributions resemble a normal distribution while others do not; in the cases discussed above, most resembled a reverse J-curve, as the number of packets drops off quickly when plotted by number of views. Another problem with the data collected is that there are strong outliers, particularly in the Humanities curves. Despite the sample size of 30 (usually considered large), the high level of skewness means that more cases would have been beneficial. Having mentioned some of the problems with these data, I will now test for a significant difference between the two groups, Humanities and Science packets, keeping in mind that the data are skewed and that, while part of the selection was randomized, the items were selected from a website with its own classification rules, which may have influenced the selection process. However, many other factors were controlled for: the data were collected from the same site, the site went live and started counting views for both categories simultaneously, and the selection rules for one group were exactly the same as for the other.
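The informal checks described here can also be made explicit. The sketch below flags outliers with the 1.5 × IQR rule and runs Levene's test for equality of variances (the same test reported in the SPSS output below); the two view-count arrays are placeholders standing in for the collected samples.

```python
# Checking t-test conditions: outliers via the 1.5 * IQR rule, and Levene's test
# for equality of variances. The samples below are placeholders, not the real data.
import numpy as np
from scipy import stats

humanities = np.array([3737, 104, 98, 30, 30, 12, 4], dtype=float)     # hypothetical
sciences = np.array([2198, 950, 435, 362, 362, 120, 20], dtype=float)  # hypothetical

def iqr_outliers(x):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print("Humanities outliers:", iqr_outliers(humanities))
print("Sciences outliers:", iqr_outliers(sciences))

# Levene's test: a large p-value means there is no evidence the variances differ.
stat, p = stats.levene(humanities, sciences)
print(f"Levene's test: F = {stat:.3f}, p = {p:.3f}")
```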

Conducting a hypothesis test (10 points)

A one-sided independent-samples t-test was conducted to test for a significant difference between the two groups.

Group Statistics

                        Type of Packet   N     Mean       Std. Deviation   Std. Error Mean
Sept Packet Views       Humanities       30    383.50     726.039          132.556
                        Sciences         30    615.40     474.937          86.711
Oct Packet Views        Humanities       30    481.30     895.902          163.569
                        Sciences         30    803.33     561.387          102.495
Sept – Oct Views Comp   Humanities       30    97.8000    172.31635        31.46052
                        Sciences         30    187.9333   134.56006        24.56719

 

Independent Samples Test: Levene's Test for Equality of Variances

                        F       Sig.
Sept Packet Views       .350    .556
Oct Packet Views        .535    .468
Sept – Oct Views Comp   .045    .832

 

 

Independent Samples Test: t-test for Equality of Means

                                                      t        df       Sig. (2-tailed)
Sept Packet Views       Equal variances assumed      -1.464    58       .149
                        Equal variances not assumed  -1.464    49.978   .149
Oct Packet Views        Equal variances assumed      -1.668    58       .101
                        Equal variances not assumed  -1.668    48.732   .102
Sept – Oct Views Comp   Equal variances assumed      -2.258    58       .028
                        Equal variances not assumed  -2.258    54.781   .028

 

 

Independent Samples Test: t-test for Equality of Means (continued)

                                                      Mean Difference   Std. Error Difference
Sept Packet Views       Equal variances assumed      -231.900           158.398
                        Equal variances not assumed  -231.900           158.398
Oct Packet Views        Equal variances assumed      -322.033           193.028
                        Equal variances not assumed  -322.033           193.028
Sept – Oct Views Comp   Equal variances assumed      -90.13333          39.91630
                        Equal variances not assumed  -90.13333          39.91630

 

 

 

 

  • Based on the results of the hypothesis test, do you reject or fail to reject Ho?  Why?  Are the results statistically significant?  Are they practically significant?  What is the p-value?  Interpret the p-value in your own words.  (4 points)

Because of the differences in the variances of the two groups, I did not assume equal variances; the p-values were therefore calculated with “equal variances not assumed.” The two-sided p-values obtained are .149, .102, and .028. Since this is a one-sided test and the observed differences are in the hypothesized direction, these values can be halved, giving one-sided p-values of .075, .051, and .014, which are compared against the alpha level of .05. According to these p-values, we can reject the null hypothesis when comparing the growth in views from September to October, as the one-sided p-value is .014: if the null hypothesis were true, a difference at least this extreme would be expected only about 1.4% of the time. The p-value, or probability value, indicates how likely we would be to obtain a test statistic as extreme as the one observed if the null hypothesis were true. The other one-sided p-values, .075 and .051, are close to the alpha level, yet the null hypothesis cannot be rejected for those comparisons, as doing so could result in a Type I error.
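To make the halving step concrete, the sketch below recomputes the one-sided p-value for the Sept – Oct comparison directly from the t statistic and Welch degrees of freedom reported in the SPSS output above (t = -2.258, df = 54.781); halving is valid here because the observed difference (Humanities minus Sciences) is in the hypothesized direction.

```python
# Recovering the one-sided p-value from the reported t statistic and degrees of freedom.
from scipy import stats

t_stat = -2.258   # Sept - Oct Views Comp, equal variances not assumed (Humanities minus Sciences)
df = 54.781

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # matches the reported Sig. (2-tailed) of about .028
p_one_sided = p_two_sided / 2                   # valid because the difference is in the hypothesized direction
print(f"two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```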

 

The null hypothesis is rejected based on the results of the third t-test, yet it cannot be rejected based on the other two results, which compared the total September and October views of Humanities and Science packets. Because the values were so heavily skewed, more cases would have been beneficial for finding out whether the first two tests could also be significant; both were close to another commonly used alpha value of .10. Using an alpha level of .05, however, only the difference in the growth in views from September to October between Humanities and Science packets was significant. This is partly visible in the means of the two types of packets: Humanities packets had a mean growth of 97.8 views while Science packets had a mean growth of 187.9 views. Similar differences were also visible in the means for the total number of views. Still, we can only safely conclude, with a one-sided p-value of .014, that the number of views of Science packets is increasing significantly faster than the number of views of Humanities packets. Because it is so unlikely that a difference this large would occur by chance alone, we can reject the null hypothesis, H0: µS = µH, and accept the alternative hypothesis, Ha: µS > µH, which as a one-sided test has a p-value of .014. More cases are needed to be certain whether the other differences are significant. To avoid a Type I error, I cannot reject the null hypothesis for the other two tests, but I may be committing a Type II error and would benefit from increasing the number of cases.

 

These results are also practically significant. It is clear that some types of packets are visited more often than others and that this difference is linked to their category. While the difference may also be linked to their ratings, it seems that more individuals visit Sophia to view, and possibly use, Science packets. This may be related to the relationship between computer science and mathematics, and more directly to the relationship of the Open Source Software movement to the Open Educational Resources movement. While this relationship cannot be determined from the data, understanding the difference could lead some businesses to market Humanities materials more aggressively in order to differentiate themselves, or to have the opposite reaction and reduce or stop the production of Humanities materials, so that the quality of the resources most likely to be used increases, the quality of Sophia increases, and marginal, low-quality products are discarded.

 

  • Based on the results of your hypothesis test, what kind of error could you have made?  Please explain, and indicate just how you might control for this kind of error.  (3 points)

 

As previously explained, because of the skewness of the data, more cases would have increased the validity of the t-test results; even so, and despite not assuming equal variances, the result of the third test is clearly significant. Since I selected an alpha level of .05, the other two tests, with two-sided p-values of .149 and .101, were not significant. The .149 value would not have been significant even at an alpha level of .10, while the .101 result is borderline and could have been judged either way; treating it as significant could have meant inappropriately rejecting the null hypothesis and committing a Type I error. It is also possible that, had more cases been included, both of those values would have been significant; if so, Type II errors may have occurred, but without further testing this conclusion cannot be reached. The only conclusion that can be drawn from the data at the .05 alpha level is that the third test was statistically significant, and it is therefore appropriate to reject the third null hypothesis.
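One way to judge how many more cases would be needed is a rough power calculation. The sketch below uses statsmodels to estimate the per-group sample size required to detect, with 80% power, an effect the size of the September difference in means; the effect size is derived from the group statistics reported above, and the result is only approximate given how skewed the data are.

```python
# Approximate sample-size calculation for the September comparison.
# Effect size (Cohen's d) is estimated from the reported group statistics;
# the result is only a rough guide because the data are heavily skewed.
import numpy as np
from statsmodels.stats.power import TTestIndPower

mean_diff = 615.40 - 383.50                      # Sciences minus Humanities, September views
pooled_sd = np.sqrt((474.937**2 + 726.039**2) / 2)
effect_size = mean_diff / pooled_sd              # roughly 0.38

n_per_group = TTestIndPower().solve_power(effect_size=effect_size,
                                          alpha=0.05,
                                          power=0.80,
                                          alternative='larger')
print(f"d = {effect_size:.2f}, required n per group = {n_per_group:.0f}")
```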

 

  • A confidence interval is automatically generated when you conduct a t-test.  Please indicate what this interval is and how it should be interpreted.  (2 points)

 

A confidence interval estimates the reliability of the data, or the relationship between the sample estimates and the population value: at a given confidence level, it gives a range of values that is likely to contain the true difference in population means. For this test I obtained the confidence intervals shown below. The greater the confidence level (usually 90%, 95%, or 99%), the wider the interval. Confidence intervals can help indicate whether sampling error is likely, as well as whether a particular estimate is unreliable and, as such, whether others may be unreliable too. For the Sept – Oct views comparison (equal variances not assumed), the 95% interval for the mean difference, (-170.13, -10.13), does not include zero, which is consistent with rejecting the null hypothesis for that comparison.

Independent Samples Test: 95% Confidence Interval of the Difference

                                                      Lower         Upper
Sept Packet Views       Equal variances assumed      -548.968       85.168
                        Equal variances not assumed  -550.055       86.255
Oct Packet Views        Equal variances assumed      -708.421       64.354
                        Equal variances not assumed  -709.992       65.925
Sept – Oct Views Comp   Equal variances assumed      -170.03449    -10.23218
                        Equal variances not assumed  -170.13457    -10.13210
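The interval for the Sept – Oct comparison (equal variances not assumed) can be reproduced from the summary output above: the mean difference plus or minus the critical t value times its standard error. A minimal sketch, using the mean difference, standard error, and Welch degrees of freedom from the tables:

```python
# Reproducing the 95% confidence interval for the Sept - Oct Views Comp difference
# (equal variances not assumed) from the reported summary statistics.
from scipy import stats

mean_diff = -90.13333   # Humanities minus Sciences
se_diff = 39.91630
df = 54.781

t_crit = stats.t.ppf(0.975, df)                  # critical value for a 95% interval
lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(f"95% CI: ({lower:.2f}, {upper:.2f})")     # approximately (-170.13, -10.13)
```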

 

Conclusion and Summary (8 points)

Did you discover anything that surprised you when you analyzed the data?  Do you think the results would have been different if you had bigger sample sizes?  If you had to do the project again, how would you do it differently?

You summarize your project and include some mention of how you came up with your idea, how you collected your data, and what you found when you explored and analyzed your data.  (3 points)

After collecting the data and analyzing the samples, we can conclude that there is a relationship between a packet's category and how often it is viewed. As previously mentioned, this may provide a market advantage for those who produce high-quality Humanities packets, or, more likely, lead businesses to consider expanding their Science packets, since there seems to be a greater use of and demand for them. This is not to say Humanities packets should be discarded, only that they are currently viewed less than Science packets. This is not a surprise, as the Khan Academy, one of the inspirations for Sophia.org, focuses primarily on Science content, and the university that began the OpenCourseWare movement by opening its courses to the public is also a Science-oriented institution. Sophia.org is experiencing the same trend, which may be due to Science packets not being as contextual as Humanities packets; Humanities packets may be more political and more specific to a particular place. In addition, Humanities instructors tend not to be as technology and innovation oriented as Science instructors. Sophia.org itself may be capitalizing on this trend, as it recently began offering a paid service of over 20,000 math-only videos to supplement classroom and home-school instruction. The findings of this study were helpful in illustrating this trend.

  • You discuss any shortcomings of the methods you used to gather data.  (3 points)

 

Unfortunately, with the site only in its first public year, and despite its rapid growth, there are only a limited number of packets on the Sophia website. Because of the small number of packets, the sample more closely resembles the population, but the packets were selected via the website's own ranking rather than through a randomizing process. While the categories were randomly selected, the site's classification also influenced the data selection process.

 

  • You talk about how you would do the project differently if you were to do it over again (2 points)

 

If this project were conducted again, additional data would be gathered. In addition, I would contact Sophia.org to ask for access to their site analytics. While I collected the openly available data on a per-month basis, these data are collected every day by the Sophia.org servers, and having access to their analytics would increase the depth and quality of the study. Another path to follow would be to compare a different site or more categories. However, as I found when first constructing this study, differences in how sites are organized make it difficult to compare across them. Waiting until more data is available would also help in developing a more comprehensive study of Sophia.org.