Learning Analytics: Tool Matrix / David Dornan
Tool (URL) / Description / Opportunities in Learning Analytics Solutions / Weaknesses/Concerns/Comments
Data
One of the biggest hurdles in developing learning analytics tools is establishing the data governance and privacy policies needed to access student data. The two initiatives in this section offer frameworks for opening access to student attention/learning data. The first provides a starting point for developing data collection standards; the second shows how and why delivering free open courses is not only feasible but also makes sense as a way of providing a community-based research environment in which to explore, develop and test learning theories and learning feedback mechanisms/tools.
PSLC (Pittsburgh Science of Learning Center) DataShop
/ The PSLC DataShop is a repository containing data from a variety of math, science, and language courses. / Data Standards
Initiatives like PSLC will help the learning analytics community develop standards for collecting, anonymizing and sharing student-level course data. / Convincing individual institutions to contribute to this type of data repository may be difficult, given that many institutions do not have the data governance/sharing policies needed to share this type of information even internally.
Open Learning Initiative
/ This is an exciting initiative taking place at Carnegie Mellon University. Students’ interactions with free online course material/activities provide a virtual learning analytics laboratory for experimenting with algorithms and feedback mechanisms. / From Solo Sport to Community-Based Research Activity
Herbert Simon from Carnegie Mellon University states that,
“Improvement in Post Secondary Education will require converting teaching from a ‘solo sport’ to a community based research activity.”
There are two common concerns related to conducting experiments using learning analytics:
1. Privacy concerns related to accessing student related data.
2. Ethical concerns related to testing different feedback/instructional response mechanisms.
By offering free courses to students with full disclosure of how their interactions will be tracked and analyzed, these two issues are no longer roadblocks to conducting learning analytics research. As learning materials/objects become commodities, what will be valued is the development of learning analytics tools that help guide and direct students, and this requires that institutions build expertise in developing and sustaining the communities needed to conduct community-based learning research.
Database Storage
The majority of current learning analytics initiatives are handled adequately using relational databases. However, as learning analytics programs begin to make use of the semantic web and social media tools, there will be a need to start exploring data storage technology that can handle large unstructured data sets. This section provides a brief description of the data storage options relevant to LA programs.
Relational Database / For years we have used relational databases to structure the data required for our analyses. Data is stored in tables consisting of rows and columns. The columns are well-defined attributes pertaining to the object represented by a table. There are good open source relational databases such as Greenplum and MySQL; however, most universities have standard supported RDBMS offerings. At the University of Guelph we support both SQL Server and Oracle's RDBMS. / Oracle provides a secure repository for structured data. The recent 11g release also provides integration with the R engine, permitting R to access data stored in the database.
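As a rough sketch of what this looks like from the analysis side, the R snippet below pulls structured activity data out of an Oracle RDBMS using the DBI/ROracle packages; the connection details and the lms_activity table are hypothetical.

# Minimal sketch: query a hypothetical LMS activity table from Oracle via DBI/ROracle
library(DBI)
library(ROracle)

drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "la_user", password = "********",
                 dbname = "LADB")   # hypothetical credentials and service name

# lms_activity (student_id, course_id, logins) is an illustrative table, not a real schema
activity <- dbGetQuery(con, "SELECT student_id, course_id, logins FROM lms_activity")
summary(activity$logins)

dbDisconnect(con)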
NoSQL Database/Hadoop/MapReduce / Hadoop is an Apache project inspired by Google's MapReduce and the Google File System. It has become a standard for distributing large unstructured data sets. It provides a framework that can distribute large data sets over a number of servers and can provide intermediate results as data flows through the framework's pipeline. / As learning analytics programs begin to make use of the semantic web and social media tools, there will be a need to start exploring data storage technology that can handle large unstructured data. / Universities have good relational database infrastructure, including expertise. As LA programs grow to include analysis of unstructured data, universities will need to develop the skills and capacity to offer Hadoop data storage and retrieval services.
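To make the map/reduce pattern concrete, here is a toy word-count sketch in plain R; the sample lines are invented, and a real Hadoop job would distribute the map and reduce steps across many servers rather than run them in a single R session.

# Toy map/reduce word count in base R (illustrative only, not a Hadoop job)
lines <- c("students post questions", "students answer questions")   # invented sample data

# Map step: emit per-line word counts
map_out <- lapply(lines, function(l) table(strsplit(tolower(l), " ")[[1]]))

# Reduce step: merge the per-line counts into one overall word count
word_count <- Reduce(function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}, map_out)

word_count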
EC2
/ There are a number of companies that lease access to processing via virtual servers. Amazon’s EC2 is a common cloud server option available to host applications. / It is becoming common for organizations to look at moving applications to the cloud. For many of the traditional services, like the RDBMS, there is resistance to cloud-based deployments, primarily due to privacy concerns and resistance to change. As LA programs require access to new technologies such as Hadoop and require infrequent but massive analytical cycles, there may be an opportunity to introduce cloud-based offerings such as EC2. / The first assignment for this course (the development of a LA tool) provided me an opportunity to deploy an application using EC2. EC2 is a great way to explore new technologies: if mistakes are made, one simply redeploys a new EC2 instance, and there are many publicly available instances that save time in deploying complete environments. In developing my LA tool, I deployed an Oracle XE instance (which required virtually no effort) and another Red Hat instance where I installed RevoDeployR. Since RevoDeployR was a new tool for me, I had to start over several times before completing a successful installation. It is possible to create backup images in EC2; however, it was not as intuitive as creating a new instance.
Data Cleansing/Integration
Prior to conducting data analysis and presenting it through visualizations, data must be acquired (extracted), integrated, cleansed and stored in an appropriate data structure. The tools that perform these tasks are commonly referred to as ETL tools. Given the need for both structured and unstructured data (as described in the section above), the ideal ETL tools will be able to access and load data to and from data sources including RSS feeds, API calls, RDBMSs and unstructured data stores such as Hadoop.
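As a rough sketch of the extract-cleanse-load pattern these tools automate, the same steps can be expressed in a few lines of R; the file name, column names and SQLite staging database below are all hypothetical.

# Sketch of an extract -> cleanse -> load pass (hypothetical file, columns and target)
library(DBI)
library(RSQLite)

# Extract: read a flat-file export of LMS logins
logins <- read.csv("lms_logins.csv", stringsAsFactors = FALSE)

# Cleanse: trim ids, drop blanks and duplicate rows
logins$student_id <- trimws(logins$student_id)
logins <- unique(logins[!is.na(logins$student_id) & logins$student_id != "", ])

# Load: write the cleansed rows into a relational staging table
con <- dbConnect(SQLite(), "la_staging.sqlite")
dbWriteTable(con, "stg_logins", logins, overwrite = TRUE)
dbDisconnect(con)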
Needlebase
/ Needlebase is a web-based web scraping tool that provides an easy-to-use interface to acquire, integrate and cleanse web-based data. As a user navigates a website tagging page elements of interest, Needlebase detects the underlying database structure and web navigation and automates the collection of the underlying data into a table. / Needlebase is a great tool for getting at a website's underlying data when direct access to the data is not available. I have used Needlebase to create a lookup table for archived National Occupation Codes and to create a lookup table for our undergraduate course calendar. / There is no API access to the Needlebase scripts that are created. It seems best for one-off extracts or for applications where the entire dataset is acquired using Needlebase tools; it does not seem all that useful for an integrated solution. One other restriction that I ran across using this tool was that it does not support accessing websites requiring authentication.
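For simple pages that expose their data in an HTML table, a comparable one-off extract can also be done directly in R with the XML package; the course-calendar URL below is purely hypothetical.

# One-off scrape of an HTML table into a lookup data frame (URL is hypothetical)
library(XML)

url <- "http://example.edu/undergraduate-calendar/courses.html"
calendar <- readHTMLTable(url, which = 1)   # first table on the page
head(calendar)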
Pentaho Data Integration
/ Pentaho Data Integration (PDI) is a powerful, easy-to-learn open source ETL tool that supports acquiring data from a variety of data sources including flat files, relational databases, Hadoop databases, RSS feeds, and RESTful API calls. It can also be used to cleanse and output data to the same list of data sources. / PDI provides a versatile ETL tool that can grow with the evolution of an institution's learning analytics program. For example, a LA program may initially start with institutional data that is easily accessible via institutional relational databases. As the program grows to include text mining and recommendation systems that require extracting unstructured data from outside the institution, the skills developed with PDI will accommodate the new sources of data collection and cleansing. / There are two concerns that I have with PDI:
1. Pentaho does not have built-in integration with R statistics; instead, its data mining integration focuses on a WEKA module.
2. Pentaho is moving away from the open source model. Originally PDI was an open source ETL tool called Kettle, developed by Matt Casters. Since Pentaho acquired Kettle (and Matt Casters), it has become a central piece of their subscription-based BI suite and the support costs are growing at a rapid pace. Twice I have budgeted for support on this product only to find that the support costs had more than doubled year over year.
Talend / Talend is another open source ETL tool that has many of the same features as PDI. The main differences between PDI and Talend are presented in the following blog post:
http://churriwifi.wordpress.com/2010/06/01/comparing-talend-open-studio-and-pentaho-data-integration-kettle/
The main difference from my perspective is that Talend is a code generator whereas PDI is not. I have also found PDI a much easier tool to learn and use. / Talend has the same strengths as described above, with the additional benefit of having built-in integration with R.
Yahoo Pipes / Yahoo provides this free web-based GUI tool that allows users to extract web-based data and create data streams that cleanse, filter or enhance the data prior to outputting it via an RSS feed. / Since PDI and Talend seem able to provide the same capabilities as Yahoo Pipes, I did not spend a great deal of time exploring it. However, it seems to me that Yahoo Pipes could provide the web scraping functionality that Needlebase provides, yet offer an RSS feed output that could be picked up by either Talend or Pentaho in order to schedule nightly loads. It might be a more efficient way to pass web-based data streams through various APIs prior to extraction using PDI. / The one concern that I have with respect to Yahoo Pipes is that some of the unstructured data requiring analysis in a LA system will be posts by students. If a free public service like Yahoo Pipes is used to stream data through various analytic APIs, we could potentially release personal student data.
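For completeness, the RSS output of such a pipe could also be picked up directly in R; the sketch below parses item titles and dates with the RCurl and XML packages, using a hypothetical feed URL.

# Read an RSS feed (e.g. the output of a pipe) and pull out item fields; the URL is hypothetical
library(RCurl)
library(XML)

feed_xml <- getURL("http://example.com/pipes/discussion-posts/rss")
doc <- xmlParse(feed_xml)
titles <- xpathSApply(doc, "//item/title", xmlValue)
dates  <- xpathSApply(doc, "//item/pubDate", xmlValue)
head(data.frame(title = titles, published = dates))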
Statistical Modeling
There are three major statistical software options: SAS, SPSS and R. All three are excellent for developing the analytic/predictive models used in learning analytics. This section focuses on R. The open source R project has numerous packages and commercial add-ons available that position it well to grow with any LA program. Given that many researchers are proficient in R, incorporating the R engine into a LA platform also offers an opportunity to engage faculty in the development of reusable models/algorithms.
R / R is an active open source project with numerous packages available to perform virtually any type of statistical modeling. / R's main strength is that it is widely used by the research community. Code for analyses is widely available and there are many packages to help with any type of analysis and presentation that might be of interest. Some of these include:
1)  Visualization:
a)  ggplot2 provides good charting functionality.
b)  googleVis provides an interface between R and the Google Visualization API.
2)  Text Mining:
a)  tm provides functions for manipulating text, including stripping whitespace, removing stop words and removing suffixes (stemming).
b)  openNLP identifies words as nouns, verbs, adjectives or adverbs (part-of-speech tagging).
c)  wordnet provides access to the WordNet library; this is often used to replace similar words with a common word prior to text analysis.
Here are a few articles that show the power of these text mining packages (a short tm sketch follows the links):
1. Creating a wordle using tm and ggplot - http://www.r-bloggers.com/building-a-better-word-cloud/
2. Provides an overview of conducting text analysis using R - http://www.jstatsoft.org/v25/i05/paper
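As a small illustration of the text mining packages listed above (the sample posts are invented), the following sketch runs a handful of tm cleaning steps and builds a document-term matrix:

# Minimal tm sketch: clean a few invented discussion posts and build a document-term matrix
library(tm)

posts <- c("Students struggled with the second quiz",
           "The quiz questions on recursion were confusing",
           "More practice questions before the quiz would help")   # invented sample posts

corpus <- Corpus(VectorSource(posts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)        # suffix stripping

dtm <- DocumentTermMatrix(corpus)
findFreqTerms(dtm, lowfreq = 2)               # terms used at least twice across the posts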
Oracle has also integrated R into its 11g RDBMS, allowing R models direct access to RDBMS data. / Although I really like R, there are two issues that may be of concern to some universities:
1)  Lack of support - only Revolution R provides commercial support for the R product.
2)  High level of expertise required to develop and maintain R - how does a university retain people with the skills required to develop and maintain R/RevoDeployR? Since many faculty and students are proficient with R, perhaps building a platform similar to Datameer (see below) would allow R code to be community-sourced, letting the majority of faculty and students easily access and build their own learning dashboards.
Revolution R Offerings Including:
·  RevoDeployR
·  RevoConnectR
·  Integration with IBM Netezza / Revolution R provides support for the open source R engine and offers add-ons that enhance the integration and use of R within databases and websites. RevoDeployR is a server-based platform that provides access to the R engine via a RESTful API. RevoConnectR allows the R engine to use data stored in Hadoop. Revolution R also provides integration with IBM Netezza data warehouse appliances, providing a scalable infrastructure for analyzing very large datasets. / Revolution R is the only commercial support offering for R. Revolution R will be useful for institutions that have procurement or risk management policies that restrict the use of open source products.
Revolution R tools are free for research purposes, and their support contracts or licenses for institutional purposes (i.e. learning analytics and dashboards) are very reasonable; I was quoted $4,000/core for the RevoDeployR product. / The support that I received using RevoDeployR was very slow. However, I am not a supported customer.