Introduction to big data analytics and data mining
Instructor: Chong Ho (Alex) Yu, Ph.D.
Chong Ho (Alex) Yu has a Ph.D. in educational psychology (Arizona State University, ASU) with a concentration on measurement, statistics, and methodological studies, and a Ph.D. in philosophy (ASU) with a specialization in history and philosophy of science. His research interests include, but are not limited to, alternate and emerging research methods, instructional psychology and technology, psychology of religion, philosophical aspects of research methodologies, and cross-cultural comparison (e.g. PISA, PIAAC, TIMSS). Some of his articles are accessible at
Email:
Phone: 480-567-4782
Course Description
This course will cover research methods based on pattern recognition (e.g. exploratory data analysis, data visualization, text mining, and data mining/big data analytics) as well as philosophical concepts of quantitative research methodology. You will gain hands-on experience in statistical computing. By the end of this course, you will be equipped with powerful tools for analyzing both structured and unstructured data. Specifically, you will be able to independently carry out a professional-level research study for your thesis/dissertation, conference presentations, and peer-reviewed journal papers.
Specific Student Learning Outcomes
The approach of this introduction is conceptual. Students will not be expected to memorize equations and formulas or to perform calculations by hand. Students who successfully complete this course will be able to:
1. Understand the limitations of hypothesis testing and the historical/philosophical roots of different research traditions.
2. Select the appropriate research method(s) based on the research goal and the data structure, and defend the rationale of the method choice.
3. Triangulate and validate the results using multiple approaches.
Software
JMP Pro Version 13 (required)
IBM SPSS Modeler 18 (optional): will be provided by the instructor
SAS (optional)
Textbook and reading materials
Yu, C. H. (2014). Dancing with the data: The art and science of data visualization. Saarbrucken, Germany: LAP.
The book can be downloaded at The password to unlock the file is yu2. Please do not distribute the book to anyone or post it on any website.
PowerPoint slides and reading materials will be provided by the instructor.
Tentative schedule
Week / Topic / Core ideas and reading assignments
1 / Limitations of conventional approaches and introduction to big data analytics and data mining / The underlying logic of hypothesis testing is: Given that the null hypothesis (chance hypothesis) is true, what is the probability (p value) of observing the data at hand in the long run (expressed by the theoretical sampling distribution)? As a result, many researchers are obsessed with reporting the p value. In fact, what we really care about is: Given the data, what is the best theory to explain the observed phenomenon?
Optional:
Optional:
Optional:
With the advance of internet technology and the resulting information explosion, it is very common for researchers to deal with extremely large data sets. However, most conventional methodologies are not suitable for big data analytics. How should psychological researchers cope with this paradigm shift?
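The logic of the p value described for Week 1 can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the coin-tossing data (60 heads in 100 tosses) and the function name binomial_tail_p are hypothetical and are not part of the course materials.

```python
from math import comb


def binomial_tail_p(n: int, k: int, p_null: float = 0.5) -> float:
    """One-sided exact p value: P(X >= k) when X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, i) * p_null**i * (1 - p_null) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical data: 60 heads in 100 tosses.
# Null (chance) hypothesis: the coin is fair, so p_null = 0.5.
p_value = binomial_tail_p(n=100, k=60)
print(f"P(at least 60 heads | fair coin) = {p_value:.4f}")
```

Note what the number answers: how probable data this extreme would be if the null hypothesis were true. It does not tell us how probable any theory is given the data, which is the question the course description says researchers really care about.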
2 / Abductive reasoning and exploratory data analysis / Exploratory data analysis (EDA), which is the precursor of data mining, is data-driven and exploratory in essence (non-inferential). It does not start with a strong theory; instead, it lets the data speak for themselves. Indeed, both EDA and data mining are abductive in nature.
Required (The instructor will send the full text to the class):
Yu, C. H. (2017). Exploratory data analysis. In D. Bricken (Ed.). Oxford Bibliographies. New York, NY: Oxford University Press.
Optional: Yu, C. H. (1994, April). Induction? Deduction? Abduction? Is there a logic of EDA? Paper presented at the Annual Meeting of American Educational Research Association, New Orleans, LA. (ERIC Document Reproduction Service No. ED 376 173) Retrieved from
3 / Data visualization / Relying on numbers alone may lead to erroneous conclusions. Data visualization includes a plethora of techniques for checking assumptions, spotting outliers, and detecting patterns in the data. Seeing is believing!
Required: Yu (2014), Preface and Chapter 1
Optional: Yu, C. H., & Stockford, S. (2003). Evaluating spatial- and temporal-oriented multi-dimensional visualization techniques for research and instruction. Practical Assessment, Research & Evaluation, 8(17). Retrieved from
4 / Data mining: basic concepts / Data mining is the process of automatically extracting useful information and relationships from immense quantities of data. Data mining does not start from a strong pre-conception, a specific question, or a narrow hypothesis; rather it aims to detect patterns that are already present in the data. Basic concepts such as supervised machine learning and unsupervised machine learning will be covered. Artificial neural networks will be demonstrated as an example of data mining.
Optional: Yu, C. H., Lee, H. S., Gan, S., & Brown, E. (2017, September). Nonlinear modeling with big data in SAS and JMP. Paper presented at Western Users of SAS Software Conference, Long Beach, CA. Retrieved from
Optional: Yu, C. H., DiGangi, S., Jannasch-Pennell, A., & Kaprolet, C. (2010). A data mining approach for identifying predictors of student retention from sophomore to junior year. Journal of Data Science, 8, 307-325. Retrieved from
5 / Generalized regression and decision tree / When a regression model contains too many predictors, the predictors tend to be correlated with one another, which violates the assumption of independence among predictors. As a remedy, generalized regression extends conventional OLS regression by exploring different paths to the solution.
Required:
The decision tree approach is a quick and accurate way of identifying predictors and classifiers when there are many independent variables. This procedure is non-parametric and robust against outliers.
Optional: Yu, C. H., DiGangi, S., Jannasch-Pennell, A., & Kaprolet, C. (2008). Profiling students who take online courses using data mining methods. Online Journal of Distance Learning Administration, 11(2). Retrieved from
Optional: Yu, C. H., Jannasch-Pennell, A., DiGangi, S., Kim, C., & Andrews, S. (2007). A data visualization and data mining approach to response and non-response analysis in survey research. Practical Assessment, Research & Evaluation, 12(19). Retrieved from
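To make the decision tree idea for Week 5 concrete, the following Python sketch finds the single best split on one predictor by minimizing Gini impurity. It is a minimal illustration, not the algorithm used by JMP or SPSS Modeler; the variable names (hours, retained) and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Return (threshold, weighted Gini impurity) of the best single split."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between adjacent observed values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: weekly study hours and whether the student was retained.
hours = [1, 2, 3, 4, 6, 7, 8, 9]
retained = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(hours, retained)

# Replace the largest value with an extreme outlier: the split is unchanged,
# because only the ordering of the values matters, not their magnitude.
hours_outlier = [1, 2, 3, 4, 6, 7, 8, 900]
threshold_outlier, _ = best_split(hours_outlier, retained)
print(threshold, impurity, threshold_outlier)
```

The second call illustrates why a tree is robust against outliers: the split depends on rank order, so an extreme value does not move the cut point.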
6 / Resampling and ensemble methods / One major weakness of classical procedures is that the result yielded from a single study might not be replicable due to overfitting. Ensemble approaches, such as the bootstrap forest and the boosted tree, run multiple models by repeatedly sampling from the same data set. In the end, they produce a stable and coherent model by merging all sub-models.
Optional: Yu, C. H. (2007). Resampling: A conceptual and procedural introduction. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 283-298). Thousand Oaks, CA: Sage Publications. password: yu2
Required: Yu, C. H. (2003). Resampling methods: Concepts, applications, and justification. Practical Assessment, Research & Evaluation, 8(19). Retrieved from
Optional: Shiao, P., Grayson, J., Yu, C. H., Wasek-Patterson, B., & Bottiglieri, T. (2018). Gene environment interactions and predictors of colorectal cancer in family-based multi-ethnic groups. Journal of Personalized Medicine, 8(1), 1-21. doi:10.3390/jpm8010010. Retrieved from
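The resampling step behind the Week 6 methods can be sketched as follows. This Python example illustrates only the bootstrap (sampling with replacement from the same data set and merging the results); it is not a bootstrap forest or a boosted tree. The scores, the seed, and the function name bootstrap_means are hypothetical.

```python
import random
from statistics import mean


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of n_resamples samples drawn with replacement from data."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Hypothetical test scores from a single small study.
scores = [52, 55, 58, 60, 61, 63, 66, 70, 74, 91]

boot = sorted(bootstrap_means(scores))
lower = boot[int(0.025 * len(boot))]        # 2.5th percentile
upper = boot[int(0.975 * len(boot)) - 1]    # 97.5th percentile
merged = mean(boot)                         # estimate merged across resamples

print(f"Sample mean: {mean(scores)}")
print(f"Merged bootstrap estimate: {merged:.2f}")
print(f"95% percentile interval: [{lower}, {upper}]")
```

Ensemble methods apply the same idea to models rather than means: a model is fitted to each resample, and the sub-models are merged, for example by averaging or voting.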
7 / Cluster analysis, EFA, and PCA / Cluster analysis is a data reduction and profiling technique. It can reduce many variables into one grouping factor and classify observations by their characteristics. Various clustering techniques, such as K-means clustering, hierarchical clustering, two-step clustering, and DBSCAN, will be discussed.
Optional: Yu, C. H. (2012). Beyond Gross National Product: An exploratory study of the relationship between Program for International Student Assessment Scores and well-being indices. Review of European Studies, 4. doi:10.5539/res.v4n5p119 Retrieved from
Both principal component analysis (PCA) and exploratory factor analysis (EFA) are dimension-reduction techniques, but there is a subtle difference between them. PCA is for convenient reduction, whereas factor analysis is for identifying latent constructs. Additionally, the former analyzes the total variance, whereas the latter extracts only the shared variance.
Required:
Optional: Yu, C. H. (2011, October). Principal component regression as a countermeasure against collinearity. Paper presented at Western Users of SAS Software Conference, San Francisco, CA. Retrieved from
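The statement that PCA accounts for all of the variance can be checked with a small two-variable example. The Python sketch below uses the closed-form eigenvalues of a 2 x 2 covariance matrix; the scores and the function names are hypothetical, and real analyses in this course would be run in JMP rather than by hand.

```python
from math import sqrt
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pca_2d_eigenvalues(xs, ys):
    """Eigenvalues (largest first) of the covariance matrix [[a, b], [b, c]]."""
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    centre = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b**2)
    return centre + radius, centre - radius


# Hypothetical scores on two correlated measures.
math_score = [2, 4, 6, 8, 10]
reading_score = [1, 3, 2, 5, 4]

first, second = pca_2d_eigenvalues(math_score, reading_score)
total_variance = covariance(math_score, math_score) + covariance(
    reading_score, reading_score
)
share_first = first / total_variance

print(f"Eigenvalues: {first:.3f}, {second:.3f}")
print(f"Variance explained by the first component: {share_first:.1%}")
```

The two eigenvalues add up to the total variance of the original variables, so nothing is set aside as unique or error variance. Factor analysis, by contrast, models only the variance that the variables share.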
8 / Text mining and final presentation / Most software packages are made for structured data, but in fact the amount of unstructured data (open-ended responses, websites, etc.) far exceeds that of structured data. Text mining is an efficient way to tap into unstructured data for thematic analysis.
Required: Yu, C. H., Jannasch-Pennell, A., & DiGangi, S. (2011). Compatibility between text mining and qualitative research in the perspectives of grounded theory, content analysis, and reliability. Qualitative Report, 16, 730-744. Retrieved from
Optional (The full text will be provided by the instructor): Yu, C. H., DiGangi, S., & Jannasch-Pennell, A. (2011). Using text mining for improving student experience management. In P. Tripathi & S. Mukerji (Eds.), Cases on innovations in educational marketing: Transnational and technological strategies (pp. 196-213). Hershey, PA: IGI Global.
Optional: Yu, C. H. (2015). Are positive trait attributions for the deceased caused by fear of supernatural punishments?: A triangulated study by content analysis and text mining. Journal of Psychology and Christianity, 34, 3-18 (password: yu2).
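A first step in text mining, counting terms across open-ended responses, can be sketched as follows. This Python example is a simplified illustration: the responses, the stop-word list, and the function name term_frequencies are hypothetical, and a full text-mining workflow would add steps such as stemming and phrase detection.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list for this example only.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "was", "but", "were", "too"}


def term_frequencies(responses):
    """Count non-stop-word tokens across a list of free-text responses."""
    counts = Counter()
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Hypothetical open-ended survey responses.
responses = [
    "The instructor was helpful and the labs were helpful too.",
    "Labs were too long, but the instructor was clear.",
    "Helpful instructor, clear slides.",
]

freq = term_frequencies(responses)
print(freq.most_common(4))
```

Frequently occurring terms are candidate themes. The analyst still has to read the responses in context to interpret them, which is where text mining meets qualitative methods.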
Attendance (10%)
Class attendance is required. Learning is an interactive process, and it is impossible for you to participate if you are not here. Attendance will be taken at every class period. PLEASE BE ON TIME! Three late arrivals (15 minutes or more) or three early departures (15 minutes or more) are counted as one absence. Acceptable reasons for excused absences include personal illness, funeral services, conference attendance/presentation, jury duty, family crisis, job interview, military service, or representing the university in a sporting event. Unacceptable reasons include work, vacation, partying, babysitting, dating, and shopping. If your unexcused absences are excessive (10% or more of all classes), you will lose 1 point for each absence.
If you miss class, it is your responsibility to obtain lecture notes, activities, assignments, etc. from another student. The professor is not responsible for making sure that you have lecture notes if you miss class. The professor will provide information on activities or assignments if there are questions, but it is your responsibility to follow up to get this information.
Computing assignments and discussions (20%)
There will be computer-based assignments and group discussions throughout the course, totaling 20% of the final grade in the class.
Team presentation (30%)
You will employ one or more research methods covered in the class to analyze a data set, and then present the findings in the last class. You are welcome to download archival data sets or collect your own data. In your career, you will need to work effectively in diverse groups to complete complex tasks. To do this successfully, you must communicate well in a group setting. Please form a team to present your research project. Although there is no length requirement for your PowerPoint, the presentation must be at least 20 minutes in length.
Exams (40%)
There will be two non-cumulative exams (20% each) in the course covering the readings and discussions. The exams are open-book and take-home. The exam questions will be similar to the review questions.
Policies
Late work and make-up exams/assignments: Please put the deadline into your calendar and set up reminders. Late assignments will not be accepted. NO EXCEPTIONS. No make-up exams or assignments will be allowed unless a VERIFIABLE written medical excuse from a physician is received that SPECIFICALLY STATES THE FOLLOWING:
- Date (which must correspond to the date of the missed assignment and/or the exam)
- Illness
- Physician’s contact information
Work and attendance at an event that is not academic-related are not considered valid reasons for missing an assignment or an exam (presenting at a conference is an academic-related event).
The reason for the stringent policy is so that all assignments/exams can be read and evaluated at the same time, with the same criteria, and returned promptly.
Use of laptops, tablets, and smartphones: You are encouraged to bring your laptop or tablet to class to take notes and do assignments. You may not use these devices for any purposes unrelated to class (e.g., e-mails, instant messages, games, auctions, web browsing, etc.). Other disrespectful or disruptive behaviors, such as arguing with the instructor, cell phone use, and talking with classmates about subjects unrelated to the class, will result in dismissal from the classroom and the deduction of 1 point from your course grade.
Expectations regarding academic integrity in this class:
- Expectations are consistent with those outlined in the academic policy especially as they pertain to definitions for cheating, fabrication, and facilitating academic dishonesty.
- Any material (including but not restricted to textbook, other texts, journals, magazines, websites) incorporated into your assignments (reaction papers, term paper, other non-examination writing assignments) must be properly cited. Plagiarism will not be tolerated.
- Plagiarism is defined as intentionally or knowingly representing the words, ideas, or work of another as one’s own in any academic exercise. The instructor may use an online service to assess for plagiarism in the student’s written work.
- Any type of plagiarism, cheating during an examination, turning in assignments not written by the student, and making up or altering data will result in an “F” grade for the course.