Discovering and Visualizing the Social Structure of Academic Disciplines Through Text Mining

III-CTX-Small: ScholarMatch: Technology to Discover and Relate to Scholars around the Globe

1 Introduction

An undergraduate asks, “Which scholars have the most relevant publications for my paper on nineteenth-century Japanese social history?” A graduate student asks, “Which faculty might be interested in reading my draft research on Latin American revolutions?” A community college teacher asks, “What scholars in my city might be willing to do a guest lecture on modern labor practices in the Middle East?” An early-career researcher asks, “Who is doing research on sexuality outside my field of specialization in U.S. History?” A tenure committee asks, “Who would be an appropriate evaluator of a colleague who works on pre-colonial Nahuatl-speaking cultures?”

Our project, ScholarMatch, will allow users to easily answer all of these questions. ScholarMatch will be a user-friendly portal that connects social science and humanities scholars, students, and teachers via their shared intellectual interests in historical topics. ScholarMatch puts academic databases to a new use through state-of-the-art text mining techniques. We use topic modeling to create a semantic map of the subjects covered in more than 800,000 abstracts of historical publications over the past two decades. This topic representation allows us to create webs of scholars whose works fall into similar categories. Users can view lists of scholars who focus on particular thematic areas; investigate the combination of subjects on which individual scholars work; or track how various scholars relate to one another. ScholarMatch goes beyond keyword search to organize scholars by their overall subject interests. Rather than connecting users to individual pieces of scholarship on particular issues, ScholarMatch connects people based on how closely their constellations of interests match.
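One way to picture how scholars could be matched by their overall subject interests is to compare their topic proportions directly. The following is a minimal sketch, assuming each scholar is summarized by a vector of topic proportions and that cosine similarity is the comparison measure; the proposal does not specify a similarity measure, and the scholar names, topics, and numbers are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_similar_scholars(target, profiles, top_n=3):
    """Return up to top_n (name, score) pairs most similar to `target`, best first."""
    scores = [
        (name, cosine_similarity(profiles[target], vec))
        for name, vec in profiles.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical profiles over three illustrative topics:
# [Japanese social history, Latin American revolutions, labor practices]
profiles = {
    "scholar_a": [0.80, 0.05, 0.15],
    "scholar_b": [0.70, 0.10, 0.20],
    "scholar_c": [0.05, 0.90, 0.05],
    "scholar_d": [0.10, 0.10, 0.80],
}

matches = rank_similar_scholars("scholar_a", profiles)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Because the comparison uses the whole vector, two scholars rank as close only when their full mix of interests is similar, not merely when they share a single keyword.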

Moreover, ScholarMatch is not just a passive information provider. ScholarMatch users will be able to upload their own abstracts, research papers, syllabi, lecture outlines -- or even just a page of ideas -- for on-the-spot topic modeling that will show how their work relates to that of other scholars. We envision that ScholarMatch will be useful to a large segment of the academic community, such as undergraduates writing papers, graduate students learning a field, researchers looking for collaborators, college teachers developing lesson plans, or institutions seeking expert reviewers.
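The on-the-spot topic modeling of uploaded text described above could work by holding the already-learned topics fixed and estimating only the topic proportions of the new text. The following is a minimal sketch of that idea using a simple expectation-maximization "fold-in" step. The function name infer_topic_mixture, the fold-in approach itself, and the two toy topics are assumptions for illustration, not a description of the actual ScholarMatch implementation.

```python
def infer_topic_mixture(tokens, topic_word, iterations=50, floor=1e-9):
    """Estimate topic proportions for new text with the topics held fixed.

    topic_word is a list of dicts, one per topic, mapping word -> probability.
    Words that appear in no topic are ignored.
    """
    num_topics = len(topic_word)
    known = [w for w in tokens if any(w in topic for topic in topic_word)]
    if not known:
        # Nothing recognizable was uploaded: fall back to a uniform mixture.
        return [1.0 / num_topics] * num_topics

    theta = [1.0 / num_topics] * num_topics
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for w in known:
            # E-step: how responsible is each topic for this word?
            weights = [
                theta[k] * topic_word[k].get(w, floor) for k in range(num_topics)
            ]
            total = sum(weights)
            for k in range(num_topics):
                expected[k] += weights[k] / total
        # M-step: re-estimate the proportions from the responsibilities.
        theta = [e / len(known) for e in expected]
    return theta


# Two hypothetical topics standing in for a learned model.
topic_word = [
    {"meiji": 0.4, "village": 0.3, "peasant": 0.3},
    {"revolution": 0.5, "guerrilla": 0.3, "cuba": 0.2},
]

uploaded = "peasant protest in a meiji village".split()
theta = infer_topic_mixture(uploaded, topic_word)
print([round(x, 3) for x in theta])
```

The resulting proportions can then be compared against those of scholars already in the system, so even a page of ideas can be placed relative to published work.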

We focus the project on the discipline of History, a field that straddles the boundaries between the Humanities and the Social Sciences. It is the field of expertise of Dr. Sharon Block, the project's domain expert, and we already have access to major historical abstract databases. We match the expertise of a humanities scholar with the innovative skills of two computer scientists, both of whom already have significant experience in interdisciplinary work. These combined skills allow us to turn an enormous amount of text data -- in this case, more than 800,000 abstract entries plus thousands of CVs and homepages harvested from the web -- into a means to connect scholars around the globe.

We will use topic modeling, a relatively new text mining technique, to produce the semantic web subject index that runs ScholarMatch. The topic model is an unsupervised statistical language model that automatically learns a set of topics that together describe a collection in its entirety, while simultaneously organizing each document in the collection by topic. Topic modeling is an ideal way to extract and summarize structured information from unstructured data sources [9, 19-22].
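As a rough illustration of how such a model can learn topics and per-document topic proportions at the same time, the following minimal sketch fits a latent Dirichlet allocation (LDA) style model with collapsed Gibbs sampling. The choice of LDA and Gibbs sampling, the function name fit_topic_model, the hyperparameters alpha and beta, and the toy documents are all assumptions made for illustration; the proposal does not specify the model variant or the inference method.

```python
import random


def fit_topic_model(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for an LDA-style topic model.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab):
    doc_topic[d] holds document d's topic proportions, and topic_word[k]
    maps each vocabulary word to its probability under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Convert smoothed counts into probability distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        {
            w: (topic_word_counts[k][word_id[w]] + beta)
            / (topic_totals[k] + vocab_size * beta)
            for w in vocab
        }
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


# Toy stand-ins for abstracts; real input would be tokenized abstract text.
docs = [
    "meiji village labor peasant meiji".split(),
    "peasant village meiji labor".split(),
    "revolution cuba guerrilla revolution".split(),
    "guerrilla cuba revolution mexico".split(),
]
doc_topic, topic_word, vocab = fit_topic_model(docs, num_topics=2)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Note that no labels or keywords are supplied: the topics emerge only from which words tend to occur together, which is what makes the approach unsupervised.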

We will run the topic model on more than two decades of abstracts (1985-2007) from America: History and Life (AHL) and Historical Abstracts (HA). AHL and HA include articles from several thousand journals published world-wide, thousands of book and media reviews selected from one hundred key journals, and entries for dissertations and master’s theses. These data sets have already been collected.

In the first year, the ScholarMatch system will be based on writings from published scholars. Users will be able to browse the system and to have their writing temporarily topic-modeled as part of their search queries. In the second year, we will integrate an additional community and functionality: unpublished scholars and anyone not already in the system will be able to have their own writings permanently topic-modeled to create a page for themselves in ScholarMatch. This second community will grow organically, driven by word-of-mouth and publicity campaigns. We expect students and teachers to be a significant contingent of this group. In the third year, the project team will conduct extensive assessments of the impact of ScholarMatch on diverse user groups and make improvements to the ScholarMatch system.

ScholarMatch benefits diverse populations of users in multiple ways. Its innovative text mining technology transforms how teachers, researchers and students can search and understand thematic fields of scholarship. By being publicly available, ScholarMatch can increase involvement of traditionally underserved communities who often lack access to digital or scholarly resources. Furthermore, ScholarMatch bridges gaps between teachers and researchers. Rather than a traditional database that revolves around published scholarship, ScholarMatch encourages teachers to upload teaching content for topic modeling. This will connect them to other teachers and researchers in their areas of interest. Likewise, students can upload their own writings and directly compare themselves to those whose books and articles they have read. Thus, ScholarMatch will promote many forms of collaboration by connecting diverse members of the academic community. Finally, ScholarMatch levels the playing field between new and established scholars by providing a resource for anyone searching for experts in a particular subfield, without having to rely on “old boy networks” or institutional affiliations.

By building a system that allows users to find and relate to scholars around the world, ScholarMatch facilitates engagement between communities of researchers, teachers, and students in new and productive ways. ScholarMatch allows anyone who can write a few paragraphs to become an interactive participant in the social web of scholarly work.

2 Intellectual Merit

ScholarMatch is uniquely situated to make contributions to multiple disciplines. It has significant potential benefits for humanities and social science scholars: it breaks down barriers between published scholars and other members of the academic community; it allows for improved search and discovery; it promotes collaboration; it allows one to summarize, track, and see who is working in sub-fields; and it provides increased networking opportunities to people who may lack institutional resources or access to influential leaders in their field.

ScholarMatch focuses on two areas of scientific research. The first area, computer science, relates to the text mining and modeling challenges required to make a working ScholarMatch system. We will investigate new topic models that better model the type of heterogeneous corpora collected for the ScholarMatch system. The second area, informatics, relates to assessing the impact of ScholarMatch. In particular, we will assess the impact and usefulness of this advanced text mining technology on students, teachers at community colleges, early career researchers, and established researchers. This will allow us to focus particularly on the ways that ScholarMatch helps overcome divisions between research and teaching, and between new and established scholars.

2.1 Novelty of ScholarMatch

ScholarMatch will be novel in its wholesale application of text mining technology to fields far removed from computer science. Computer science and physics are served by academic literature digital libraries such as CiteSeer, Rexa and arXiv (all of which are NSF-funded); humanities fields are far less likely to have such broad-based technological resources that take advantage of just-developing computer science text mining techniques.

ScholarMatch moves beyond keyword or full-text search to topically categorize entire corpora. Rather than a search engine focused on finding individual pieces of scholarship, ScholarMatch provides users with overviews of thematic fields. Evaluating the total published scholarship -- and, for year-two self-uploaded users, the teaching/research interests -- available in a given field allows users to identify others with in-depth knowledge of a given topical constellation. ScholarMatch’s focus on scholars over publications builds on the increasing popularity of social relationship websites like Facebook and MySpace to provide a format understood by the younger generation of scholars and students.

Most currently available digital library or abstract search systems are still traditional read-only search systems. Because these systems contain, and mostly benefit, published scholars and active researchers, they can inadvertently reinstitute academic separations and institutional hierarchies between teaching and research.

In contrast, ScholarMatch is not a static database to be queried; it is an interactive system to connect all members of academic communities. For example:

  • An undergraduate uploads a historical document to find scholars whose work will be useful for the research paper he is writing.
  • A community college teacher uploads her lecture notes to see who might best be able to answer a student’s conceptual question about her latest lecture.
  • An M.A. student uploads his research paper to see which Ph.D. programs have scholars that best relate to his research interests.
  • A conference committee uploads panel proposals to fill slots for commentators and chairs.
  • A new Assistant Professor uploads her research summary to find likely collaborators outside of people she knows within her regional and chronological subfields.
  • A Search Committee uploads their job requirements (e.g., fields of scholarship) to find up-and-coming scholars who match their needs.

Thus, ScholarMatch users will not only be able to investigate relationships among thousands of scholars, they will be able to place themselves into that scholarly community.

2.2 Broader Impacts

ScholarMatch:

Integrates Research and Teaching: ScholarMatch uses technology to transform how both teachers and students integrate research into the educational process. Because topic modeling works well with unstructured text, community college teachers who don’t produce their own research might still be active users of ScholarMatch by uploading syllabi, lecture outlines, or even just course catalog summaries. ScholarMatch can provide direct connections between teaching topics and scholarship focused on that topical area. Undergraduate researchers and graduate students can use ScholarMatch to better understand the total scholarship in any given subfield.

Promotes Collaboration: ScholarMatch helps users find potential collaborators with similar interests, explains the connection by topic, and provides contact information for these scholars. Unlike scholars in fields that regularly publish with multiple authors, humanities and social science scholars may find it less obvious who viable collaborators might be. Historians are increasingly emphasizing the value of collaborative research, and ScholarMatch can help scholars looking for conference panelists, invited speakers, or research and writing partnerships. Teachers can also use ScholarMatch to find other educators with overlapping pedagogical interests.

Levels Academic Playing Field: ScholarMatch effectively performs a double-blind classification of anyone’s work. Furthermore, the topic model characterizes authors according to the subject of what they write, not the amount that they write, thereby leveling the playing field between new and established scholars. Because the abstract databases we use include dissertations and M.A. theses, early-career scholars will immediately be part of this scholarly community. ScholarMatch gives all interested users equal access to an array of networking opportunities, regardless of their personal connections, institutional affiliations, or geographic location. Thus, we see ScholarMatch as an excellent tool for leveling the academic playing field.

Reaches out to Traditionally-Underserved Communities: Because ScholarMatch will be freely available, users in academic communities that have been traditionally underserved (e.g. community colleges, underfunded B.A.-only institutions) can use ScholarMatch to increase their involvement in academic research and scholarship. Members of institutions without access to (increasingly expensive) proprietary academic databases will be able to use ScholarMatch to link to scholars around the world who work on topics of interest to them. ScholarMatch’s outreach and assessment efforts will include a specific focus on members of community colleges and traditionally underrepresented racial and ethnic groups.

Transforms Learning Experiences: By drawing students into scholarship, ScholarMatch will promote learning and discovery, even for those who are geographically isolated or at an educational institution without experts in their fields of interest. Uploading their own writings will make students part of an academic community. ScholarMatch allows students to directly compare their own interests to those of the scholars whose books and articles they have read.
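The playing-field point above, that authors are characterized by the subject of what they write rather than the amount, can be illustrated with a small sketch. It assumes an author profile is formed by summing per-document topic counts and normalizing them to proportions; the function name author_profile and all counts are hypothetical, and the actual system may build profiles differently.

```python
def author_profile(doc_topic_counts):
    """Sum per-document topic counts for one author and normalize to proportions."""
    num_topics = len(doc_topic_counts[0])
    totals = [sum(doc[k] for doc in doc_topic_counts) for k in range(num_topics)]
    grand_total = sum(totals)
    if grand_total == 0:
        # No usable text: fall back to a uniform profile.
        return [1.0 / num_topics] * num_topics
    return [t / grand_total for t in totals]


# Hypothetical word counts per topic, one row per abstract.
established_scholar = [[30, 10], [45, 15], [15, 5]]  # three abstracts
new_scholar = [[6, 2]]                               # one dissertation abstract

print(author_profile(established_scholar))
print(author_profile(new_scholar))
```

In this toy case both authors receive the same profile even though one has written fifteen times as much text, since only the proportions enter the comparison.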

2.3 Qualifications of Team

The PIs’ combined interdisciplinary expertise, proven record of successful topic modeling systems, and extensive experience with assessment situate our team well to undertake this multi-disciplinary project.

Dr. David Newman has experience in probabilistic language modeling and building software systems. He has already built prototype versions of several components of this system. In 2006, he built the Calit2 browser (http://yarra.ics.uci.edu/calit2), which uses topic modeling to compare UC Irvine and UC San Diego researchers based on their publications. In 2007, Newman built the UC Irvine topic modeler (http://yarra.ics.uci.edu/sam), a demonstration topic modeling system for an outreach workshop for Cyber-Infrastructure for Humanities, Arts and Social Sciences. The UC Irvine topic modeler allowed workshop participants to upload their own writings for on-the-spot topic modeling. Thus, Newman has created prototypes of two crucial features of the ScholarMatch system. He has also already completed a trial topic modeling run of 40,000 abstracts from America: History and Life, one of the two databases that will be the basis of ScholarMatch.

Dr. Sharon Block, a historian, is the domain expert on this project who will provide advice about the interpretability of learned topics and the accuracy of topic matches made by ScholarMatch. Block has already collaborated with computer scientists (including Newman) on several projects and has published on innovative technological approaches to humanities research [3, 19]. A scholar of gender, race and sexuality, Block also has a strong background in working with underrepresented minorities and women. Finally, Block has been a consultant to digital document providers, serves on several digital humanities publisher advisory boards, and has conducted usability evaluations for the past four years. Block’s historical expertise, commitment to increasing diversity, experience with user interface evaluations, and proven track record as a collaborator on topic modeling projects will provide a useful skill set for ScholarMatch.

Dr. Bonnie Nardi is an anthropologist by training, and is currently a Professor in a School of Information and Computer Science. Having expanded her initial academic training into a professional appointment in an Informatics department, Nardi has extensive experience working across traditional disciplinary boundaries. Equally important, Nardi is an expert on user testing and assessment. She will conceptualize, supervise and be responsible for answering the assessment research questions in this project.

3 Constituencies of Users

Multiple constituencies of users will benefit from finding and relating to scholars around the world. In this project, we will focus on two overarching communities: published scholars and unpublished scholars. Published scholars are in the system because they appeared in one of the databases (AHL or HA) over the past two decades. Unpublished scholars will create their own entry in the system and upload their own text content. Members of either community can be both users of the system and people represented in it. By creating a system where these two groups can effectively interact with each other, we bridge the divides between researchers, teachers, and students.

ScholarMatch aims to assess the impact of this technology on various groups of users. As such, we separately break out four different constituencies of users for assessment: (1) undergraduate students; (2) community college teachers; (3) early career researchers (which include graduate students and scholars within five years of the Ph.D.); (4) established researchers. While we believe that there will be other users who will find the proposed technology useful (e.g., high-school students and intellectually-curious members of the general public), we focus on these broad groups because we anticipate that they will most directly benefit from ScholarMatch. Furthermore, these groups are more easily identifiable for user testing and assessment. We envision these groups using ScholarMatch in the following ways:

Group 1: Undergraduate students (likely unpublished scholars): Undergraduate students are increasingly comfortable relying on online resources and social networking websites like Facebook and MySpace. ScholarMatch builds on both these interests by allowing students to not only search for scholars online, but to place their own writings alongside those of published scholars. For instance, students doing a research project with historical documents or other primary sources might upload text from those documents to see whose published scholarship might best help them analyze these sources. ScholarMatch also promotes networking for students at institutions without experts in their field(s) of interest. A ScholarMatch analysis can allow a student working in an isolated or underserved institution to reach out to published scholars working on their precise area of interest. Finally, undergraduates can quickly see an overview of major scholars in an entire field to get a sense of whose work they should be reading to understand a thematic area.

Group 2: Community college teachers (some published and some unpublished scholars): Community colleges are more likely to employ faculty whose primary focus is teaching rather than research. Because topic modeling works well with unstructured text, community college teachers who don’t produce their own research might still be active members of ScholarMatch by uploading syllabi, lecture outlines, or even just course catalog summaries. This breaks down barriers between teaching and research and provides a means for teachers to effectively connect with one another based on their teaching areas.

Group 3: Early career scholars, including graduate students (some published and some unpublished scholars): Because our databases include M.A. and Ph.D. theses, graduate students and other early career scholars will automatically be part of the ScholarMatch database. Early career researchers can use ScholarMatch to find readers for their article and book manuscript drafts, locate panel participants for annual meetings, or find collaborators for anthologies, special journal issues or topically-based conferences. For early career scholars who did not have well-connected mentors at their graduate institutions, this ability to place themselves appropriately within a topical field of scholarship will allow them to more effectively advance their careers. Moreover, all early career scholars can benefit from access to the wider social network of academic specialists provided by ScholarMatch.

Group 4: Established researchers (published scholars): Established researchers can also take advantage of the networking opportunities and comprehensive picture of scholarly subject areas that ScholarMatch and topic modeling provide. Interdisciplinary work is becoming increasingly important in the humanities and social sciences, and finding collaborators outside one’s subfield is challenging even for established researchers. ScholarMatch will provide an accessible means for published scholars to see how their work relates to that of other users in ways that they might not have realized. ScholarMatch will also be of use to departments looking for appropriate tenure evaluators, editors looking for manuscript reviewers, or search committees looking for proven candidates in a particular subfield.