Computer Programming for Journalists

Instructor: Charles Seife

Purpose

"News is what somebody does not want you to print. All the rest is advertising."

Data and documents can help you see through lies and expose facts that certain people would rather keep hidden. This course is designed to give you the computer savvy to capture, extract, and manipulate data through computer tools -- as well as a reasonable working knowledge of the programming language Python.

Classes, assignments, and expectations

A word of warning: this course is progressive; each new topic will depend heavily upon ideas introduced in previous weeks. If you fail to keep up for even a short time, you can quickly get lost and wind up playing catch-up for the rest of the semester. Therefore, it behooves you to come to class. It also is important to hand in your assignments on time. As the course is front-loaded toward data processing and manipulation -- so that you'll have these tools in hand if you need them for the class assignment -- be aware that you can feel quite lost quite quickly. Please come and see me ASAP if you're having trouble; there's no shame in seeking extra help to understand difficult subjects!

Any take-home exercises will not be overly arduous, but they will involve tasks such as looking through data, doing research, or writing programs. These exercises are an important part of the course, and they should be completed and ready to hand in before class begins each week. Unless instructed otherwise, please work on them alone; do not collaborate with fellow students.

Deadlines are very important in journalism; you are expected to meet them. Late work will be penalized. Lateness and absences are no more acceptable in the class than they would be in a work environment -- and there must be an extraordinary reason, sent to me in advance of the absence, for missing even a single class. (Religious holidays are acceptable reasons -- but you must give notice several weeks ahead of time if you are likely to miss class.)

Unethical behavior, such as plagiarism or fabrication, is enough to wreck a seasoned journalist's career; the consequences of such a breach by a student will also be extremely severe.

There is an expectation of some privacy within the classroom walls, though that expectation is not absolute. If you blog, tweet, or otherwise publish about the class, I expect you not to reveal the identity of any fellow student without his/her permission. In a similar vein, I am hoping that we (and any guests) will discuss extremely sensitive matters -- including details of what promises to be a very interesting investigation -- with complete candor. As a result, I expect you not to treat the class discussion as fully on the record. At the same time, the environment is not completely protected -- you have the right to discuss the substance of the class (and to critique it). As all good journalists must, you must exercise judgment in balancing the public's need to know (and your right to express yourself) with concerns about privacy and sensitivity of information. If you have any questions, feel free to ask.

Texts, hardware, and software

We will be using computer tools to assist in our investigations. Please bring a laptop -- not a smartphone or tablet -- to class every day.

We will be using the following software:

-- A sophisticated spreadsheet. Microsoft Excel is ideal. Numbers for Mac will not work for the purposes of this course; use either MS Excel or LibreOffice (a free software suite that runs on Macs).

-- Database programs. Most likely this will mean SQLite and/or MySQL. SQLite can be used through an extension to the Firefox browser, so if we are using that program, I will instruct you on how to use it. If we use MySQL, I will have it installed on a virtual machine (see below).

-- A GIS program. The ne plus ultra of GIS software is ArcGIS, but it's expensive. There is a free alternative, called Quantum GIS, which is almost as good. I will instruct you on how to install Quantum GIS on your computer.

-- Tableau. This is a free-to-start bit of software that is increasingly in use among the data literati. It has rudimentary GIS ability as well as tools for making much more interesting graphs than Microsoft Excel alone is capable of.

-- Oracle VM VirtualBox. This is a program that allows you to run a virtual machine: a computer that will run within your computer. It is free. The virtual machine I am building will be a Linux-based machine that will allow us to do database work together. I hope to have other software -- such as spreadsheets and GIS software -- installed as well. However, virtual machines are slow, so you'll probably prefer to install the spreadsheet/GIS software on your own computer rather than run it within the virtual machine. (If this sounds confusing, don't worry... I'll explain in class.)

-- Python 3.x. We will discuss the installation of this programming language on your machines in class.

Please wait to make any serious purchases of software or hardware until you discuss your plans with me.

I think it is helpful to have a program that prints files to PDF format -- it allows you to keep a personal archive of online news sources that might be altered or disappear. Macs have this ability already. For PCs, Adobe Acrobat gives you this ability. You can also find many open-source programs (such as PDFCreator) that do the same for free.

Another useful tool is a jump drive: one with 16 GB or more will allow you to transport large databases around should you need to.

Keep on top of your email; I will be drawing your attention to certain articles and will be sending you additional materials and readings.

Grading

Class participation, in-class exercises, and effort: 25%

Midterm, problem sets and take-home exercises: 40%

As noted in the class schedule, the midterm will be an in-class exam.

Final project: 35%

The final project will be a programming "feat" that you will decide upon during class. This "feat" will use a substantial program of your own design that will do work that is of interest to a journalist: it could be a webscraper, a program to handle or convert specialized data, a program that displays information in a novel way... essentially, it will be something that transforms data from a state that's not easily usable into one that you (and other journalists) can use for stories.
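To give a rough sense of what "transforming data" can mean in practice, here is a minimal sketch in Python 3. Everything in it is an assumption made up for illustration: the sample lines in RAW, the layout they follow, and the names PATTERN, parse_lines, and to_csv. It pulls a name, a dollar amount, and a year out of loosely formatted text and writes the result as CSV that a spreadsheet can open. A real project would be larger and would work from real source material.

```python
import csv
import io
import re

# Hypothetical raw lines, as they might come out of a PDF or a web page.
RAW = """\
Smith, John ..... $1,200.50 (2019)
Doe, Jane ..... $980 (2020)
this line is junk and will be skipped
"""

# Assumed layout: a name, a run of dots, a dollar amount, and a
# four-digit year in parentheses.
PATTERN = re.compile(
    r"^(?P<name>.+?)\s*\.{2,}\s*"
    r"\$(?P<amount>[\d,]+(?:\.\d+)?)\s*"
    r"\((?P<year>\d{4})\)$"
)


def parse_lines(text):
    """Return a list of dicts, one for each line that matches PATTERN."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that don't fit the assumed layout
        rows.append({
            "name": match.group("name"),
            "amount": float(match.group("amount").replace(",", "")),
            "year": int(match.group("year")),
        })
    return rows


def to_csv(rows):
    """Return the rows as a CSV-formatted string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "amount", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_lines(RAW)))
```

Lines that do not fit the assumed layout are silently skipped here; in a real project you would probably want to log them so that you can check by hand what the program ignored.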

Tentative class schedule

Week 1:

Introduction to data. The sanity check. Bogus numbers.

Data as a double-edged sword

Data versus metadata

Surveillance and sousveillance

Week 2:

Data and numeracy

Descriptive statistics.

Boolean Logic

Spreadsheets, part I.

Week 3:

Spreadsheets, part II.

Cleaning data.

Project: Reporting plans

Week 4:

Databases, SQL.

Week 5:

Python I: Hello, world! Introduction, simple I/O; data types and variables

Week 6:

Python II: Looping and branching; strings

Week 7:

Python III: Simple debugging tips; file (and web) I/O

Week 8:

Midterm Exam: This will be a timed, in-class, open book/open note/open laptop exam, with no internet access.

Week 9:

Python IV: Arrays, lists, dictionaries, and data structures

Week 10:

Python V: Functions; regular expressions

Week 11:

Python VI: Intro to object-oriented coding

Week 12:

Display of information: Graphs, GIS

Python VII: Graphics I

Week 13:

Python VIII: Graphics II

Week 14:

Efficiency concerns; looking to the future

NOTE: LAST DAY OF CLASS

Contacting the professor:

Office: 20 Cooper Square, rm. 628

Telephone: 212 998 7894

e-mail:

skype: cgseife

Office hours:

TBA

I am also happy to meet at other times if you make an appointment. Don't be shy -- if you need any advice or direction, feel free to come by.