(An introduction to) Data Mining

Tommy Kaas

Journalist, partner

Kaas & Mulvad

Email:

From web pages

If your numbers or tables are on web pages, it’s usually easy to copy them into an empty spreadsheet. See handout for instructions.

After the import you’ll usually have to cleanse the imported data before you can use it for calculations or anything else. See handout for instructions.
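If you prefer a script to manual clean-up, the cleansing step could look roughly like the sketch below. It is not taken from the handout. The function name clean_number is made up for illustration, and the sketch assumes English-style numbers where the comma is a thousands separator and the period is the decimal mark. Swap the two if your source uses the opposite convention.

```python
import re

def clean_number(cell):
    """Convert a pasted cell such as ' 1,234.50 ' or '12 %' to a float.

    Returns None when the cell does not hold a number.
    Assumption: comma = thousands separator, period = decimal mark.
    """
    # Web pages often contain non-breaking spaces; treat them as normal spaces.
    text = cell.replace("\u00a0", " ").strip()
    # Drop thousands separators, inner whitespace and percent signs.
    text = re.sub(r"[,\s%]", "", text)
    try:
        return float(text)
    except ValueError:
        return None

if __name__ == "__main__":
    pasted_cells = [" 1,234.50 ", "12 %", "n/a"]
    print([clean_number(c) for c in pasted_cells])
```

Cells that cannot be read as numbers come back as None, so you can spot them and check them by hand instead of silently losing them.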

From pdf-files

If the pdf-file is made with Adobe Acrobat (from a document like a spreadsheet) it’s possible to extract the numbers/tables again.

If you have the full version of Adobe Acrobat (not the free reader) USE IT!

If another department in your paper has a license to the full version, ask to use it for this purpose.

There is a special tool for extracting tables from pdf-files. This is very often the best solution.

If you only have the free reader, you can use that instead, but it’s a bit harder. Choose the text selection tool and copy the table, perhaps column by column, into an empty spreadsheet.

If you don’t do it column by column, you’ll afterwards have to use the “Text to columns” tool in Excel (Menu: Data – Text to columns).
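The same splitting can be done with a few lines of code if you would rather not do it in Excel. The sketch below is only an illustration, not Excel’s own method. The function name text_to_columns and the min_spaces parameter are invented here, and it assumes that columns in the pasted text are separated by a tab or by at least two spaces, so that a single space inside a name like “New York” does not split the cell.

```python
import re

def text_to_columns(pasted, min_spaces=2):
    """Split text copied from a pdf table into rows of cells.

    Assumption: columns are separated by a tab or by at least
    `min_spaces` whitespace characters; single spaces stay inside a cell.
    """
    separator = re.compile(r"\t|\s{%d,}" % min_spaces)
    rows = []
    for line in pasted.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines between table rows
        rows.append(separator.split(line))
    return rows

if __name__ == "__main__":
    sample = "City       Population\nNew York   8,400\n"
    for row in text_to_columns(sample):
        print(row)
```

If the text you copied uses only single spaces between columns, this rule cannot tell columns from words, and copying column by column is the safer route.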

You may have to cleanse this data too after the import.

If the pdf-file is made like a picture (like a jpg), perhaps from a scanned original, you can’t select text from the tables in Adobe Acrobat. You will then need an OCR application. OCR stands for Optical Character Recognition: OCR software takes an image of text, such as letter.jpg or letter.pdf, picks out the characters on a page as individual letters and numbers, and changes them into text that can be read by a word processor or spreadsheet editor. An OCR application can read and translate big files with lots of pages in one go.

From databases

If you need information from many web pages or from a database and don’t have weeks to do it manually, you’ll need an automated process.

If you can code in Perl, Python, PHP or something similar, you can write your own application that extracts data from specific pages/databases.
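To give an idea of what such a home-made application involves, here is a minimal Python sketch that pulls the rows of an HTML table out of a page. It is an illustration under stated assumptions, not the method we use. The names TableExtractor and extract_table are invented for the example, the page is assumed to hold simple, non-nested tables, and the HTML is given as a string. For a real page you would first download it, for example with urllib.request.urlopen from Python’s standard library.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Assumption: tables are simple and not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def extract_table(html):
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows

if __name__ == "__main__":
    page = ("<table><tr><th>City</th><th>Population</th></tr>"
            "<tr><td>Aarhus</td><td> 1,200 </td></tr></table>")
    for row in extract_table(page):
        print(row)
```

A real scraper would repeat this for every page you need and save the rows to a file, which is where the time saving compared with manual copying comes from.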

If you can’t code but have money, you can buy an application. Several are available. We use Kapow. It’s fast and efficient, but expensive.

Kapow is available in an open source edition too. It’s free, but if you use it to build your own scraper robots you must upload every robot you make to an open library (where your competitors can find them).

Links:

Extraction from pdf

Adobe Acrobat 8 Professional (Get a 30-day free trial) http://www.adobe.com/products/acrobatpro/tryout.html

Examples of OCR Software

ABBYY FineReader http://www.abbyy.com/

OmniPage http://www.nuance.com/omnipage/

Examples of robot tools

Kapow http://kapowtech.com

Openkapow http://openkapow.com
