DEXTER: a Light USER MANUAL


Support for DEXTER V.1.0.0.1

This document is a quick survey of the features of the free version of DEXTER, which originates from an internal NEHOOV tool used for data wrangling and data analytics. We designed features that are useful to us, typically upstream of a Machine Learning layer such as PREDICT or NEURALSIGHT.

We hope you will enjoy the design of DEXTER & this way of doing analytics! It is a summary of 15 years of operational experience. Without programming, DEXTER allows you to obtain relevant findings in just a few clicks.

We look forward to your feedback, ideally in the form of improvement proposals and suggestions for additional features.

Last remark: this user manual was written by a non-native English speaker, so please be indulgent.

Purpose of DEXTER

DEXTER is a tool devoted to processing data, with two purposes: (1) streamline data, and (2) offer features for data analytics, including some stochastic methods. Following our ECO DATA ANALYTICS approach, we aim at running DEXTER on a simple laptop, even on databases with tens or hundreds of millions of records. The minimal technical requirements are a Core i5 CPU and at least 6 GB of RAM, although 12 or 16 GB is better, of course.

We decided to build DEXTER with free packages, for instance for the graph and table components. These are not top-of-the-line components, but they have the advantage of being free. We did not want to fall into the trap of non-free graphical components, because of… you know, our ECO DATA ANALYTICS approach ;)

Opening DEXTER

After launching DEXTER, you reach the general panel, which is mostly blank:

Fig. 1

Throughout this document, we will write VZ for the left vertical zone, GZ for the top horizontal zone (‘Graph Zone’) and TZ for the bottom horizontal zone (‘Table Zone’). There is also a blue info bar IB, which is most of the time not used in this free version.

As advised on IB when you open DEXTER, you first need to create a universe. A universe is a DEXTER object into which you import databases, represented as nodes, as you will see later. A universe also specifies some default technical import parameters.

As you can see in Fig. 1, DEXTER features are split into 4 feature menus: Universe, Field, Wrangling and Tool. By clicking on the About menu, you will discover that this is version 1.0.1, without any limitation of use!

Importing data

After creating your first universe, as advised on IB, you now need to import database(s) by opening the Universe menu and clicking on “Import”. In this free version, importing directly from databases (such as SQL or ORACLE) is not available: you can only import csv or text file(s).

Fig. 2

The file browser zone allows you to select multiple .csv or text files, which must have the same “shape” in terms of data layout. A “smart” import process handles files containing empty lines or comment cells. For instance, the following csv file can be imported:

Fig. 3

In DEXTER you obtain:

Fig. 4

The following example, which contains introduction lines, has a blank line 10, and has column H empty except for cell H2,

Fig. 5

is imported into DEXTER as follows:

Fig. 6

Note that column H of the imported file, which contains only one non-empty value, is discarded by DEXTER, because not much can be done with it. So DEXTER has a rudimentary AI for import.
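
To give a flavour of this “rudimentary AI”, here is a minimal Python sketch of a similar heuristic (an illustration of the principle only, not the actual DEXTER code: the function name and the “at most one non-void value” rule are ours, and the handling of introduction lines and comment cells is reduced to skipping blank rows):

    import csv

    def smart_load(path):
        """Illustrative heuristic: skip blank lines, then drop columns
        that contain at most one non-empty value (as column H in Fig. 5)."""
        with open(path, newline="") as f:
            rows = [r for r in csv.reader(f) if any(cell.strip() for cell in r)]
        header, data = rows[0], rows[1:]
        # keep only columns with more than one non-empty data value
        keep = [j for j, _ in enumerate(header)
                if sum(1 for r in data if j < len(r) and r[j].strip()) > 1]
        return [[r[j] if j < len(r) else "" for j in keep] for r in [header] + data]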

Before continuing, let’s introduce some DEXTER vocabulary: in a data table, lines are called records, and columns, fields. In fact, a field is a bit more than that: it is an attribute that has a name, a type and a list of values. A record is just one realization of the collection of fields.

Note also (Fig. 6) that after the import, empty values are coloured in orange, and the “date” value of field y is coloured in red: DEXTER identified field y as being of type integer, so a date value is not considered acceptable. We will talk about cleaning later.
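
As an illustration of this kind of type check (again just a sketch of the idea, not DEXTER’s algorithm), a field identified as integer can be scanned to flag the values that do not conform:

    def check_integer_field(values):
        """Flag values that do not fit an integer-typed field.
        Returns two index lists: (empty values, invalid values)."""
        void, invalid = [], []
        for i, v in enumerate(values):
            s = v.strip()
            if not s:
                void.append(i)            # shown in orange in DEXTER
            else:
                try:
                    int(s)
                except ValueError:
                    invalid.append(i)     # e.g. a date in an integer field, shown in red
        return void, invalid

    # Example inspired by field y of Fig. 6
    print(check_integer_field(["12", "", "2021-05-03", "7"]))   # ([1], [2])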

Coming back to the import pop-up (Fig. 2), there are 4 options:

Default block size reserves a part of the RAM for DEXTER’s use.

Max degree of parallelism is the maximum number of simultaneous threads allowed to DEXTER. These 2 parameter values come from the universe preferences (see later).

Comprehensive field identification, when ticked, analyses all the fields during the import (e.g. detecting different date formats within a field).

Compute stat when import runs (or not) a rudimentary statistics report on all the fields.

These last 2 options can also be triggered later, after the import, and even field by field to save time (on a huge data set, you probably don’t need statistics on all the fields).

After validation, the import process starts in background mode. The background area at the top of VZ allows you to follow the progress of the import.

Fig. 7

When the import has ended, clicking on the open button creates 2 nodes in the graph zone GZ:

Fig. 8

In Figure 8 the file “evap” was imported, creating the data node “evap” and an associated node that stores the statistics results. In VZ, a description zone allows you to change the name of the node and to store comments (by default, the comment is the path of the imported file(s)):

Fig. 9

Finally, if you click directly on the node in GZ, it turns “green”, and a node zone appears in VZ, providing some basic info about the node:

Fig. 10

As you will see later, DEXTER handles many kinds of objects (data sets, visualization reports, stat reports…), all represented by nodes, with possible links between them. Here is an example of a more substantial universe:

Fig. 11

Figure 11 shows a universe of 12 data nodes with many son nodes. Clearly this user did not opt for clarity in the choice of node names! Of course, with such a substantial universe, it is useful to focus on specific areas: for this, VZ contains the view zone:

Fig. 12

Clicking on the name of a data node refreshes GZ and TZ, showing only this node and its sons. Clicking on the Overview button brings you back to the full view of the universe. You can also play with the crux button of GZ (bottom right), which allows you to design your own view (we describe the zoom feature in the next section).

A remark: when adding a node to a universe, an algorithm recalculates the positions of all the other nodes in order to keep a suitable overall view. This might be quite surprising at the beginning.

I forgot an obvious thing: when you click on a data node, an EXTRACT of the data is shown in the table zone TZ. After importing your first node, you should see something like this:

Fig. 13

At the bottom of VZ there is a Log zone, which stores all the actions you performed on the selected node. For instance:

Fig. 14

There is a lot more to say about all of this, and part of it will be covered in the next sections. Just remember that this document is a light user manual. As you use DEXTER, you will see that a lot of information is also given throughout the GUI (graphical user interface) to help you understand the features!

The graph view philosophy

Why did we at NEHOOV opt for a graph view of data sets and their son reports? After all, a universe is just some kind of folder where data sets and reports are stored! The answer is simple: in DEXTER, most of the actions you perform on a node can be performed by creating a subnode. It was then obvious to us to link all these nodes in graphs.

Many people manipulate data directly in an Excel file or a database, and sometimes they corrupt the data set. In fact, not sometimes, most of the time! Remember that DEXTER is not devoted to data miners but to operational people. Helping them organize their work sessions was one of our goals. Indeed, by creating subnodes, we keep the father node intact. For many actions you are by no means forced to create a subnode, you can stay in your data node, but we strongly advise you to create one!

This is also a nice way of working: you can “fork” your father node into 2 subnodes, for instance, representing two distinct pieces of data (different time windows, different types of clients…). Working with a graph view forces you, and helps you, to streamline your actions.

Having settled on this graph view paradigm, it was obvious to us to do the same for all the reports: we designed them as specific nodes. In this free version of DEXTER, they are not numerous. Below is the list of all our node icons:

Data / Stat report / Cross stat report / Coloured XY-chart visualization / F-inverse clustering report / Topological N-binning report / 2D t-SNE projection visualization

Fig. 15

Hence one icon for data, 3 for reports, and 3 for visualization features. In fact, the Topological 1-binning report is half report, half visualization… we will see this later, as well as the fact that reports & visualizations are displayed in TZ, temporarily replacing the data table.

Let’s go back to the graph visualization feature. When the graph is huge, as already mentioned, there is a crux at the bottom right that provides some zoom features. Figure 16 gives the example of a 4-binning graph, in the TZ zone, where the zoom crux is also present:

Fig. 16

After clicking on the crux, the zoom window opens:

Fig. 17

The left & right arrows let you go back & forth through the last selected zoom options. The ‘house’ icon zooms on a small area at the top left of the graph. Of the 3 remaining icons, the first zooms out to the global overview of the graph, the second zooms on the ‘centre’ of the graph, and the third keeps the current zoom level but centres the view on the middle of the graph. The visible area is outlined in the zoom window by a red square, whose borders can be dragged to resize it. One can obtain, for instance, the following view:

Fig. 18

If you click on the red square and hold the button down, you can drag the square to another area, allowing you to wander through the graph! You can also click on any node to get information in VZ. Figure 19 shows such info for a node of a 4-binning, where the min & max of some fields are given:

Fig. 19

Again, the zoom features can be used in both GZ and TZ. Hence, in GZ, the data node and report node area, you have 2 possibilities if you want to zoom in on a data node and its sons: the zoom crux, or the data node list in VZ.

The table zone TZ

First, an important point: DEXTER is not EXCEL. You cannot wander around the data table, changing values or sorting in various ways: we decided not to allow free manual editing of data. The TZ view is also more rudimentary than the Excel one (remember, we use free components). In practice, TZ should be reduced to its minimum: the data table is there mainly for illustration. We also decided to display only the first few records by default, which makes sense when handling huge data sets.

In fact, DEXTER is devoted to data reshaping BY USING ALGORITHMS (not manually), data analytics reports and data visualization. That’s it! If you want to manipulate by hand the data you created in DEXTER, just export it and use your favourite tool!

Anyway, let’s take a tour of TZ:

Fig. 20

We added a frozen record counter as the first column. Above the table, you can set the first record you want to see and the number of subsequent records to view (“Nb records”). The same goes for the fields, with First visible field and Nb fields. You can also sort the data according to one field selected in a list box, and choose the number of digits displayed for numeric values. That’s all! These very basic options allow you to view pieces of data in a very big data set. Again, DEXTER is not a tool devoted to data table viewing.

A useful remark: if you reduce the number of visible fields, then leave DEXTER and forget about it, you might be quite surprised when you come back. Don’t worry that part of your data has disappeared: just check the value of Nb fields and compare it with the total number of fields given in the node zone!

Another important remark: in DEXTER the last field is flagged “Output” (see the purple-circled last field in Fig. 20). This is the default, and you can change it in the Field menu (see later). This implicit choice comes from the forecasting features: Neural Network models generated by PREDICT can be used directly in DEXTER, and by default in PREDICT the forecasting target is the last field.

The Universe menu

We already used this menu for import, but there are some other features:

- Create subnode makes a copy of the currently selected node as a son node.
- Duplicate data node makes a copy of the currently selected node as an independent node (no link with the selected node).
- Delete node allows you to delete a node.
Fig. 21

- Create and duplicate are useful if you don’t want to overload the graph of the current node.

- Duplicate & Delete can be also triggered just via a right click on a node. Caution: the erased node is the node clicked by the user, NOT the current selected node. Thus you can select a data node, and then delete a son node just by right clicking!

- Manage nodes is a feature disabled in the current version.

- Export creates a csv copy of all the data in the selected node.

- Smart Export creates several csv files from 2 sets of fields. This is an example of a feature requested by one of our clients! Selecting this feature makes the following GUI appear:

Fig. 22

The user selects one or several fields F1…Fp in the left table, and other fields G1…Gq in the right table. After clicking ‘ok’, you must give a name to this smart export, for instance ‘sm1’. Then q files are created, with names ‘sm1 N(G1)’, …, ‘sm1 N(Gq)’:

Fig. 23

Each file ‘sm1 N(Gi)’ contains all data related to fields F1,…,Fp, and Gi.

Here is an obvious example of use: if a data node contains 12 monthly cash-flow fields called ‘January’, …, ‘December’, and for some reason you need to build 12 files each containing just one of these monthly fields plus all the remaining data, you just have to:

- Select all the fields except the 12 monthly fields in the left table,

- Select the 12 monthly fields in the right table.

Then the 12 files will be generated.
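
To make the F/G mechanics concrete, here is a minimal pandas sketch of the same idea (an illustration under assumed naming, not the DEXTER implementation; we render ‘sm1 N(Gi)’ as ‘sm1 <name of Gi>.csv’):

    import pandas as pd

    def smart_export(df, f_fields, g_fields, name="sm1"):
        """Write one csv per field Gi in g_fields, each containing the
        common fields F1...Fp plus that single Gi field."""
        for g in g_fields:
            df[f_fields + [g]].to_csv(f"{name} {g}.csv", index=False)

    # Example: the 12 monthly cash-flow fields of the use case above
    # months = ["January", ..., "December"]                  # the G fields
    # others = [c for c in df.columns if c not in months]    # the F fields
    # smart_export(df, others, months)                       # generates the 12 files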

- Export descriptor generates a csv file that contains information about each field:

Fig. 24

For each field, it summarizes the info given in the node info zone and in the statistics report subnode of the selected node.

Finally, the last feature, Universe preferences, allows the user to set the following parameters:

Fig. 25

The 3 deactivation options are linked with the Field menu features (see below).

When creating a new classified/truncated/shortened field F’ from a field F, ticking the corresponding box deactivates F after the creation of F’. Layout algorithm offers 8 algorithms for the graph representation of the universe; just choose the one you like!

Default block size, Max degree of parallelism and the next 2 tick boxes are the default values of these 4 parameters when importing files. Default block size is about your RAM size divided by 20, so that you can keep using your laptop while DEXTER runs. For an overnight computation you can increase the value, but one rule: stay below 70-80% of your RAM size, in order to avoid repeated Windows swapping!
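
As a worked example (the 16 GB of RAM is just an assumed figure):

    ram_gb = 16                       # assumed laptop RAM
    default_block = ram_gb / 20       # ~0.8 GB: the DEXTER default, laptop stays usable
    night_max = 0.75 * ram_gb         # ~12 GB: stays below 70-80% of RAM to avoid swapping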

Binning size and binning overlap are the default values used by the 1-binning algorithm of the Wrangling menu.
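
To illustrate what these two parameters control, here is a sketch of overlapping bins over one field, assuming the size is a bin width and the overlap a fraction of that width; this is only an illustration of the idea, not DEXTER’s actual 1-binning algorithm:

    def overlapping_bins(lo, hi, size, overlap):
        """Cover [lo, hi] with bins of width `size` where consecutive bins
        share a fraction `overlap` (0 <= overlap < 1) of their width."""
        step = size * (1 - overlap)
        bins, start = [], lo
        while start < hi:
            bins.append((start, min(start + size, hi)))
            start += step
        return bins

    print(overlapping_bins(0, 10, 4, 0.25))
    # [(0, 4), (3.0, 7.0), (6.0, 10), (9.0, 10)]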

The Field menu

This menu is for field management.

Fig. 26

- Activate/Deactivate field makes a field visible or not in the data table. In fact, in DEXTER you cannot delete a field, you can only deactivate it. However, if you move some fields to the deactivation zone (right table of Figure 27), select some of them and click on the ‘purge’ button, then these fields are erased forever (no recovery possible).

Fig. 27

- Move field changes the position of one or several fields in the data table: select the field(s) to move and a target field, then choose whether the fields are moved before or after the target field (move before / move after).

- The next 3 features are just a quick way to deactivate fields based on missing values, missing values or errors, or bad correlation. Bad correlation uses the correlation table given in the stat report (see later), and deactivates all fields whose correlation coefficient with the output field is below a user-specified value.
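
As a sketch of the bad correlation rule (our own illustration, assuming the output field is the last one and that the comparison uses the absolute correlation; DEXTER may differ):

    import pandas as pd

    def low_correlation_fields(df, threshold=0.1):
        """Return the fields whose absolute correlation with the output
        (last) field is below `threshold` - candidates for deactivation."""
        output = df.columns[-1]
        corr = df.corr(numeric_only=True)[output].abs()
        return [f for f in corr.index
                if f != output and corr[f] < threshold]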