Click to show TOC

Data Frame

Package Deep Dive

You can explore the package and explore the some of the core caplbilities by clicking Package Deep Dive

 
 
     

Summary

Frame Docs

This is a database like data manipulation library. It lets you do simple database SQL like operations on in-memory tabular data. The main component of the library is a Frame. This is a data structure that contains tabular data made up of of rows and columns. Each column has a name, and just like the columns of a database table the names must be unique. The tool has been tested with 1 million rows with good performance. The total amount of data is limited by the memory of your browser or node instance. All elements (cells/column) are stored as strings or numbers.

A Frame That conversion is done by the frameFromBuffer function. Please note thst the frame is aware that some numbers may be stored as strings, and most of the utilities will treat numbers in string format as if they were numbers. Strictly speaking you can store any any javascript primitive is a cell, but that has not been fully tested and so buyer be aware.

The operations include the ability to do the following:

  • Convert CSV, TSV data to a frame
  • Create a subset of the data using the filter or select method similar to the sql select, including
    • Add new columns, optionally with new data
    • Reorder couumns
    • Rename columns
    • Systematically change data is columns
  • Join a frame with another frame similar to SQL innerJoin, leftJoin and outer join
  • Sort data of several columns ascending or descending on individual columns
  • Group by operation base of a set of columns, with grouping operation on some of the other columns:
    • gb.count count the number of non-empty values is a group
    • gb.max - maximun value of a column in a group
    • gb.min - the minimum value of a column in a group
    • gb.min - the minimum value of a column in a group
    • gb.mean - the average value of a column in a group (ignoring empty values)
    • gb.stdDev - the standard deviation value of a column in a group (ignoring empty values)
    • some others

Introduction

The inspiration for data frame came from two sources, SQL and the powerful python pandas utility. Frame does not attempt to provide all the functionality of pandas or that of SQL. The examples provided here will give a taste of some of the things it can do.

Overview

  • Data Frames strongly support a functional style of programming
  • One a Frame is created it is never modified
  • All Frame operations create a new Frame
  • Frames use data sharing to minimize the memory footprint
  • Internally the data is stored as an array of arrays
    • aFrame.data is an array of rows
    • Each row is an array of column values
  • aFrame.columns is an array of strings representing the columns names
  • Rarely will you have to access the data directly,
    • You must never modify the column or data array
    • Accessing the column data using a index is very inconvinent and rarely do you have to do this
      • For your convinence a row object is created for ease access




Page - 1




Getting started

we have to bring in the str-data-frame npm module. We have several choices, using javascript es6+ module symtax, using webpack or similar bundler, or just as a script. For rhe purposed of this tutorial I am using the good old script method. I creates in the global namesapce DataFrame variable.

DataFrame package

Lets do a quick look a what is defined in the package. In this html page we use the following unpkg.com to load the str-data-frame package. The exported name from the Package is DataFrame.

  <script src="https://unpkg.com/str-data-frame@0.2.18/dist/bundle.js"></script>
  • Get the names (keys) in the module using Object.keys(DataFrame)

  • map the list of names to name/value pair [name, value]

  • map the [name, value] to [name, type of value, display string for the value]

  • result is put into data

    Note: We use he two following helper function

    Display function D(value)

TryitJS Source
!js
function D(d) {
    if(d === undefined ) return "<undef>";
    if(typeof d === 'function') return "<func body>"; 
    if(d instanceof Date) return d.toString;
    if(typeof d === 'object') return JSON.stringify(d)
    return d.toString();
}

function getType(d) {
if(d === undefined) return 'undefined'
if(Array.isArray(d)) return 'Array';
if(isClass(d)) return "class";
return typeof d;

}

Quick Start

So you are in a hurry and would like to see if Frames are yor you. If you have table like data, Data Frames are for you. Think of them as fast in-memory database tables.




Page - 2




Get Data - World Covid Report

Use 'fetch' to get some raw data

We will use world covid-19 daily statistics, this is a csv file that we will split into an array of lines. We can use the built in fetch function to pull data from the public repository on Github. Fetch returns a promise. To learn more about Promise on the link. Once the promise resolves, the data will be available for use. As always last expression in the script will be displayed. If the value is a promise, TryITjs will wait for the promise to resolve and then display the result.

 
 
     

Split into lines

Since csvData is just a big string we can use the string.split opertation to convert it to an array of lines, where each line is a string.

 
 
     

Create a frame from the data

Since the CSV data is just a big string, to converts it to a frame we have to do the following

  • Split the string into an array of lines, and javascript string already supports the .split('\n') method
  • Now each line is a string, we have to split each line into an array of columns
    • We cold use split again on " , " but CSV files allow us to escape a comma by surrounding it with " , " (comma). Since split will split on every comma we cannot use that.
    • csvLine(aLine) will split a 'CSV' line string into an array of column values. This is also known as a line splitter, the following line splitter exist
      • csvLine - for CSV files
      • tsvLine - for tab seperated values
      • psvLine - '|' pipe seperated value. THis si a common format in healthcare
  • The columns of the first line is used as the column names, the remaining lines are the data
  • We have a convinience function to perfrom this task `Frame.frameFronString(string, )``

The following code is a simplified version of what .frameFromString does

//frameFromString is at its core the following code
   var lines = csvData.split('\n').map(csvLine);
   var covidFrame = new Frame(lines.slice(1), lines[0]);

Create a frame from string in the CSV file using the script below. Once we have the Frame we can show the first 20 rows using .slice similar to slice on a Javascript Array, and as always the last expression is displayed in the results area.

 
 
     

What happened

We converted a string to a frames.

  • csvData The string beinng the contents of a CSV (comma seperated values) representation of a table.
  • csvLine _function to convert one row (in string form) into an array of strings.
  • Note: The frame data is nicely rendered in the output area. TryITjs when displaying data checks to see if the object implements obj._toHtml() and uses that to render the display.

In the next section we will see how we can further improve Frame rendering by overriding the _toHtml method. Although in the tutorial for Frame we will not be using this capability, but it is a useful thing to know for those creating their own TryITjs tutorials.




Page - 3




Improving rendering of frame

A Frame has a method ._toHml() to render a frame as HTML. This is a nice Table rendering module DataTable. Details for useing this capability can be found here: DataTable - a plug-in for the jQuery Javascript library.

More details about rendering a DataFrame (or any other object) to HTML:

We can customize the rendering as follows:

 
 
     




Page - 4




Lets plot some data

We will plot some data from the covid Frame for USA

  1. First select data for USA
  2. Use plotly.js to plot come data
 
 
     

Notes on plotting

Tryitjs does not have plotting out of the box, but capability was added with the following code.
Click to show the code

Source Code for Plotting Helpers
!head
<script>
function Identity(x) { return x}
function genID(prefix) { return (prefix || 'T')+Math.round(Math.random()*100000); }
function capitalStr( phrase ) { return phrase.replace(/\b\w/g, c => c.toUpperCase());}
function addDiv(id, ext) {
    let targetDivName = `#${id}_${ext}`;
    let targetDiv = document.querySelector(targetDivName);
    if(!targetDiv) {
        targetDiv = document.createElement('div');
        targetDiv.id = targetDivName.substr(1);
        let sibling = document.querySelector('#'+id);
        sibling.after(targetDiv);
    }
    else targetDiv.innerHTML = '';
    return targetDiv;
}

function plotDiv() {
    let dn = $$.executeDiv;
    return addDiv(dn+'-display','plot'); // add a div for plotting after the display div
}

function HTML(data) {
    if(data === undefined || typeof data._toHtml === 'function') return data;
    return { _toHtml: () => data }
}

function multiSelect(opts,list) {
    let {onchange, id} = opts || {};
    id = id || genID();
    let html = `
        <div id="${id}" class="ui multiple selection dropdown">
          <input name="thick" type="hidden">
          <i class="dropdown icon"></i>
          <div class="default text">100</div>
          <div class="menu">
            ${list.map(([key,val]) => `<div class="item" data-value="${key}">${val}</div>`).join('\n')}
          </div> 
        </div>
    `;
    
        onchange = onchange || Identity
        $$.lastly(() => {
            $('#'+id).dropdown();
            if(opts.onchange){
                $('#'+id).change( function(e) { 
                 let list = $(e.currentTarget).dropdown("get values");
                 onchange(list,e);
               } );
            } 
         });
    
    return HTML(html);
}
</script>

!head
<script>
  function newPlot(dn, data, layout,config) {
   
   if(typeof dn === 'string') {
      Plotly.newPlot(dn, data, layout, config);
      return dn;
   } else {
      let div = genID($$.executeDiv);
      let [_data, _layout, _config] = [dn, data, layout]; // no dn parameter
      $$.lastly(true,() => Plotly.newPlot(div, _data, _layout, _config));
      return HTML(`<div><div id="${div}" /></div>`)
   }
  }
 </script>




Page - 5




Organize the covid data

Select Critical Columns

In this section we will show how to:

  1. Filter unnecessary rows
  2. Select the following columns from covidFrame:
    • iso_code -
    • location -
    • continent -
    • date
    • new_cases
    • new_deaths
    • moreā€¦
    • mortality _Computed column 100*total_deaths/total_cases percent of people who die after contracting COVID-19
  3. To create a extra column provide the following [ columnName, functionToComputeValue],
    • The name of the column should not conflict with the name of any other column (in the selection)
    • The alternate to a function is to provide a value
    • You can use this to modify value of an existing column
  4. Change the name of a column 'oldName=newName'

Note we create a new frame, the riginal frame 'covidFrame' remains unchange. Frames should always be treated as immutable data. Modifying a frame may prevent it working correctly. It is highly recommended that you never manipulate the internal (instance) vatiable of a frame.

 
 
     




Page - 6




Rolling Up Data

This section will demonstrate the groupBy capability of DataFrame to create summary of data.

Roll up data into continents

Next we want to rollup the data into continents, for this we will use the group the data by continent

 
 
     

This is not quite what we need, since this is just giving us the continent names, we also need to keep the date column

 
 
     

Next we want to aggrigate the total_cases and total_deaths, so we need to sum (add up) the thos value for every country in a continent and also on that date. We have a operation for that in the groupBy,

  • gb.sum('column_name')
  1. Remove locations that don't have an associated continent using (.filter)
  2. Group by continent and date
    1. In each continent there will be many countried so we have to aggrigated the data for the individual countries.
      • total_cases - sum them (gb.sum('total_cases'))
      • similarly for total_deaths, new_cases, new_deaths
      • We have to compute total_deaths_per_million and total_cases\per_million_
  3. After the groupBy operation we have the data to compute data for the two placeholder columns
    • We use the .update method to two placeholder columns
 
 
     

What we did

We saw earlier the original 'covidFrame' contains a large quantity of daily covid information for every country. A row in that frame represented a country and date and information related for that day, as well as some fixed (not date dependent) information about the counter. From the 'covidFrame' we create a new frame ('countries') selecting only the columns we need for the task at hand.

The we create a frame for each continent by aggrigation daily data for each country in a continent, (group by 'continent' and 'date'). Further, we created two computed columns for each continent ('total_death_per_million', 'total_cases_per_million'). Note: it is easier to recompute that for a continent rather than extract from the individual nation's column for the same.




Page - 7




Frame summary - continent

In this section we will show how to create summary of the world codiv date to create summary information for countries and continents.

We saw earlier the original 'covidFrame' contains a large quantity of daily covid information for every country. A row in that frame represented a country and date and information related for that day, as well as some fixed (not date dependent) information about the counter. From the 'covidFrame' we create a new frame ('countries') selecting only the columns we need for the task at hand.

Next we

Convert Into Utility Function

We are going to turn the operation above into a utility function that takes a country frame and and create a continent frame, where the entire continemt structurally looks like a single country as we demonstrated above. But the display can be misleading, since there som countries where the cases and deaths are so high that they skew the result for the other countries in the continent. One way of dealing with the problem is to peel away the country from its continents and to treat the country as a continent in its own right.

The easiest way to do that is to set the country

 
 
     
 
 
     


You will notice countries have a 'iso_code' and a 'location' namely the country (loca)

Covid situation in every continent

This is another roll up process where the dates are not important, only the largest calues of total_cases and total_deaths

 
 
     

Let's get a pie-chart

We will use the continentSummary frame to build pie charts from the data. Again we will be using the powerful but simple to use Plotly libray.

Create a function

Lets create a helper function to create a pie charts. This function will use the .rawColumn(colName) method to get a column from a frame as simple javascript array.

Just a note: _ rawColumn will get the values in the column in their format in the frame, if a column has numeric values but is string form, the will remain a string. Plotly will kindly convert strings to numbers, where numbers are required.

  • A frame with the data
  • some plotting options { opts } object
    • value - column name for the value
    • *
 
 
     

Plot Pie chart

Now plot some pie charts. The countries with the most cases are removed from their respective continents and are treated as if they were their own continents. So for example North America excludes the United States.

 
 
     




Page - 8




More Rollup

This time we will rollup data into countries and continents

  1. We have a frame for continents
  2. We will make continents frame have the same shape as the country data
  • We need to add 2 columns iso_code and location
  • Since these columns do not exist for continents frame, we will have to compute that
    • The simplest way is to use the same value as the continent column
    • We will also create a new coulmn mortality this value is computed by dividing total_deaths by total_cases

In the following section we will show you how to do that using the select method of a frame.

Countries and continent

The countries and continents are similar but not identical, so we cannot combine them into a single table.

  1. Continests frame do not have iso_code and location columns.
  2. We can fix that using frame.select
 
 
     

So What Happened

To select a few columns from a frame (including new ones)

  1. since there is no 'iso_code' for a continent let set it to 'continent'
  2. since there is no 'iso_code' for a continent let set it to 'continent'
  3. Add a new column mortality (%) total_deaths/total_cases - a computed value
 
 
     

Concating two frames

We can create a new frame by concatinating two frames, as long at they hace the same number and order of the columns. In the previous script we created

  1. A countries frame
  2. And a continents
  3. We made sure that the continents frame has 'iso_code' and 'location' columns

Now that we have the countries frame and the continents frame in the same shape we can combile the into one table using the concat method

 
 
     

Plot USA Covid data

Plot the data using the beautiful plotly library. So first thing we need is a plottting function based on the plotly.js library. This library has to do the following:

Utils for doing line plots

 
 
     

Plot Country Covid data

  • From the we have already created the country and continent data frame. THe frame consists of the following fields:
    • iso_code
    • location
    • date
    • total_cases
    • total_deaths
    • mortality
  • Creating continent wide data

Data Frames have a powerful aggrigation capability in frame.groupBy, this capability is built on an similar capability in SQL, namely groupby. The key idea here is to collect into groups all rows that have the sane value in a column, or a set of columns. On each group we can the apply come aggrication operation on some of thr remaining columns.

 
 
     

Some UI Test

 
 
     




Page - 9




More Covid Frame

Show the size of the frame and the column names

  • Length of the frame (covidFrame.length)
  • All column Names (covidFrame.columns)
  • List the columns that are mostly numeric (covidFrame.numericColumns)
 
 
     

Show the data summary as a Frame

In order to create the frame we need to do the following

  1. Function to count the number of non-empty values in a column countNonEmptyCells(columnName)
  2. Function to get the type of the column Numeric or Alpha columnType(colName)
 
 
     




Page - 10




Get rid of unnecessary columns

In sql we can do this using select columns from table, further the select operation can rename the columns, add new columns, and transform data is columns

We use a similar operation for a data frames using the project

         frame.project([ list of columns])
  • We can change a column name with the following code, change location to country
     [..., "location=country",...]

We also can reorder the columns by chosing the order of the columns, the example below shows how to create a new frem from a subset of the original columns and renaming a column

Note We can also add new columns by adding a new column name

Removing data redundency

In the covid table there is a great deal of data redendency and unnecessary data for example for each row has a iso_code for each country, country name, and continent. In database terminology we say that the data is not normalized. In this section we will show how we can normalize the data.

This section we will look at how to create the following:

  1. Create frames with only some of the columns
  2. Create frames to remove redundent data.
  3. How to combine data from two frames

But first lets fill some missing data

  • iso_code is the unique country code
  • Some locations are not countries so they do not have a iso_code or a continent
  • for thos missing value let us set the iso_code and continent to the value of location
 
 
     

Lets fix those values

Looking at the results notice the following:

  1. International does not have an iso_code or a continent
  2. World does not have a continent

We can fill those values with the value in location. We can do that will the code below:

 
 
     

Select a subset of columns

Select the most important columns using Frame.select method, in this prosess we can:

  1. Rename columns - 'location=country' _rename location to country
  2. Create new columns - ['INDEX', (value, row) => row.__index$] -- create and index column

So the following will create a ne Frame object with only the selected columns, and also the extra INDEX column we have created.

Note: sample(precent) is s random true,false genrtator (function) that will return true on a percent of the calls. This is very useful for sampling the data from a large frame. cFrame

 
 
     

Country Unique Values

In this there columns in the original data that remains the same for a country and does not change by date (for this dataset). So we can use the Frame.groupBy method to create a frame that

 
 
     

Analyze Columns

But first we need to find how many unique values in a column, using a .groupBy operation. The gb.uniqueCount accumulator does not exist. So this section will demonstrate how to use the gb.make method to create our own accumulator. First, the accumulator methods such as gb.count or gb.max are the same as a Array.reduce operation with a small twist.

Consider performing max of an array using reduce:

 
 
     

As you can see this is not what we wanted, we have the sum and the count, but not the average. We need to perform one final operation to get our desired value.

 
 
     

Typically after a reduce operation we may need to do some finalization operation, and a reducer function will need to do this. We do thsi by providing an action parameter to the reducer:

Action Description
1 perform accumulate action
2 perform finalize action


The format for the reducer (for .groupBy accumulators) is as follows:

function average(action,accum, count, value) {
    if(action === 2) return [accum/count, count]; // finalize the result

    if(!isNumber(value)) return accum;            // ignore non-numeric
    return [accum+v, count+1]                     // add value to accum, increment count              
}

The reducer on its own is not enough, we need to wrap it in functionality yo make is a groupBy accumulator there is a handy utility to do just that:

gb.make(reducer_func, initial_value_func, optional_initial_count )

to make a gb. like func we do the following:

   let gbAvg = gb.make(average, () => 0);

   // now we can use it in a frame

Having learnt about gb accumulators

 
 
     

Count unique values for all columns

We are going to use uniqueCount to count the unique values for all the

 
 
     

Using transpose to flip the fame

The transpose of a frame is analogous to the transpose of a matrix or a spreadsheet block. We have the columns into row are roes into columns. The only twist to is is to choose the columns to use as the headings.

 
 
     
 
 
     

Common coutry columns

Finall if we put everything together we get the following

 
 
     

Get data for Brazil

Here we will use a simpler version of select that keeps all the columns as is and filters out some of the rows. This is the same as the following sql

 select * from covidFrame where location = 'Brazil' 
 
 
     




Page - 11




Split COVID frame

Country frame

Create a frame with general country information country information into another frame

When we look at the original covidFrame there is a lot of duplication of data. For example, each row of d

 
 
     

Summary of cases by country

  • Use the Frame.groupBy to group all data by country and continent
  • For each group (country) do the following * Get maximum of total_deaths and give the column the name deaths * Get count of number of rows (using the 'INDEX' colums) and call the column count, Note: count the number of non empty values * Get maximum of total_cases and call the column cases * Get maximum of population column, since this value does not change we can just use it as a part of the grouping operation
 
 
     

Lets try to improve the display of numbers

Frame supports number formattion function.

  • Set a global formatting function for numeric data Frame.HTMLFormat.number = toStr
  • toStr(s: string) takes a string (or number) and returns a fromatted numeric string with commas add to make is easier to read
  • This does not alter the data in the frame
 
 
     
  • As you can see the numbers are nicely formatted, Note: this is for display only, the data in the frame is not changed.

  • Please feel free the edit the code and




Page - 12




Summarize the information by continent

This section will reformat the data by continent to create summary by continent

  • Group By continent
 
 
     

Transpose

Transpose will rotate the table, the column name comes from the row in the country column. Pay special attention the 'country' column is call the transpose pivot and it is important all the values in this column are unique. If the values are not unique the result of the transpose will look very weird and may violate the requirement that all the column names rae unique.

 
 
     

Sample 1% (0.01) of the data and show the first 20

  • Previously we created cFrame from covidFrame just keeping the most important data
 
 
     
  • The code / display above demonstrates how to display a sample of the data

Create some utility function to round a numeric string

This section will show you how to display large numbers in a human friendly format

  • Since all data is stored as strings or number, empty values are represented ny the empty string
  • The Math.round function will convert it to a numeric round the value convert back to a string
 
 
     

Make sure no numeric column has empty values

To do this we use the frame.update method, this method take an update object as a parameter

  • We need to find columns that are mostly numeric '90%' or empty, frame.numericColumns will get us the list of those column names

  • To apply a update function on a column we use the Frame.update method.

  • This method takes a mapping objec as a paremeter

  • The mapping object has as the key the coulmn name and value is the function to apply to the coulmn

  • The function takes two values mappingFunc1(columnValue, rowObject) * The result of the function is set to the column

  • The mapping obj should look like

{
    col1: forceZero,
    col2: forceZero,
    ...
}
  • The columns are the numeric columns of the frame, and we can get that cFrame.numericColumns
 
 
     

What happened

This create and update mapping object. Let us apply this to update the frame (as always this will create a new frame)

 
 
     

Now I will show you a rather useful function in the DataFrame library

 **Before** we do this let us introduce some useful utility function in the DataFrame package __arrProd__. So let look at what we need to do: 
 
 
     

So using the arrProd becomes

 
 
     

Create a summary of covid data

 
 
     

Normalizing data

As you will notice in the covidFrame there is a lot of redundent data, for example most of the country data is repeated on ever row for thar country and there is some missing data for example iso_code is not present for every locaton. Since we would like to use the iso_code as the unique key for every country, let us find the locations with missing iso_code

 
 
     



Page - 13