Skip to main content
Big Data Test Infrastructure (BDTI)

Data visualisation

Description

Use case description - Problem statement

As data scientist, I need to extract insights and visualizations from the provided dataset to support the storytelling of the "energy balance in the EU" report.

Use case goals

  • Load the data
  • Query the data
  • Explore the different visualization opportunities
  • Select the best-fit visualization techniques
  • Create the visualizations

Tools & capabilities

In order to meet the use case goals, the following tools from the portal will be leveraged:

Tool Description Key capability
PgAdmin PgAdmin is the most popular and feature rich Open Source administration and development platform for PostgreSQL, the most advanced Open Source database in the world. Data load
PostgreSQL PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance. Data query
Apache Superset Apache Superset is a modern data exploration and visualization platform. It is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts. - Data pre-processing (null values, outbound values, ...)
- Explore the different visualization opportunities

- Select the best-fit visualization technique

- Create the visualizations

Dataset:

Energy Balance in Europe | Kaggle

Guide

This document is meant to guide the user through the use case 1 - data visualisation. As presented in the use case description, the goal is to provide relevant visualizations through Apache Superset, having the data stored on a DB in PostgreSQL as source. The guide will be a step by step tutorial towards such objective. More in detail, each subsection covers a step of the approach, namely:

  1. Step 1: Initialize the resources.
  2. Step 2: Configure pgAdmin and connect the PostgreSQL instance.
  3. Step 3: Load the data.
  4. Step 4: Configure Apache Superset and connect the PostgreSQL database.
  5. Step 5: Create a dashboard and populate it with charts.

Step 1: Initialize the resources

As first step, the user should inizialize the required resources. More in particular, three instances should be launched:

  • PgAdmin instance
  • PostgreSQL instance
  • Apache Superset instance

Initialize the pgAdmin instance

  1. Go on the Service Catalog section of the Portal.
  2. Click on the button Launch on the pgAdmin4 badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the default configuration.
  5. Set your pgAdmin email and copy the auto-generated password. This will be needed to access the instance in the later stage. (NB: Instance credentials are automatically saved and accessible on the My Data section of the portal).
  6. Launch the instance by clicking on the launch button.

Initialize the PostgreSQL instance

  1. Go on the Service Catalog section of the Portal.
  2. Click on the button Launch on the PostgreSQL badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the micro configuration.
  5. Copy the auto-generated password. This will be needed to access the instance in the later stage. (NB: Instance credentials are automatically saved and accessible on the My Data section of the portal).
  6. Once the instance is deployed, it will be possible to copy its host address by clicking on the related button Copy in the My Services section of the portal.

Initialize the Apache Superset instance

  1. Go on the Service Catalog section of the Portal.
  2. Click on the button Launch on the Apache Superset badge.
  3. Assign a name to the instance and select a group from the ones available in the list.
  4. Select the micro configuration.
  5. Set your Admin username, Admin email, Admin firstname and Admin lastname. Copy the auto-generated password. This will be needed to access the instance in the later stage. (NB: Instance credentials are automatically saved and accessible on the My Data section of the portal).
  6. Launch the instance by clicking on the launch button.

After having launch the instances, go on the My Services section of the portal to verify that all the deployed services are up and running (all three instances should be in status ACTIVE).

BDTI service deployment

Step 2: Configure pgAdmin and connect the PostgreSQL instance

  1. From the My Services section of the portal, click on the Copy button of the PostgreSQL instance to copy the host address of the instance in the clipboard.
  2. The PostgreSQL host address is in the format:

        <postgreSQLHost>:5432
    
  3. Go on the My Services section of the portal and open the pgAdmin instance.

  4. Login into your pgAdmin account whith the credentials defined in the configuration.
  5. In the main page, click on the Browser tab: this will allow you to browse on all available resources on your pgAdmin account.
  6. On the Browser tab, right click on the Servers item > Register > Servers... . This will open up the configuration modal for registering a new server on our pgAdmin account: the PostgreSQL database will be added.
  7. The configuration modal contains five fields that need to be compiled:

    • Hostname/address: this is the hostname/address of the PostgreSQL instance. Input here the <postgresSQLHost> (see Step 2.2)
    • Port: 5432
    • Maintenance Database: postgres
    • Username: postgres
    • Password: input the PostgreSQL password defined in the configuration.
Service deployment

Step 3: Load the data

In order to load the data into the PostgreSQL database, we need to perform some operations on pgAdmin. Please note that pgAdmin is the management interface for PostgreSQL, so we need to use this service to interact with our DB on PostgreSQL. Finally, through pgAdmin we will:

  • Create the data table
  • Define the data schema for the table
  • Import the data from the source file

Create the data table

  1. In the Browser section, go to the Schemas and expand the public schema (postgres>Databases>Schemas).
  2. In the public schema, right-click on Table>Create>Table.. . This will open up the configuration modal for creating a new table on our public schema: in specific, the data table will be created.
  3. In the configuration modal, assign a name to the table (i.e. energy_balance_eu) and click on save.

Define the schema for the data table

Please note that to import data from a csv file it is preliminarly needed to manually define the related data schema (namely, to add the columns of the table). This is achieved by following the steps:

  1. Right-click on the newly created table and click on Query Tool. This will open an SQL query tool (Query Editor) that will allow to perform queries on the database. Hence, the SQL query tool will be used to alter the current data schema and to add the columns present in our data source.
  2. Paste the following SQL query in the Query Editor:
ALTER TABLE public.name_of_your_table
ADD COLUMN nrg_bal text,
ADD COLUMN siec text,
ADD COLUMN unit text,
ADD COLUMN geo text,
ADD COLUMN TIME_PERIOD int,
ADD COLUMN OBS_VALUE float;

NB: Make sure to substite name_of_your_table with the table name defined in the previous step.

  1. Execute the SQL query by clicking on the Play button (or by clicking F5)
  2. Verify that the columns has been created by checking the table properties (right click on the table name > Properties > Columns).

Import the data from the source file

  1. Download the data source csv file from here
  2. In the navbar, go on File > Preferences > Storage > Options and set the Maximum file upload size (MB) to 10000. Click on Save.
  3. Right-click on the newly created table and click ok Import/Export Data.... This will open an Import/Export Data tool that will allow us to import the data from the source file.
  4. Select the Import option.
  5. In the File Info section, click on the folder icon. This will open the Select file modal.
Import/export
  1. In the navbar of the Select file modal, click on the Upload file icon.
Import data
  1. Drop the energy_balance_eu.csv file and wait for its uploading to be done.
  2. The uploaded file should be now visible in the list of available files. Click on it and press Select.
  3. In the File Info section, select csv as Format.
  4. In the Miscellaneous section activate the Header option. This will allow PgAdmin to take into account that the first row of the file contains the table headers (title of the columns).
  5. In the Miscellaneous section, select the comma (,) as Delimiter.
  6. Click on Ok and wait for the import process to be done.

Step 4: Configure Apache Superset and connect the PostgreSQL database

Add the database to Apache Superset

  1. Go on the My Services section of the portal and open the Apache Superset instance.
  2. Login into your Apache Superset instance whith the credentials defined in the configuration.
  3. In the navbar, click on Data > Databases.
  4. Click on the " + DATABASE " button. This will open the Connect a database tool that will allow us to connect the PostgreSQL database to Apache Superset.
  5. Select PostgreSQL as database to connect.
  6. Enter the required PostgreSQL credentials:
    • HOST: this is the hostname/address of the PostgreSQL instance. Input here the <postgreSQLHost> (see Step 2.2)
    • PORT: 5432
    • DATABASE NAME: postgres
    • USERNAME: postgres
    • PASSWORD: input the autogenerated PostgreSQL password. You can retrieve it from the My Data section of the portal.
  7. Click on Connect.

Add the dataset to Apache Superset

  1. In the navbar, click on Data > Datasets.
  2. Click on the " + DATASETS " button. This will open the Add dataset tool that will allow us to connect the dataset from the PostgreSQL database to Apache Superset.
  3. In the DATABASE field, select the database that has been connected to Apache Superset in the previous step.
  4. In the SCHEMA field, select public.
  5. In the SEE TABLE SCHEMA field, select the table that contains the data imported in the previous steps (i.e. energy_balance_eu).
  6. If the dataset has been connected succesfully, it should be visible in the list of available datasets.
  7. For the newly connected dataset, in the Actions section press on the Edit icon.
  8. Go on COLUMNS and tick the property Is temporal and Default datetime for the column time_period. This will allow Apache Superset to deal with such column as a temporal dimension, and this will be needed for all time-related data visualizations (eg. bar charts, time series, ...).
  9. Save the changes by clicking on the Save button.

Step 5: Create a dashboard and populate it with charts

Create a Pie chart/Bar chart/Box plot

  1. In the navbar, click on Charts and then on the button + CHART to create a new chart.
  2. In the Choose a dataset section, select the dataset connected in the previous step.
  3. In the Choose chart type section, select the Pie Chart/Bar chart/Box plot and click on CREATE NEW CHART.
  4. Assign the new chart a name by editing the text Add the name of the chart.
  5. In the DATA tab, select time_period as TIME COLUMN.
  6. In the DATA tab:
    • Select geo as GROUP BY (for the pie-chart)
    • Select geo as SERIES (for the bar chart and box plot)
    • Select Original value for the TIME GRAIN (for the box plot).
  7. In the DATA tab, click on Add metric > SIMPLE and select obs_value as COLUMN and SUM as aggregate. Click on SAVE.
  8. In the DATA tab, click on Add filter > SIMPLE, select geo as column, Not equal to as operator and EA19 as Filter value. (Repeat the exact same operation twice by putting EU27_2020 and EU28 as filter value. This will exclude these columns from the chart, since these rappresents the aggregated values at EU level).
  9. Click on RUN and then on SAVE to save the created chart.
Superset

Please note that you can also generate alternative pie charts/bar charts by grouping for nrg_bal/siec /unit by selecting these at step 7, or by creating different aggregations at step 8. In addition to this, you can customize the chart options by clicking the CUSTOMIZE tab.

Create a World map chart

  1. In the navbar, click on Charts and then on the button + CHART.
  2. In the Choose a dataset section, select the dataset connected in the previous step.
  3. In the Choose chart type section, select the World map chart and click on CREATE NEW CHART.
  4. Assign the new chart a name by editing the text Add the name of the chart.
  5. In the DATA tab, select time_period as TIME COLUMN.
  6. In the DATA tab, select geo as COUNTRY COLUMN and code ISO 3166-1 alpha-2 (cca2) as COUNTRY FIELD TYPE.
  7. In the DATA tab, select geo as DIMENSIONS.
  8. In the DATA tab, click on Add metric > SIMPLE and select obs_value as COLUMN and SUM as aggregate. Click on SAVE.
  9. In the DATA tab, click on Add filter > SIMPLE, select geo as column, Not equal to as operator and EA19 as Filter value. (Repeat the exact same operation twice by putting EU27_2020 and EU28 as filter value. This will exclude these columns from the chart, since these rappresents the aggregated values at EU level).
  10. Click on RUN and on SAVE to save the created chart.
World chart map

Create a dashboard and populate it with charts

  1. In the navbar, click on Dashboards.
  2. Click on the " + DASHBOARD " button. This will create a new dashboard that will contain our data visualizations.
  3. Assign a name to the dashboard by editing the text [untitled dashboard].
  4. On the right part of the dashboard page, in the CHARTS tab drag and drop the desired charts from the list of the available ones.
World chart map dashboard