Tutorial: Analyze Data using Notebooks

This tutorial shows how to leverage Polypheny’s Notebook feature to execute code and query data, and even perform a basic yet insightful analysis using Python. We’ll focus on querying the relational public.emp data source shipped with Polypheny for this tutorial.

Creating a New Notebook

  1. Open Polypheny-UI: Navigate to the Notebooks section in the main menu.
  2. Deploy Jupyter Server container: Make sure that the Jupyter server container is deployed. If it shows offline, click on the green Start Container Button and follow the instructions. Note that this requieres Docker to be setup.
  3. Create New File: Click the + button in the left sidebar.
  4. Name Your Notebook: Provide a unique name for your notebook.
  5. File & Kernel Settings: Make sure the file type is set to Notebook and the kernel to Python 3 (ipykernel).
  6. Create: Click on Create.

Congratulations, you have just set up a new notebook! This action starts up a Python kernel and immediately opens your notebook, ready for action.

Familiarizing Yourself with the UI

When your notebook opens, you’ll see a toolbar at the top and your new notebook displayed below it. By default, you have one empty code cell highlighted by a blue bar. Each cell has its own mini toolbar offering cell-related functionalities.

Markdown Introduction

  1. Switch to Markdown: Convert the cell type to Markdown using either the notebook toolbar or the cell-specific toolbar.
  2. Edit Markdown Cell: Click inside the cell’s editor and enter the following Markdown text:
    # My First Notebook
    This notebook demonstrates how to perform a simple 
    analysis of some data stored in *Polypheny*.
    
  3. Render Markdown: Click on the button next to the query cell to render the Markdown text.

Further information on the supported markdown syntax can be found here.

First Python Code Cell

  1. Insert a Code Cell: Add a new code cell below your Markdown cell.
  2. Write & Execute Code: Enter print('Hello World') and run the cell. You should see “Hello World” displayed as output.
  3. Save Your Work: Don’t forget to save your progress by clicking on the “Save” button in the toolbar.

Querying Data from Polypheny using Query Cells

  1. Add a Query Cell: Insert a new cell, but this time set its type to Query.
  2. Query Settings: Ensure the language is set to SQL, the namespace to public, and specify employees as the variable name for storing the query results.
  3. Execute Your Query: Run the following SQL query to get started.
    SELECT * FROM emp
    ORDER BY workingyears DESC
    
  4. Inspect Results: You’ll now see the 10 employees with the most working years. The output is limited to 10 rows because no LIMIT clause was specified.

  5. Modify Query: Update your SQL query to focus on the working years and monthly income of all employees.
    SELECT monthlyincome, workingyears FROM emp
    LIMIT 2000
    
Since the data set is large, Polypheny will show only the first 1000 entries for performance reasons. If you want, you can hide the table display by clicking on the blue bar on the left edge of the result cell.

Advanced Analysis: Gaining Insights from Query Results

Now we will analyse the data using Pandas.

  1. Add a Code Cell: Create a new code cell bellow your query cell.
  2. Data to Dataframe: Insert the following code to manipulate your data:
    df = employees.as_df()
    print(f"Total Number of Employees: {len(df)}")
    
  3. Visual Analysis: To better understand the distribution of working years versus monthly income, add this code:
    df.plot.scatter(x='workingyears', y='monthlyincome', title='Scatter Plot of Working Years vs Monthly Income')
    
  4. Execute Analysis Cell: Run the cell to see the result. This code block performs several actions:

    • First, it converts the query result stored in the employees variable into a Pandas DataFrame.
    • It then prints the total number of employees for your quick reference.
    • Finally, it generates a scatter plot that plots ‘Working Years’ against ‘Monthly Income’, offering a visual way to perceive any trends or anomalies in the data.

Closing Your Notebook

  1. Navigate or Close: Either navigate away from the current notebook using the left sidebar or click on the close button.
  2. Save and Shutdown: A dialog will appear asking if you’d like to save your notebook and shut down the kernel. Ensure both are checked, as it’s a good practice not to leave unnecessary kernels running.
  3. Finalize: Click on “Close & Shut Down”.
  4. Kernel Status: Finally, confirm via the notebooks dashboard that the kernel has indeed stopped running.
© Polypheny GmbH. All Rights Reserved.