The Coeo Blog

Databricks Structured Streaming - Part 2 (Preparing the Data)

Written by Andy Mitchell | 04-Mar-2020 12:30:00

Continuing on from our previous posts about Databricks, we are now going to look at structured streaming. For this blog post we are going to use some data from the Seattle fire service that is updated every 5 minutes.

The source data we are going to use can be found at https://data.seattle.gov/Public-Safety/Seattle-Real-Time-Fire-911-Calls/upug-ckch

  1. Navigate to the link above and familiarise yourself with the type of data that we will be using

         
  2. Click on the SOURCE DATASET "Seattle Real Time Fire 911 Calls" to display information about the dataset

        
  3. Navigate back to the previous page and click on the "API" link

        
  4. Make a note of the URL in the link (copy to the clipboard)

        
  5. In a new browser window paste in the URL to view the content from the API

         
  6. Use the Databricks cluster created in the previous post for the remaining steps

        
  7. Navigate to "Workspace"

        
  8. Navigate to "Shared" > "Create" > "Folder" 

        
  9. Enter the name "Introduction to Databricks Structured Streaming" and click "Create Folder"

         
  10. Navigate to the folder and click "Create" > "Notebook"

         
  11. Create a Python notebook called "Part 2 (Preparing the data)"

        
  12. Enter the following code into the cmd pane of the notebook. These are the variables that we will use for getting the data from the website and saving it to a file.


        
  13. Get the URL for the data stream and paste it in place of <URL> in the code above. The URL can be found via the "API" button on the following web page:
    https://data.seattle.gov/Public-Safety/Seattle-Real-Time-Fire-911-Calls/upug-ckch
         
  14. Add another cmd pane by clicking on the + that appears below the middle of the previous one and add the following code. This provides a function that we will use to save the stream to a file

        
  15. Add another cmd pane and add the following code to test that the data is captured to a file

        
  16. To test the code, you can either click "Run All" at the top of the notebook or run each cell individually using the play button in its top-right corner. The "Test the function" cell should return something similar to the following
        
  17. To build up several files, run the test cell a few times, leaving 5 minutes between executions so that the source data has time to refresh.
        
  18. Enter the following in another cmd pane to check that we have some files to process
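
As a rough sketch of what steps 12 to 15 describe (this is not the original code from the post), the variables and the save function might look something like the following. The names `url`, `output_dir`, `fetch_json`, `save_snapshot` and `capture` are all hypothetical, the output folder is an assumed local path, and the sketch assumes the API returns JSON over plain HTTPS without authentication.

```python
import json
import os
import time
import urllib.request

# Assumed placeholders: replace <URL> with the API endpoint copied in step 4.
# The output folder is a hypothetical path chosen for illustration only.
url = "<URL>"
output_dir = "/tmp/seattle_fire_calls"


def fetch_json(source_url):
    """Download the API response and parse it as JSON."""
    with urllib.request.urlopen(source_url) as response:
        return json.loads(response.read().decode("utf-8"))


def save_snapshot(records, target_dir):
    """Write one snapshot of the feed to a timestamped JSON file.

    Returns the path of the file that was written.
    """
    os.makedirs(target_dir, exist_ok=True)
    file_name = "calls_{}.json".format(time.strftime("%Y%m%d_%H%M%S"))
    file_path = os.path.join(target_dir, file_name)
    with open(file_path, "w") as handle:
        json.dump(records, handle)
    return file_path


def capture(source_url, target_dir):
    """Fetch the current feed and save it; returns the new file's path."""
    return save_snapshot(fetch_json(source_url), target_dir)
```

With the real URL in place, calling `capture(url, output_dir)` in its own cell would play the role of the "Test the function" step: it returns the path of the file just written. Because the file name is based on the current time, each run made a few minutes apart produces a separate file.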

We now have some data that we can use to test structured streaming.
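
One way to confirm that files are present is to list the output folder. The sketch below is an assumption about how that check could be written with the Python standard library; `list_snapshots` and the folder path `/tmp/seattle_fire_calls` are hypothetical names, so substitute whichever folder your own notebook writes to.

```python
import os


def list_snapshots(target_dir):
    """Return (file name, size in bytes) for each .json file, sorted by name.

    Returns an empty list if the folder does not exist yet.
    """
    if not os.path.isdir(target_dir):
        return []
    return [
        (name, os.path.getsize(os.path.join(target_dir, name)))
        for name in sorted(os.listdir(target_dir))
        if name.endswith(".json")
    ]


# Hypothetical folder; replace with the location used in your notebook.
for name, size in list_snapshots("/tmp/seattle_fire_calls"):
    print(name, size)
```

Each run of the capture step should add one more entry to this listing.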

In this post we have:

  • Created a function to pull data from an external API
  • Created JSON files that we can use as the source for our structured streaming dataset.

In the next post we will connect to the JSON files using structured streaming.