Demystify Databricks
After drowning myself in the world of Databricks these past few days, I've decided to create some docs and utils to make Databricks programming bearable (esp for beginners) :bear:. I want to share these findings with you guys in case you find them useful too :grin:.
Data source explorations
- problems:
  - We (tvlk) have too many mount paths in our Databricks dir, so I made a table of the base mount paths I can use to read tvlk-data-datalake-prod data.
  - The Databricks docs are a mess. Info is jumbled everywhere; I had to read the PySpark source code to figure out how to do things.
- solution: created a doc that includes a summary of how to read / write data in different formats, shortcuts, links to docs, etc., all in one place, kept tidy and human-readable.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/315759
dbfs_utils*
- problems: searching directories is tedious, and running %fs ls / dbutils.fs.ls over and over again is troublesome
- solution: created utils functions to make directory searching & listing more efficient
- Ability to search traveloka data datalake prod's directory
  Eg: you want to find the event directories of the final hourly and daily flight booking and flight search data in avro format. You can use:
  search_tvlk_datalake_prod_directory(versions="final", time_granularity=["hour", "day"], file_formats="avro", keywords=["flight_booking", "flight_search"])
- Ability to get the full path of the first or latest partition of traveloka's event directories
  Eg: you know the event directory but you aren't sure about the first and latest available files under it. You can use:
  get_first_full_partitioned_path('/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/')
  or get_latest_full_partitioned_path('/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/')
  Only applicable to files mounted under /mnt/datalake-prod or /mnt/S3_*
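For the curious, the core idea behind the "latest partition" helper can be sketched in a few lines. The real util lists directories with dbutils.fs.ls; the hypothetical helper below takes the listing as an argument instead so it runs anywhere, and relies on the fact that zero-padded date partition names sort correctly as strings.

```python
# Sketch of picking the newest partition directory from a listing.
# The real get_latest_full_partitioned_path uses dbutils.fs.ls; here the
# subdirectory names are passed in explicitly for illustration.

def latest_partitioned_path(base_path, subdirs):
    """Join base_path with the lexicographically greatest partition dir.

    Works because zero-padded date partitions (e.g. 'dt=2019-02-01/')
    sort chronologically as plain strings.
    """
    if not subdirs:
        raise ValueError("no partitions under " + base_path)
    return base_path.rstrip("/") + "/" + max(subdirs)

dirs = ["dt=2019-01-30/", "dt=2019-02-01/", "dt=2019-01-31/"]
print(latest_partitioned_path("/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking", dirs))
# -> /mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/dt=2019-02-01/
```

The "first partition" variant is the same idea with min() instead of max().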
secret_utils & dbfs_utils_admin (admin only)*:
- problems:
  - storing credentials as plain text in Databricks notebooks is obviously dangerous and makes our data vulnerable to breaches (turns the data lake into a data leak :sad_pepe:)
  - we (tvlk) don't have a standardised way of storing credentials in Databricks
- solution: encrypt credentials with Fernet cryptography and store the encrypted creds in a secret file in S3. Notebooks then reference the path to the secret file instead of containing plain secret text.
- You only need two methods: create_secret(secret_text, path_to_secret) and get_secret(path_to_secret)
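Under the hood this is a Fernet round-trip. Here is a minimal sketch of that primitive, assuming the cryptography package; in the real utils the key and the encrypted blob live in files on S3, but here everything stays in memory for illustration.

```python
# Minimal Fernet round-trip, the primitive secret_utils builds on.
# In the real utils the key and ciphertext are stored in S3 files;
# in-memory here purely for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                  # kept safe by the admin
f = Fernet(key)

token = f.encrypt(b"my-database-password")   # roughly what create_secret stores
plain = f.decrypt(token)                     # roughly what get_secret returns
assert plain == b"my-database-password"
```

Fernet gives you authenticated symmetric encryption, so a tampered secret file fails loudly on decryption instead of silently yielding garbage.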
- Note: with secret paths we eliminate the temptation to put credentials as plain text in notebooks. I'm aware it may not be the perfect solution (alas, we don't even have one right now), so I'm open to discussions / suggestions.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/333186
df_utils & time_utils:
- problems: many operations are repeated across notebooks and can be abstracted into methods (DRY)
- solutions: created utils for extracting dataframe types, converting pandas dataframes to other types, getting the current time, etc.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327213
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/316425
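As a taste of the time helpers, here is a sketch of the kind of functions time_utils provides. The names below are hypothetical (the real API lives in the notebook); they just illustrate the "stop re-implementing datetime boilerplate" idea.

```python
# Hypothetical examples of the kind of helpers time_utils bundles.
from datetime import datetime, timedelta, timezone

def utc_now_str(fmt="%Y-%m-%d %H:%M:%S"):
    """Current UTC time as a formatted string."""
    return datetime.now(timezone.utc).strftime(fmt)

def date_range_strs(start, days, fmt="%Y-%m-%d"):
    """List of `days` consecutive date strings starting at `start`."""
    first = datetime.strptime(start, fmt)
    return [(first + timedelta(days=i)).strftime(fmt) for i in range(days)]

print(date_range_strs("2019-01-30", 3))
# -> ['2019-01-30', '2019-01-31', '2019-02-01']
```

Handy for building lists of daily partition suffixes without copy-pasting the same datetime arithmetic into every notebook.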
plot_utils*
- problems:
- matplotlib produces only static plots; we can't zoom or pan without changing our plotting code
- repeatedly importing plotting libraries and implementing the same plotting functions is tedious
- solution: developed automated plotting tools for time series data with Plotly, which supports interactivity (zooming and panning with the mouse). You now only need to call one method in your cell to plot a pandas df.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327470
Currently supports time series plots only; will add more if there are requests.
TL;DR
I created some notebooks containing documentation and utilities to make programming in Databricks more efficient.
- data_source_docs: one-stop documentation for exploring data in Databricks
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/315759
- dbfs_utils: lets you search event directories & get first / latest partition paths
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/326780
Command: %run /Users/deka.akbar@traveloka.com/utils/dbfs_utils
- secret_utils (admin only): encrypting & decrypting credentials
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/333186
Command: %run /Users/deka.akbar@traveloka.com/utils/secret_utils
- dbfs_utils_admin (admin only): complements secret_utils
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/332078
Command: %run /Users/deka.akbar@traveloka.com/utils/dbfs_utils_admin
- df_utils: spark and pandas dataframe explorations
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327213
Command: %run /Users/deka.akbar@traveloka.com/utils/df_utils
- time_utils: common time functionalities
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/316425
Command: %run /Users/deka.akbar@traveloka.com/utils/time_utils
- plot_utils: plot interactive time series data (able to zoom & pan)
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327470
Command: %run /Users/deka.akbar@traveloka.com/utils/plot_utils