Demystify Databricks
After drowning myself in the world of Databricks these past few days, I've decided to create some docs and utils to make Databricks programming bearable (esp for beginners) :bear:. I want to share these findings with you guys in case you find them useful too :grin:.
Data source explorations
- problems:
  - We (tvlk) have too many mount paths in our Databricks dir, so I made a table of the base mount paths I can use to read tvlk-data-datalake-prod data.
  - The Databricks docs are a mess. Info is jumbled everywhere; I had to read the PySpark source code to figure out how to do things.
- solution: created a doc that includes a summary of how to read / write data in different formats, shortcuts, links to docs, etc., all in one place, kept tidy and human-readable.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/315759
dbfs_utils*
- problems: searching directories is tedious, and running %fs ls / dbutils.fs.ls over and over again is troublesome
- solution: created utils functions to make directory searching & listing more efficient
- Ability to search traveloka data datalake prod's directory
  Eg: you want to find the event directories of the final hourly and daily flight booking and flight search data in avro format. You can use:
  search_tvlk_datalake_prod_directory(versions="final", time_granularity=["hour", "day"], file_formats="avro", keywords=["flight_booking", "flight_search"])
- Ability to get the full path of the first or latest partition of traveloka's event directories
  Eg: you know the event directory but you aren't sure about the first and latest available files under it. You can use:
  get_first_full_partitioned_path('/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/')
  or get_latest_full_partitioned_path('/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/')
  Only applicable to files mounted under /mnt/datalake-prod or /mnt/S3_*
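For the curious, the core idea behind the "latest partition" helper can be sketched in a few lines. The real util lists directories with dbutils.fs.ls; the hypothetical helper below takes the listing as an argument instead so it runs anywhere, and relies on the fact that zero-padded date partition names sort correctly as strings.

```python
# Sketch of picking the newest partition directory from a listing.
# The real get_latest_full_partitioned_path uses dbutils.fs.ls; here the
# subdirectory names are passed in explicitly for illustration.

def latest_partitioned_path(base_path, subdirs):
    """Join base_path with the lexicographically greatest partition dir.

    Works because zero-padded date partitions (e.g. 'dt=2019-02-01/')
    sort chronologically as plain strings.
    """
    if not subdirs:
        raise ValueError("no partitions under " + base_path)
    return base_path.rstrip("/") + "/" + max(subdirs)

dirs = ["dt=2019-01-30/", "dt=2019-02-01/", "dt=2019-01-31/"]
print(latest_partitioned_path("/mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking", dirs))
# -> /mnt/datalake-prod/traveloka/data/v1/final/avro/day_1/edw.fact_flight_booking/dt=2019-02-01/
```

The "first partition" variant is the same idea with min() instead of max().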
secret_utils & dbfs_utils_admin (admin only)*:
- problems:
  - storing credentials as plain text in Databricks notebooks is obviously dangerous and makes our data vulnerable to breaches (turns the data lake into a data leak :sad_pepe:)
  - we (tvlk) don't have a standardised way of storing credentials in Databricks
- solution: encrypt credentials with Fernet cryptography and store the encrypted creds in a secret file in S3. Notebooks then reference the path to the secret file instead of containing plain secret text.
- You only need two methods: create_secret(secret_text, path_to_secret) and get_secret(path_to_secret)
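Under the hood this is a Fernet round-trip. Here is a minimal sketch of that primitive, assuming the cryptography package; in the real utils the key and the encrypted blob live in files on S3, but here everything stays in memory for illustration.

```python
# Minimal Fernet round-trip, the primitive secret_utils builds on.
# In the real utils the key and ciphertext are stored in S3 files;
# in-memory here purely for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                  # kept safe by the admin
f = Fernet(key)

token = f.encrypt(b"my-database-password")   # roughly what create_secret stores
plain = f.decrypt(token)                     # roughly what get_secret returns
assert plain == b"my-database-password"
```

Fernet gives you authenticated symmetric encryption, so a tampered secret file fails loudly on decryption instead of silently yielding garbage.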
- Note: with secret paths we eliminate the temptation to put credentials as plain text in notebooks. I'm aware it may not be the perfect solution (alas, we don't even have one right now), so I'm open to discussions / suggestions.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/333186
df_utils & time_utils:
- problems: many operations are repeated across notebooks and can be abstracted into methods (DRY)
- solutions: created utils for extracting dataframe types, converting pandas dataframes to other types, getting the current time, etc.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327213
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/316425
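As a taste of the time helpers, here is a sketch of the kind of functions time_utils provides. The names below are hypothetical (the real API lives in the notebook); they just illustrate the "stop re-implementing datetime boilerplate" idea.

```python
# Hypothetical examples of the kind of helpers time_utils bundles.
from datetime import datetime, timedelta, timezone

def utc_now_str(fmt="%Y-%m-%d %H:%M:%S"):
    """Current UTC time as a formatted string."""
    return datetime.now(timezone.utc).strftime(fmt)

def date_range_strs(start, days, fmt="%Y-%m-%d"):
    """List of `days` consecutive date strings starting at `start`."""
    first = datetime.strptime(start, fmt)
    return [(first + timedelta(days=i)).strftime(fmt) for i in range(days)]

print(date_range_strs("2019-01-30", 3))
# -> ['2019-01-30', '2019-01-31', '2019-02-01']
```

Handy for building lists of daily partition suffixes without copy-pasting the same datetime arithmetic into every notebook.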
plot_utils*
- problems:
- matplotlib produces only static plots; we can't zoom or pan without changing our plotting code
- repeatedly importing plotting libraries and implementing the same plotting functions is tedious
- solution: developed automated plotting tools for time series data with Plotly, which supports interactivity (zooming and panning with the mouse). You now only need to call one method in your cell to plot a pandas df.
- notebook:
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327470
Currently supports time series plots only; will add more if there are requests.
TL;DR
I created some notebooks containing documentation and utilities to make programming in Databricks more efficient.
- data_source_docs: one-stop documentation for exploring data in Databricks
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/315759
- dbfs_utils: lets you search event directories & get first / latest partition paths
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/326780
Command: %run /Users/deka.akbar@traveloka.com/utils/dbfs_utils
- secret_utils (admin only): encrypting & decrypting credentials
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/333186
Command: %run /Users/deka.akbar@traveloka.com/utils/secret_utils
- dbfs_utils_admin (admin only): complements secret_utils
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/332078
Command: %run /Users/deka.akbar@traveloka.com/utils/dbfs_utils_admin
- df_utils: spark and pandas dataframe explorations
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327213
Command: %run /Users/deka.akbar@traveloka.com/utils/df_utils
- time_utils: common time functionalities
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/316425
Command: %run /Users/deka.akbar@traveloka.com/utils/time_utils
- plot_utils: plot interactive time series data (able to zoom & pan)
https://dbc-e60bee69-a52a.cloud.databricks.com/#notebook/327470
Command: %run /Users/deka.akbar@traveloka.com/utils/plot_utils