How to Build Highly Effective Data Science Workflows

Data science workflows typically look like this:

[Image: a complex data science workflow]
To implement this workflow, many data scientists write code that chains several functions together and runs them linearly. While quick to write, this approach has several problems:
  • it doesn't scale well as you add complexity
  • you have to manually track which functions were run with which parameters
  • you have to manually track where data is saved
  • it's difficult for others to read

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. d6tflow (https://github.com/d6t/d6tflow) is a free, open-source library that makes it easy to build highly effective data science workflows. See the GitHub example to learn more.

Get started
Does your data science code look like this? Don't do it! There's a better way, sketched after the snippet below.

import pandas as pd
import sklearn.svm, sklearn.metrics

def get_data():
    # download_data() and clean_data() are placeholder helpers
    data = download_data()
    data = clean_data(data)
    data.to_pickle('data.pkl')  # you have to remember where this file lives

def preprocess(data):
    data = apply_function(data)  # placeholder transformation
    return data

# flow parameters you have to track by hand
reload_source = True
do_preprocess = True

# run workflow linearly, top to bottom
if reload_source:
    get_data()

df_train = pd.read_pickle('data.pkl')
if do_preprocess:
    df_train = preprocess(df_train)
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:, :-1], df_train['y'])
print(sklearn.metrics.accuracy_score(df_train['y'], model.predict(df_train.iloc[:, :-1])))
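For contrast, here is a minimal sketch of the same pipeline written as d6tflow tasks with explicit dependencies, following the pattern from the d6tflow README. The iris dataset, the scaling step, and the task names are illustrative stand-ins for the placeholder helpers above, so treat this as a sketch rather than a drop-in replacement.

import d6tflow
import luigi
import pandas as pd
import sklearn.datasets, sklearn.preprocessing, sklearn.svm, sklearn.metrics

class TaskGetData(d6tflow.tasks.TaskPqPandas):  # output saved as parquet
    def run(self):
        # illustrative stand-in for download_data() + clean_data()
        iris = sklearn.datasets.load_iris()
        df_train = pd.DataFrame(iris.data, columns=['feature{}'.format(i) for i in range(4)])
        df_train['y'] = iris.target
        self.save(df_train)  # d6tflow tracks where this is saved

class TaskPreprocess(d6tflow.tasks.TaskCachePandas):  # output kept in memory
    do_preprocess = luigi.BoolParameter(default=True)  # flow parameter lives on the task

    def requires(self):
        return TaskGetData()  # explicit dependency

    def run(self):
        df_train = self.input().load()  # load upstream output
        if self.do_preprocess:
            df_train.iloc[:, :-1] = sklearn.preprocessing.scale(df_train.iloc[:, :-1])
        self.save(df_train)

class TaskTrain(d6tflow.tasks.TaskPickle):  # trained model saved as pickle
    do_preprocess = luigi.BoolParameter(default=True)

    def requires(self):
        return TaskPreprocess(do_preprocess=self.do_preprocess)

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:, :-1], df_train['y'])
        self.save(model)

# preview which tasks are complete and which need to run, then execute
d6tflow.preview(TaskTrain())
d6tflow.run(TaskTrain())

# load intermediate and final outputs without tracking file paths by hand
df_train = TaskPreprocess().output().load()
model = TaskTrain().output().load()
print(sklearn.metrics.accuracy_score(df_train['y'], model.predict(df_train.iloc[:, :-1])))

Because parameters such as do_preprocess are attached to the tasks, runs with different settings produce separately tracked outputs, and downstream tasks only rerun when their upstream dependencies or parameters change.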

Questions?

To learn more about the DataBolt tools and products that help you accelerate data science, check out www.databolt.tech.

To see other blog posts check out our archive at blog.databolt.tech.

For questions and feedback, email us at support@databolt.tech.

Copyright © 2019 www.databolt.tech, All rights reserved.

