Copy
DataBolt - Accelerate data engineering

New python libraries to automate tedious data engineering tasks


DataBolt has published two free open source libraries which automate common data munging tasks. More details below, to start using them see our github page.

Get it on Github

 

d6tjoin - Quickly Merge Data with Fuzzy Joins

Joining datasets is a common data engineering operation. However, often there are problems merging datasets from different sources because of mismatched identifiers, date conventions, misspellings etc and you need to manually clean the data before joining. With d6tjoin you can easily identify join problems and quickly join datasets even if they don't perfectly match.
 

import d6tjoin.top1 
d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'],exact_left_on=['date'],exact_right_on=['date']).merge().head(2)

# joined on the best match ids
        date        id   val1 id_right  val1_right   val2
0 2010-01-01  e3e70682  0.020   3e7068       0.020  0.034
1 2010-01-01  f728b4fa  0.806   728b4f       0.806  0.849


d6tstack - Quickly Load Any Type of CSVs or Excel Data
Vendors often send large datasets in multiple files but often there are missing and misaligned columns between files that have to be manually cleaned. With d6tstack you can easily stack them together into one dataframe and quickly fix problems with missing, added or renamed columns.

import glob
from d6tstack.stack import combine_csv
>>> c = combine_csv.CombinerCSV(glob.glob('*.csv'))

# quick check if all files have consistent columns
>>> c.is_all_equal()
False

# show which files are missing columns
>>> c.is_col_present()
   filename  cost  date profit profit2 sales
0  feb.csv  True  True   True   False  True
2  mar.csv  True  True   True    True  True

>>> c.combine_preview() # keep all columns
   filename  cost        date profit profit2 sales
0   jan.csv  -80  2011-01-01     20     NaN   100
0   mar.csv  -100  2011-03-01    200     400   300

>>> c.combine_preview(is_col_common=True) # keep common columns
   filename  cost        date profit sales
0   jan.csv  -80  2011-01-01     20   100
0   mar.csv  -100  2011-03-01    200   300
Get it on Github

Questions?

To learn more about the DataBolt tools and products that help you accelerate data engineering, check out www.databolt.tech

To see other blog posts check out our archive at blog.databolt.tech.

For questions and feedback email us at support@databolt.tech

Share
Tweet
Forward
Copyright © 2018 www.databolt.tech, All rights reserved.


Want to change how you receive these emails?
You can update your preferences or unsubscribe from this list.

Email Marketing Powered by Mailchimp