d6tjoin - Quickly Merge Data with Fuzzy Joins
Joining datasets is a common data engineering operation. However, often there are problems merging datasets from different sources because of mismatched identifiers, date conventions, misspellings etc and you need to manually clean the data before joining. With d6tjoin you can easily identify join problems and quickly join datasets even if they don't perfectly match.
import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'],exact_left_on=['date'],exact_right_on=['date']).merge().head(2)
# joined on the best match ids
date id val1 id_right val1_right val2
0 2010-01-01 e3e70682 0.020 3e7068 0.020 0.034
1 2010-01-01 f728b4fa 0.806 728b4f 0.806 0.849
d6tstack - Quickly Load Any Type of CSVs or Excel Data
Vendors often send large datasets in multiple files but often there are missing and misaligned columns between files that have to be manually cleaned. With d6tstack you can easily stack them together into one dataframe and quickly fix problems with missing, added or renamed columns.
import glob
from d6tstack.stack import combine_csv
>>> c = combine_csv.CombinerCSV(glob.glob('*.csv'))
# quick check if all files have consistent columns
>>> c.is_all_equal()
False
# show which files are missing columns
>>> c.is_col_present()
filename cost date profit profit2 sales
0 feb.csv True True True False True
2 mar.csv True True True True True
>>> c.combine_preview() # keep all columns
filename cost date profit profit2 sales
0 jan.csv -80 2011-01-01 20 NaN 100
0 mar.csv -100 2011-03-01 200 400 300
>>> c.combine_preview(is_col_common=True) # keep common columns
filename cost date profit sales
0 jan.csv -80 2011-01-01 20 100
0 mar.csv -100 2011-03-01 200 300
|