Accelerate Data Engineering

d6tjoin - Quickly Load Any Type of CSV or Excel Data

Vendors often send large datasets in multiple files. Often there are missing and misaligned columns between files that have to be manually cleaned. With DataBolt File Stack you can easily stack them together into one consistent dataset.

Features include:

  • Quickly check column consistency across multiple files
  • Fix added/missing columns
  • Fix renamed columns
  • Check Excel tabs for consistency across files
  • Quickly extract data from messy Excel files into clean CSV data
  • Out of core functionality to process large files
  • Export to pandas, CSV, SQL, parquet
See jupyter notebook

 

import d6tstack.combine_csv as d6tc
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True) # all_strings=True makes reading faster
col_preview = c.preview_columns()
print('all columns equal?', c.is_all_equal())
print('')
print('which columns are unique?', col_preview['columns_unique'])
print('')
print('which files have unique columns?')
print('')
print(c.is_col_present_unique())
all columns equal? False

which columns are unique? ['profit2']

which files have unique columns?

                                        profit2
filename                                       
test-data-input-csv-colmismatch-mar.csv    True
test-data-input-csv-colmismatch-feb.csv   False
test-data-input-csv-colmismatch-jan.csv   False
# quickly read all data

c.preview_combine() # keep all columns
  cost date filename profit profit2 sales
0 -100 2011-03-01 test-data-input-csv-colmismatch-mar.csv 200 400 300
1 -100 2011-03-02 test-data-input-csv-colmismatch-mar.csv 200 400 300
2 -100 2011-03-03 test-data-input-csv-colmismatch-mar.csv 200 400 300
0 -90 2011-02-01 test-data-input-csv-colmismatch-feb.csv 110 NaN 200
1 -90 2011-02-02 test-data-input-csv-colmismatch-feb.csv 110 NaN 200
2 -90 2011-02-03 test-data-input-csv-colmismatch-feb.csv 110 NaN 200
0 -80 2011-01-01 test-data-input-csv-colmismatch-jan.csv 20 NaN 100
1 -80 2011-01-02 test-data-input-csv-colmismatch-jan.csv 20 NaN 100
2 -80 2011-01-03 test-data-input-csv-colmismatch-jan.csv 20 NaN 100
 
c.preview_combine(is_col_common=True) # keep only common columns

Out[64]:
  cost date profit sales filename
0 -100 2011-03-01 200 300 test-data-input-csv-colmismatch-mar.csv
1 -100 2011-03-02 200 300 test-data-input-csv-colmismatch-mar.csv
2 -100 2011-03-03 200 300 test-data-input-csv-colmismatch-mar.csv
0 -90 2011-02-01 110 200 test-data-input-csv-colmismatch-feb.csv
1 -90 2011-02-02 110 200 test-data-input-csv-colmismatch-feb.csv
2 -90 2011-02-03 110 200 test-data-input-csv-colmismatch-feb.csv
0 -80 2011-01-01 20 100 test-data-input-csv-colmismatch-jan.csv
1 -80 2011-01-02 20 100 test-data-input-csv-colmismatch-jan.csv
2 -80 2011-01-03 20 100 test-data-input-csv-colmismatch-jan.csv
 
See jupyter notebook

Questions?

To learn more about the DataBolt tools and products that help you accelerate data engineering, check out www.databolt.tech

To see other blog posts check out our archive at blog.databolt.tech.

For questions and feedback email us at support@databolt.tech

Share
Tweet
Forward
Copyright © 2018 www.databolt.tech, All rights reserved.


Want to change how you receive these emails?
You can update your preferences or unsubscribe from this list.

Email Marketing Powered by Mailchimp