Accelerate Data Engineering

d6tjoin - Identify and analyze join problems

Joining datasets is a common data engineering operation. However, merging datasets from different sources often runs into problems because of mismatched identifiers, inconsistent date conventions, and similar issues.
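To make the date-convention problem concrete, here is a minimal sketch using only the Python standard library. The sample dates and the two formats are hypothetical and chosen purely for illustration; they are not taken from d6tjoin or its notebook. The same three dates fail to match as raw strings and match once both sides are parsed.

```python
from datetime import datetime

# Hypothetical keys: the same three dates written in two conventions.
dates_a = ["2018-01-31", "2018-02-28", "2018-03-31"]  # ISO, year first
dates_b = ["01/31/2018", "02/28/2018", "03/31/2018"]  # US, month first

# Compared as raw strings, no key appears on both sides.
raw_overlap = set(dates_a) & set(dates_b)

# Parsed into date objects, every key appears on both sides.
parsed_a = {datetime.strptime(d, "%Y-%m-%d").date() for d in dates_a}
parsed_b = {datetime.strptime(d, "%m/%d/%Y").date() for d in dates_b}
parsed_overlap = parsed_a & parsed_b

print(len(raw_overlap), len(parsed_overlap))
```

A join on the raw strings would silently return no rows, which is why it helps to check key overlap before joining.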

The d6tjoin.utils module lets you test join accuracy and quickly identify and analyze join problems.

Here are some examples which show you how to:

  • analyze join quality before attempting a join
  • detect and analyze a mismatch in string-based identifiers
  • detect and analyze a date mismatch
See jupyter notebook

 

import d6tjoin.utils

# df1 and df2 are assumed to be the two pandas DataFrames to join on 'id' and 'date'

# Use case: assert 100% join accuracy for data integrity checks
j = d6tjoin.utils.PreJoin([df1, df2], ['id', 'date'])
try:
    assert j.is_all_matched()  # fails
except AssertionError:
    print('assert fails!')

# Use case: detect and analyze id mismatch
j.stats_prejoin()
 
   key left  key right  all matched  inner  left  right  outer  unmatched total  unmatched left  unmatched right
0        id         id        False      0    10     10     20               20              10               10
1      date       date         True    366   366    366    366                0               0                0
2   __all__    __all__        False      0  3660   3660   7320             7320            3660             3660
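To read the table: for each join key, the counts compare the unique key values on each side. The sketch below shows one way such counts could be derived with plain Python sets. It is an illustration only, not d6tjoin's implementation; the helper `key_overlap` and the sample ids and dates are hypothetical, picked so the results line up with the figures above.

```python
from datetime import date, timedelta
from itertools import product

def key_overlap(left_keys, right_keys):
    """Compare the unique key values on each side of a join (hypothetical helper)."""
    left, right = set(left_keys), set(right_keys)
    inner = left & right   # keys present on both sides
    outer = left | right   # keys present on either side
    return {
        "all matched": left == right,
        "inner": len(inner),
        "left": len(left),
        "right": len(right),
        "outer": len(outer),
        "unmatched total": len(outer) - len(inner),
        "unmatched left": len(left - right),
        "unmatched right": len(right - left),
    }

# Hypothetical data: ids differ only by case, dates cover the same 366 days.
ids_left = [f"ID{i}" for i in range(10)]
ids_right = [f"id{i}" for i in range(10)]
dates = [date(2016, 1, 1) + timedelta(days=n) for n in range(366)]

stats_id = key_overlap(ids_left, ids_right)
stats_date = key_overlap(dates, dates)
# "__all__" treats the (id, date) combination as a single composite key.
stats_all = key_overlap(product(ids_left, dates), product(ids_right, dates))

for name, stats in [("id", stats_id), ("date", stats_date), ("__all__", stats_all)]:
    print(name, stats)
```

In this sketch the dates match completely, so the composite key fails only because of the ids. That is the same pattern the table shows: the id row points to where the join problem lies.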


 


Questions?

To learn more about the DataBolt tools and products that help you accelerate data engineering, check out www.databolt.tech.

To see other blog posts, check out our archive at blog.databolt.tech.

For questions and feedback, email us at support@databolt.tech

Copyright © 2018 www.databolt.tech, All rights reserved.