Comparing schemas between hive clusters

Posted in: Big Data, Technical Track

 

When running several different hive instances, we found the process of maintaining/confirming synchronization to be quite time consuming. While several tools made available by Cloudera and other frameworks do provide a visual interface for an individual instance of hive, there was no available tool for comparing across clusters. We came up with a useful way to compare hive schemas between clusters.

Our process contains 3 main steps:

  1. Fetch the schema from all hive environments, store the schema as a dictionary in a file.
  2. Create a class to perform comparisons leveraging python list, dict and pandas dataframe structures.
  3. Display the results/visualize the changes using a web interface (this approach uses python Flask).

 

Step 1- Fetch the Schema From all Environments
To fetch the schema we use the hive metastore api via the hive-thrift.py package.

First, we create a socket to connect to the hive thrift server.

transport = TSocket.TSocket(host, port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

global metastore_client
metastore_client = ThriftHiveMetastore.Client(protocol)
transport.open()

Once that is done, it’s straightforward to get all of the tables in a db (there are lots of other great methods  that have been documented).
db = 'myDatabase'
tables = metastore_client.get_all_tables(db)

Once we get the tables, we can call the get_fields() method
for table in tables:
for field in metastore_client.get_fields(db, table):
# field is a named tuple FieldSchema(comment, type, name)
print field

Using the methods above, we retrieve all of the values and store them in a dictionary, locally in a text file, one file per environment.
{'myDatabase': {
'tbl_customer':
[
FieldSchema(comment=None, type=int, name='user_id'), FieldSchema(comment=None, type='string', name='first_name'), FieldSchema(comment=None, type='string', name='last_name')
],
'tbl_orders':
[
FieldSchema(comment=None, type=int, name='order_id'), FieldSchema(comment=None, type='string', name='item_id'), FieldSchema(comment=None, type='string', name='price')
],

etc..

Note that thrift gives us a tuple for the ddl, so for ease, we declare this named tuple locally.
FieldSchema = collections.namedtuple(‘FieldSchema’, [‘comment’,’type’, ‘name’])

 

Step 2 – Create Class to Compare Dictionaries

Now that we have a separate schema file for each hive instance, we can load these files into dictionaries and begin to compare them.

f_stage_dict = open('schema_files/stage_schema_thrift.out')
f_prod_dict = open('schema_files/prod_schema_thrift.out')
d_stage = eval(f_stage_dict.read())
d_prod = eval(f_prod_dict.read())

Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice python tools (lists, set comparison, pandas dataframes, etc..). More detail in the repo.

Class DictCompare:
def get_fields_by_table(self, db, table):...
# returns list of fields for tableA and tableB
def get_databases(self):...
# union of all dbs in all schemas A,B
def compare_fields(self, db, table)...
# returns tablediff tuple showing all field differences
def compare_dbs(self, db)...
#returns list of differences between dbs
tables_a = [table for table in self.dict_a[db]] tables_b = [table for table in self.dict_b[db]] db_diffs_a = set(tables_a).difference(set(tables_b))
db_diffs_b = set(tables_b).difference(set(tables_a))
.....

Instantiate the class by passing the path of the schema files, and the friendly names used in the config.json file.
dc = sc.DictCompare(schema_files_path,”dev”, “stage” )
dbs = dc.get_databases()

# show all of the databases..
print "DBS=",dbs

# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")

 

Step 3 Display the Results / Visualize the Changes

I used a quick and simple python flask website to display the differences. Note there is a form on the page that allows users to select the hive instances to compare.

@app.route('/home.html')
@app.route('/home')
def homepage():
form=EnvForm(csrf_enabled=False)
env_a = form.env_a.data # 'stage', for example
env_b = form.env_b.data # 'prod', for example
dc = sc.DictCompare(schema_files_path,env_a, env_b )
dbs = dc.get_databases()
return render_template("home.html", dbs=dbs, a_name =dc.a_name, b_name = dc.b_name,form =form )

Source code (open sourced under Apache2 license):

I Look forward to any comments or feedback.

 

Discover more about our expertise in Big Data and Infrastructure technologies. 

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

David Salmela has 15+ years of experience in technology, consulting and team building, with a passion for products and solutions that enable data-driven decisions. By leveraging a combination of business and technical skills, he helps teams design and build systems and analytics products to optimize marketing processes and extract valuable information from their data. Through his work at Beats Music/Apple, David recently released an open source cross-cluster Hive Schema comparison tool. When he's not working, David enjoys spending time with his children, playing music with his friends and brutal never-ending tennis matches with his rivals.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *