Comparing schemas between hive clusters

Posted in: Big Data, Technical Track

 

When running several different hive instances, we found the process of maintaining/confirming synchronization to be quite time consuming. While several tools made available by Cloudera and other frameworks do provide a visual interface for an individual instance of hive, there was no available tool for comparing across clusters. We came up with a useful way to compare hive schemas between clusters.

Our process contains 3 main steps:

  1. Fetch the schema from all hive environments, store the schema as a dictionary in a file.
  2. Create a class to perform comparisons leveraging python list, dict and pandas dataframe structures.
  3. Display the results/visualize the changes using a web interface (this approach uses python Flask).

 

Step 1- Fetch the Schema From all Environments
To fetch the schema we use the hive metastore api via the hive-thrift.py package.

First, we create a socket to connect to the hive thrift server.

transport = TSocket.TSocket(host, port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

global metastore_client
metastore_client = ThriftHiveMetastore.Client(protocol)
transport.open()

Once that is done, it’s straightforward to get all of the tables in a db (there are lots of other great methods  that have been documented).
db = 'myDatabase'
tables = metastore_client.get_all_tables(db)

Once we get the tables, we can call the get_fields() method
for table in tables:
for field in metastore_client.get_fields(db, table):
# field is a named tuple FieldSchema(comment, type, name)
print field

Using the methods above, we retrieve all of the values and store them in a dictionary, locally in a text file, one file per environment.
{'myDatabase': {
'tbl_customer':
[
FieldSchema(comment=None, type=int, name='user_id'), FieldSchema(comment=None, type='string', name='first_name'), FieldSchema(comment=None, type='string', name='last_name')
],
'tbl_orders':
[
FieldSchema(comment=None, type=int, name='order_id'), FieldSchema(comment=None, type='string', name='item_id'), FieldSchema(comment=None, type='string', name='price')
],

etc..

Note that thrift gives us a tuple for the ddl, so for ease, we declare this named tuple locally.
FieldSchema = collections.namedtuple(‘FieldSchema’, [‘comment’,’type’, ‘name’])

 

Step 2 – Create Class to Compare Dictionaries

Now that we have a separate schema file for each hive instance, we can load these files into dictionaries and begin to compare them.

f_stage_dict = open('schema_files/stage_schema_thrift.out')
f_prod_dict = open('schema_files/prod_schema_thrift.out')
d_stage = eval(f_stage_dict.read())
d_prod = eval(f_prod_dict.read())

Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice python tools (lists, set comparison, pandas dataframes, etc..). More detail in the repo.

Class DictCompare:
def get_fields_by_table(self, db, table):...
# returns list of fields for tableA and tableB
def get_databases(self):...
# union of all dbs in all schemas A,B
def compare_fields(self, db, table)...
# returns tablediff tuple showing all field differences
def compare_dbs(self, db)...
#returns list of differences between dbs
tables_a = [table for table in self.dict_a[db]] tables_b = [table for table in self.dict_b[db]] db_diffs_a = set(tables_a).difference(set(tables_b))
db_diffs_b = set(tables_b).difference(set(tables_a))
.....

Instantiate the class by passing the path of the schema files, and the friendly names used in the config.json file.
dc = sc.DictCompare(schema_files_path,”dev”, “stage” )
dbs = dc.get_databases()

# show all of the databases..
print "DBS=",dbs

# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")

 

Step 3 Display the Results / Visualize the Changes

I used a quick and simple python flask website to display the differences. Note there is a form on the page that allows users to select the hive instances to compare.

@app.route('/home.html')
@app.route('/home')
def homepage():
form=EnvForm(csrf_enabled=False)
env_a = form.env_a.data # 'stage', for example
env_b = form.env_b.data # 'prod', for example
dc = sc.DictCompare(schema_files_path,env_a, env_b )
dbs = dc.get_databases()
return render_template("home.html", dbs=dbs, a_name =dc.a_name, b_name = dc.b_name,form =form )

Source code (open sourced under Apache2 license):

I Look forward to any comments or feedback.

 

Discover more about our expertise in Big Data and Infrastructure technologies. 

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *