When running several different hive instances, we found the process of maintaining/confirming synchronization to be quite time consuming. While several tools made available by Cloudera and other frameworks do provide a visual interface for an individual instance of hive, there was no available tool for comparing across clusters. We came up with a useful way to compare hive schemas between clusters.
Our process contains 3 main steps:
- Fetch the schema from all hive environments, store the schema as a dictionary in a file.
- Create a class to perform comparisons leveraging python list, dict and pandas dataframe structures.
- Display the results/visualize the changes using a web interface (this approach uses python Flask).
Step 1- Fetch the Schema From all Environments
To fetch the schema we use the hive metastore api via the hive-thrift.py package.
First, we create a socket to connect to the hive thrift server.
transport = TSocket.TSocket(host, port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
global metastore_client
metastore_client = ThriftHiveMetastore.Client(protocol)
transport.open()
Once that is done, it’s straightforward to get all of the tables in a db (there are lots of other great methods that have been documented).
db = 'myDatabase'
tables = metastore_client.get_all_tables(db)
Once we get the tables, we can call the get_fields() method
for table in tables:
for field in metastore_client.get_fields(db, table):
# field is a named tuple FieldSchema(comment, type, name)
print field
Using the methods above, we retrieve all of the values and store them in a dictionary, locally in a text file, one file per environment.
{'myDatabase': {
'tbl_customer':
[
FieldSchema(comment=None, type=int, name='user_id'), FieldSchema(comment=None, type='string', name='first_name'), FieldSchema(comment=None, type='string', name='last_name')
],
'tbl_orders':
[
FieldSchema(comment=None, type=int, name='order_id'), FieldSchema(comment=None, type='string', name='item_id'), FieldSchema(comment=None, type='string', name='price')
],
etc..
Note that thrift gives us a tuple for the ddl, so for ease, we declare this named tuple locally.
FieldSchema = collections.namedtuple(‘FieldSchema’, [‘comment’,’type’, ‘name’])
Step 2 – Create Class to Compare Dictionaries
Now that we have a separate schema file for each hive instance, we can load these files into dictionaries and begin to compare them.
f_stage_dict = open('schema_files/stage_schema_thrift.out')
f_prod_dict = open('schema_files/prod_schema_thrift.out')
d_stage = eval(f_stage_dict.read())
d_prod = eval(f_prod_dict.read())
Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice python tools (lists, set comparison, pandas dataframes, etc..). More detail in the repo.
Class DictCompare:
def get_fields_by_table(self, db, table):...
# returns list of fields for tableA and tableB
def get_databases(self):...
# union of all dbs in all schemas A,B
def compare_fields(self, db, table)...
# returns tablediff tuple showing all field differences
def compare_dbs(self, db)...
#returns list of differences between dbs
tables_a = [table for table in self.dict_a[db]]
tables_b = [table for table in self.dict_b[db]]
db_diffs_a = set(tables_a).difference(set(tables_b))
db_diffs_b = set(tables_b).difference(set(tables_a))
.....
Instantiate the class by passing the path of the schema files, and the friendly names used in the config.json file.
dc = sc.DictCompare(schema_files_path,”dev”, “stage” )
dbs = dc.get_databases()
# show all of the databases..
print "DBS=",dbs
# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")
Step 3 Display the Results / Visualize the Changes
I used a quick and simple python flask website to display the differences. Note there is a form on the page that allows users to select the hive instances to compare.
@app.route('/home.html')
@app.route('/home')
def homepage():
form=EnvForm(csrf_enabled=False)
env_a = form.env_a.data # 'stage', for example
env_b = form.env_b.data # 'prod', for example
dc = sc.DictCompare(schema_files_path,env_a, env_b )
dbs = dc.get_databases()
return render_template("home.html", dbs=dbs, a_name =dc.a_name, b_name = dc.b_name,form =form )
Source code (open sourced under Apache2 license):
I Look forward to any comments or feedback.
Discover more about our expertise in Big Data and Infrastructure technologies.
No comments