Why the Python data model could be for you

Posted in: Open Source, Technical Track

 

A great feature of Python is its simplicity. This simplicity allows developers to start coding immediately with very little formality. Python is also very good at doing ‘quick and dirty’ tasks. The result of this ability to get up and running with Python so quickly leads to many programmers using Python without necessitating a deeper understanding of the language. In many cases this is fine, but a deeper dive has some nice rewards.

Named Tuples
Named Tuples are a nice lightweight construct that makes things easier when dealing with data that might otherwise be stored in a regular tuple or dict.

In a recent project we encountered named tuples when returning table schema from Hive Thrift API:

# Define the fieldschema tuple
FieldSchema = collections.namedtuple('FieldSchema', ['comment','type', 'name'])


# Create some fields for a table schema
customer_table=[] customer_name = FieldSchema('customer name', 'varchar(100)','name')
customer_addr = FieldSchema('address', 'varchar(100)','addr1')
customer_dob = FieldSchema('birthdate ', 'int','dob')
last_updated = FieldSchema('date last updated', 'datetime','last_updated')


# create two customer tables that have slightly different schemas
customer_table_a=[customer_name, customer_addr, customer_dob] customer_table_b = [customer_name, customer_addr, customer_dob, last_updated]

The Python Data Model
In Python, everything is an object and each object has an identity, a type and a value. Built in methods like id(obj) returns the object’s identity and type(obj) returns the object’s type. The identity of an object can never change. Objects can be compared with the “is” keyword. Example: X is Y.

Special or “dunder” methods.
Special or ‘Magic’ method names are always written with leading and trailing double underscores (i.e., __getitem__). So when you use the syntax myobj[key] it is really calling the __getitem__ special method behind the scenes. So the interpreter calls obj.__getitem(key) in order to evaluate myobj[key]. Because of the leading and trailing double underscores, these methods are sometimes referred to as ‘dunder’ (double-underscore) methods. These ‘dunder’ methods are invoked by special syntax. For example using the index [] on a collections object invokes the __getitem__(key) special method. Example:

my_list = ['a','b','c'] print my_list[2] print my_list.__getitem__(2)
> c
> c

Generally, these ‘dunder’ methods are meant to be used by the interpreter, and not by you. But the advantage of knowing something about special methods is that it makes class design more fluid. Here’s an example where we override the __sub__ method (subtract) so we can compare dataframes by subtracting one from another:

import pandas as pd
class HiveTable:
def __init__(self,tabledef):
self.df = pd.DataFrame(tabledef)

def __sub__(self, other):
oj = other.df.merge(self.df, on=”name”, how=”outer”)
df_diff = oj[(pd.isnull(oj[“type_x”])) | (pd.isnull(oj[“type_y”]))] return df_diff

# Create Hive Tables
t = HiveTable([customer_name, customer_addr, customer_dob])
t2 = HiveTable([customer_name, customer_addr, customer_dob, last_updated])

# Compare them by ‘subracting’ them using the “-” operator.
t-t2

>>

comment_x
type_x
name
comment_y
type_y
3
date customer last updated
datetime
last_updated
NaN
NaN

 

Conclusion
While Python is easy to get started with, it is also ‘deep all the way down’, continues to be interesting and reward with a deeper understanding. By implementing special methods in your own classes, your objects can behave more like the built in ones, allowing for ‘pythonic’ coding for your own classes.

refs:
https://docs.python.org/2/reference/datamodel.html,
Fluent Python by Luciano Ramalho

 

 

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

David Salmela has 15+ years of experience in technology, consulting and team building, with a passion for products and solutions that enable data-driven decisions. By leveraging a combination of business and technical skills, he helps teams design and build systems and analytics products to optimize marketing processes and extract valuable information from their data. Through his work at Beats Music/Apple, David recently released an open source cross-cluster Hive Schema comparison tool. When he's not working, David enjoys spending time with his children, playing music with his friends and brutal never-ending tennis matches with his rivals.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *