Author: Danil Zburivsky

3 key data analytics announcements from Google NEXT 2019

Google NEXT 2019 was a momentous event for the Pythian team. We celebrated winning the Google Cloud Global Data Analytics Partner of the Year award, had a chance to meet with customers and Googlers, and had many conversations about the…

Reduce Costs by Adding a Data Lake to Your Cloud Data Warehouse

When it comes to data warehouse modernization, we’re big fans of moving to the cloud. The cloud brings unprecedented flexibility that allows you to easily accommodate the growing velocity, variety, volume, veracity and value of today’s data. It also allows…

The Natural Evolution of Data Warehousing—Where We Are Today

It’s a fact that technology is always evolving, and rapidly. What’s new and hot today may be old news and on its way to becoming obsolete tomorrow. Traditional data warehousing is no exception. We have been seeing that the old-school data…

OpenTSDB and Google Cloud Bigtable

Data comes in different shapes. One of these shapes is called a time series. A time series is basically a sequence of data points recorded over time. If, for example, you measure the height of the tide every hour for…

Calculating business days in HiveQL

One of the common tasks in data processing is to calculate the number of days between two given dates. You can easily achieve this by using the Hive DATEDIFF function. You can also get the weekday number by using this more obscure…

Using Ansible to Secure Cloudera Manager Installation on a Hadoop Cluster

Building a secure Hadoop cluster requires protecting a number of services that make up the Hadoop infrastructure. If you are using the CDH distribution, then Cloudera Manager (CM) is one of the components that needs to be secured. There is a good step-by-step guide in the CM documentation, and it’s easy to follow for one server, but what about when you have hundreds of them? There are different approaches to managing server configuration at scale, but I’d like to focus on Ansible, which is a neat framework for parallel command execution and complex rollouts.

HDFS Authentication Puzzle

The HDFS authentication model changed in recent releases, but the documentation is stale, which can lead people into thinking HDFS is using very primitive authentication.

Debugging IN vs OR Performance in MySQL

I was presented with test results showing that an IN query was about 100 times faster than an OR query. Where the OR query took minutes to run, the IN query took seconds! Okay, I said to myself, it is time to start digging. Here are my findings.

Collaborate 2012 as Seen by a MySQL DBA

I spent last week at Collaborate 2012 in Las Vegas, and it was a really great experience in many ways. I am a MySQL DBA and have been working with MySQL for most of my career, so Collaborate didn’t seem like an obvious choice. It turned out that I had so much to learn from Oracle professionals and the Oracle community that could be applied in the MySQL world. For me, an indication of a good conference is when you come back inspired and full of ideas.

Once again about innodb-concurrency-tickets

I had to refresh my knowledge of how the InnoDB thread queue works the other day when debugging activity spikes on one of a customer’s production systems. While I had a general idea about the InnoDB kernel and queue, thread concurrency, and queue join delays, I didn’t have a complete model of how InnoDB concurrency control works. So I started from the manual…
