Currently reading:  

Seeing with Fresh Eyes: Meaning, Space, Data, Truth by Edward Tufte


Recently read, with ratings: 

Five Dysfunctions of a Team by Patrick Lencioni
by Kenneth Field. Half textbook, half art collection, so it can’t be rated but should be emulated.

****Death in Mud Lick, by Eric Eyre. The opioid tsunami, as told by the Pulitzer-winning WV journalist.

*****Farm and other F Words: The Rise and Fall of the Small Family Farm, by Sarah K Mock.  What she said.

****A Woman of No Importance (the Virginia Hall bio), by Sonia Purnell

*****The Ends of the World: … Our Quest to Understand Earth’s Past Mass Extinctions, by Peter Brannen.
  (if you don’t know the slow carbon cycle, or the Deccan traps, or the way things fail, read this)

*****Midnight in Chernobyl by Adam Higginbotham. HBO just scratched the surface.

****The Code Breaker (Jennifer Doudna, CRISPR) by Walter Isaacson.

*****March (Graphic biography, books 1,2&3) by John Lewis et al. This happened.

****Vesper Flights by Helen Macdonald

*****Solutions and Other Problems by Allie Brosh. Hyperbole and two halves.

*****Frederick Douglass: Prophet of Freedom by David Blight. 

*****Obama: An Intimate Portrait by Pete Souza. 

*****Animal Anatomy for Artists by Eliot Goldfinger. How to show what’s underneath.

*****The Anarchy: The East India Company, Corporate Violence, and the Pillage of an Empire by W Dalrymple

***The Dutch House by Ann Patchett.

****Radical Candor by Kim Scott

****Say Nothing by Patrick Radden Keefe. Life in the Troubles. 

***The Hidden Wound by Wendell Berry. Starts out right, but doesn’t question his own bias.

***Monarchs and Milkweed by Anurag Agrawal

*****The Life Project by Helen Pearson. For those who like epidemiology, study studies.

**The Art of Loading Brush by Wendell Berry. The first piece is great though.

*****The Emperor of All Maladies by Siddhartha Mukherjee. Possibly the best science writer out there.

Fire and Faith by Rick Maier. I can’t rate this; I know many of the people in it.

*****My Own Words by Ruth Bader Ginsburg. A reminder of where and who we were, are, and should be.

*****Palaces for the People: How Social Infrastructure Can Help Fight Inequality, Polarization,

  and the Decline of Civic Life by Eric Klinenberg. I keep recommending this book to others.

*****I’m Still Here: Black Dignity in a World Made for Whiteness by Austin Channing Brown

***Chesapeake Requiem: A Year with the Watermen of Vanishing Tangier Island by Earl Swift

*****The Field of Blood: Violence in Congress and the Road to Civil War by Joanne B. Freeman

*Salt: A World History by Mark Kurlansky. Needed a better editor.

****Why Religion? by Elaine Pagels

****On Looking by Alexandra Horowitz

******The Gene: An Intimate History by Siddhartha Mukherjee

****Direct Stone Sculpture by Milt Liebson

***Sculpting in Stone by John Valentine

*****The Totally Unscientific Study of the Search for Human Happiness by Paula Poundstone 

*****Leonardo da Vinci by Walter Isaacson

*****Loving and Leaving a Church by Barbara Melosh

****Vacationland by John Hodgman

****The New Analog by Damon Krukowski

*****Evicted by Matthew Desmond

everyone’s a aliebn when ur a aliebn too by jomny sun; unrated due to style

***The curious incident of the dog in the night-time by Mark Haddon

***The Arm by Jeff Passan

****The Glass Universe by Dava Sobel

***Born a Crime by Trevor Noah

****Gettysburg: Turning Point of the Civil War by Knauer et al 

*****The Boys in the Boat by Daniel James Brown

****The Martian by Andy Weir

*****The Wright Brothers by David McCullough

***The Map Thief by Michael Blanding

Implementing Domain Driven Design by Vaughn Vernon.  Putting Eric Evans’ work into practice.

*****Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman & Gwen Shapira

*****Big Data by Nathan Marz & James Warren. The first book to really spell out the Lambda architecture.

Domain Driven Design by Eric Evans

ORACLE 12c DBA Handbook by Bob Bryla (w/ Kevin Loney)

***Unbroken by Laura Hillenbrand

Impala SQL Command Reference

Domain-Driven Design Reference: Definition and Pattern Summaries 

***Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, & Matei Zaharia

***Analytics in a Big Data World by Bart Baesens

The Art of War by Sun Tzu

*****Cloudera Administration Handbook by Rohit Menon

**Scaling Lean and Agile Development by Larman & Vodde

***The Innovators by Walter Isaacson

*****The Art of Action by Stephen Bungay

****Command and Control by Eric Schlosser

***Data Science for Business by Provost & Fawcett

*****The Goal by Eliyahu Goldratt

It’s an unusual feeling to wrap up an article on credit score mobility and then find the current

issue of the Atlantic profiles Dr. Raj Chetty, who studies a more expansive view of economic

opportunity inheritance. If I’m asking the same questions as a MacArthur winner, that’s a good

path to be on. 

We are fortunate that authors and leaders such as Austin Channing Brown are willing to reveal

as much of their personal story as they do so that those who create, manage, maintain, and

profit from the divisions in this society might have a proper mirror in which to see their reflection

and maybe start to understand another’s life experience in the same society. To pretend society is

equal for all is to say you haven’t looked. In her book she barely mentions redlining; in an article

I’m wrapping up I point out that that practice took taxes from minorities, denied them the services

they were entitled to, and then used their money to fund services for everybody else. Don’t think 

they didn’t notice that. She does not use that as a dividing example, instead relying on her wealth

of direct experience. There are much more doctrinal reads out there on white fragility and on systemic

racism; Ms Brown walks through what living in it looks like. You’ll read it in a day.

Earl Swift’s book on Tangier Island is a mixed bag. If the names Magothy, Severn, Pocomoke, Watts,

and Tangier mean something to you, if you can tell a jimmy from a she-crab, then you should get caught

up on the impact of climate change and post-glacial mantle movement that is turning Tangier Island

into a former island. And that is already wiping out the grasses used by the blue crabs as part of their 

regular life cycles. Swift captures those issues and the impact of past attempts on the continued survival

of this habitat.

Swift’s book falters where he deals with the people of the town, as he develops a fondness and empathy

for them that alters how their interactions are reported. For example, if you’ve ever read a project status

report that was Green for 3 months and then suddenly the project went Red because of bad news that wasn’t

previously shared, you’ll see a parallel here - it’s not until the last quarter of the book that the downsides of

the people are really shown, and even then they are glossed over. The people who love Israel to a fault

but can’t even openly communicate with their Jewish neighbors about the impact of religious symbols on

town property, for example. The dry island with a drug problem. The presentation of a lack of basic 

scientific education as acceptable folksy charm among leaders. The open caustic judgments about other 

peoples’ faith, and the blessed assurance of their own.

Swift could have used those occasions as starting points for discussions about the need for diversity -

in the blood lines there, in schools of thought, in ethnicity, in any of the many ways in which humans 

differ and flourish. Instead, the islanders diversify their means of catching crabs and grow more resolute

in the approaches that got them here in the first place. Swift lets them off the hook, and paints a 

picture of Tangier that never cures its ills because it never acknowledges them.  

Joanne Freeman’s work on the pre-Civil War Congress is remarkable in several directions. As described

in its appendix, it’s original research into behavior that was never highly publicized, so she had to use 

multiple sources to determine and verify what happened.

As history, it puts the Civil War debates in greater context. Those debates were not new discussions, and

men like John Quincy Adams had been fighting gag rules and other mechanisms designed solely to 

perpetuate slavery and the slaveocracy for decades.

The extent of the self-defense of the slave owners is stunning. For example, at one point they petition 

the federal government for financial reimbursement to cover their lost property value if a slave escapes

while pregnant, asking the government to compensate them for the value of the unborn slave they own.

You cannot argue with logic like that, you can only denounce it. And when it is denounced, they physically

attack the Northern congressmen in methods including duels, knife attacks, assaults in the street, and 

the caning of Sumner in the Congress itself. 

As a book, Freeman’s work is a great example of how a great writer and editor can publish a work with 

hundreds of end notes per chapter yet the text flows. Remarkable writing. 

Elaine Pagels’ book and Alexandra Horowitz’s both try to merge the personal autobiographic experience

with scholarly discovery. The results can be a bit uneven, as a new understanding is given a heightened

emotional response, or the discovery of a theory long-held in its field is treated as a revelation that

was somehow kept a secret to the writer; but such is the way of an accurate autobiography as we treat the things

we discover as discoveries for the world, things we did not know as those that were deliberately hidden,

and things that change our perceptions as life-changing. And, well, changing perceptions and enhancing

the ability to perceive is personally life-changing, every day.

For a non-book-related demonstration of this, search on YouTube for videos of colorblind people wearing

EnChroma glasses for the first time and seeing color. 

Given the value of a third-person perspective in biography, it would seem the most balanced form of

biography may be written by a contemporary while the subject is alive (see Isaacson’s Steve Jobs, e.g.).

Your credit score is a factor in so many aspects of life these days. Two quick observations - 

1 - there are 5 differently-weighted components to the score. Most people could confidently name maybe 

two of them, and their relative weights would be a guess.

2 - Past analysis (2008) published by Experian showed that length of credit history was overemphasized 

in the score at that time, to the benefit of older persons and the detriment of those who are new to 

the American economy (young persons and recent immigrant professionals).

Mukherjee’s The Gene should be mandatory reading for everyone with a genome. So much of what we

think of as binary, straightforward, simple logic fails even the most basic of tests when applied to our own

genetic history. Start at what we all know: women are XX, men are XY - except when that’s not the case.

The mechanics of genetic behavior are known for parts we have studied extensively that are tightly tied

to specific diseases or symptoms; but the inter-gene dependencies are vast, and their exact relationships

may take decades to understand. It’s a fascinating read that keeps challenging you.

From studies of mitochondrial DNA, all women on the planet can be shown to have descended from a single

member of a group of humans who lived in southern Africa about 180,000 years ago. She has been dubbed

Mitochondrial Eve. She is not the first human, but she is our most recent common matrilineal ancestor for everyone 

you have ever met. She has a Wikipedia page and everything.

So, two quick notes from that:

1 - As Dobzhansky noted 80 years ago, every current group we call a race is the result of a mixture of 

prior races.

2 - If you can read this, you share a common ancestor with everyone else who can read this. 


Nine years ago I donated stem cells for a cancer patient who needed a bone marrow donor. We were listed

as unrelated donors. Really, we were just anonymous, as at some level we had a set of common ancestors.

Liebson’s Direct Stone Sculpture shows its age (some washed-out photos, recommendations for products that

aren’t sold any more) and his overview/history parts should have been either greatly expanded or cut. But 

through it all you constantly see the impact of a teacher who has actually done the work he is talking about,

and who has had students try things that should not be repeated. And he wants the reader to not make the 

same mistakes, so both positive and negative guidance is shared, along with the reasoning. The reader 

then knows what to do and how to make decisions about it without replicating the mistakes of others. The 

inclusion of this guidance helps further support the lessons he gives, an important characteristic for any teaching.

Just by virtue of its inclusion, Paula Poundstone’s book should get the highest rating possible. From an analytic

perspective, she does manage to illustrate the conundrum for many idiopathic studies: when there is only 

one subject in the study, there is no control to compare to; and unless that subject is fully isolated it is 

very difficult to assign causation clearly. For similar examples, look into the attempts to diagnose and

treat migraines. Each subject has to be separately evaluated, and the mixture of environmental and 

pharmaceutical stimuli will change person by person.

If all we had of Leonardo da Vinci’s work was his anatomical drawings and insights, those alone 

would put him on a pedestal among the greatest scientists and observers of nature in history.

Krukowski’s The New Analog shines a light on the things we’ve lost when all music is presented as 

contemporary: the history of the piece, its influences, its contributors, its place within a timeline of

effort from a writer or performer or composer. A sample music listing eliminates everything that was part

of the liner notes along with all of the performers and contributors.

Krukowski uses an example from Pandora to show this, but I’d use Leon Fleisher. Fleisher was a world class

pianist when he lost the use of his right hand. He then spent decades mastering the works written solely

for the left hand; becoming a conductor; learning how to teach effectively. And now as he turns 90, the 

results of botox and rolfing have returned control to both hands and he is once again playing two-handed

concerts. It’s a remarkable story. But if you are presented with just an iTunes library-style list of his works

then it appears those works are all contemporaries of each other and you have lost the narrative that his

own efforts tell about his life. The data is there, it’s just been left out of the interface as we have dumbed down

the data presented to the customer and eliminated the backstory that would take time to read.

To echo Stephen Jay Gould, evolution implies change - it does not always imply progress.

Desmond’s Evicted will challenge pretty much every preconceived notion you may have about the housing

situation in US cities. I think he could also have made the case that a number of elements

of the financial systems that are supposed to be trailing indicators, such as financial scoring, have morphed over the years

into being driving factors. Those who rely on such scores should evaluate their effectiveness in their actual 

usage - much as the intelligence tests of the mid-1900s were re-evaluated after they were applied to whole populations

rather than just individuals. As Evicted points out, eviction causes poverty (and many other conditions) rather

than being the rarely used last resort when dealing with tenants. It’s not a surprise this book won the Pulitzer.

The output of the women who gathered so much of the early astronomical observations in the US (as described

in The Glass Universe) is being digitized.




The problem with being in the data field is that you find yourself analyzing the data that was available to characters

in books and movies when they make decisions. The frustrating thing isn’t when that data is incomplete - that is

just real life. It’s when people choose to make decisions off data sets they know are incomplete, and they gamble

with their lives or other peoples’ livelihoods as a result.


When viewed from far enough away, your behaviors form a repeated pattern. And those patterns can be compared to other 

patterns out there and judged to be on the right track financially, for example. That same logic applies to other 

components of life, and we inherently admit that when we watch a character on the screen take a step we all know is wrong.

You don’t need mountains of big data to see that. You just need to be looking for it.

In reading a detailed history of Gettysburg it’s difficult not to look at the decision making process and the lack of data 

the commanders were working with. And their costs weren’t misspent capital dollars or minutes of downtime, they were measured

in the thousands of lives.

I have maintained for a long time that ‘decision support systems’ are usually mis-named; they are usually treated as decision

justification systems. This is a fault both of the business users and of the imagination of the technologists who design and develop

the systems. We should be looking at enabling potential use cases flexibly (empowered by schema on read capabilities) rather 

than being tied in to exact requirements someone provided a year or two earlier.
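As a rough illustration of what schema on read buys you, here is a minimal Python sketch. It assumes raw events are kept as JSON lines and that each question brings its own schema at query time; the sample records, field names, and the read_with_schema helper are all hypothetical, made up for this example.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": "5", "region": "east"}',
    '{"user": "c3"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto a schema chosen at read time.

    `schema` maps field name -> (converter, default). Fields missing from a
    record get the default; fields not named in the schema are ignored.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


# Two different questions, two different schemas, same stored data.
revenue_schema = {"user": (str, None), "amount": (float, 0.0)}
channel_schema = {"user": (str, None), "channel": (str, "unknown")}

revenue_rows = read_with_schema(RAW_EVENTS, revenue_schema)
channel_rows = read_with_schema(RAW_EVENTS, channel_schema)
total_revenue = sum(row["amount"] for row in revenue_rows)
```

The second question did not exist when the data was written, and nothing had to be reloaded to ask it.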


The users are at fault too. Consider The Big Short. As shown in the movie, Christian Bale’s character went to Goldman Sachs to

bet against the mortgage bond market. The analysts he met with there did not check any data; they simply looked at each other 

and said with assurance that no one stops paying their mortgage. Bale’s character already knew that was wrong, because he had

looked at the data and had seen the default rates. Surely there was a decision support system somewhere at GS. It’s not a matter

of the data being batch or realtime; it’s a matter of the data being relevant and valuable. Data doesn’t support decisions, 

it drives them. It initiates them, when used properly. The task of the architect is to implement a system that enables that usage.


Much of my recent reading has been about blockchain and bitcoin, which are not one and the same. For those who are fans of

architectures featuring immutable records and distributed authorities, blockchain technologies offer some interesting solutions.

The difficulties when applying them to financial transactions actually come from the contract side, since sales and purchases are 

contracts (even though you don’t usually read all those iTunes terms and conditions). The lack of a central governing 

authority empowers blockchain architectures - the more miners the better - but it makes the handling of any disputes 

regarding those contracts difficult. That is, it works during happy path transactions; my concern is for the 

imperfect transactions, the fraudulent race conditions, the NSF purchase attempts.  Investigations continue. 


For Hadoop architects, imagine a Kappa-like architecture, feeding a shared and encrypted global ledger that has a periodic 

checkpoint to allow for settlement assurance. Each block’s hash is chained to the data from the prior block. There are

lots of uses for this outside of an invented currency. The miners, after all, have to effectively communicate the status of 

their hashed blocks across the entire network while maintaining the integrity of the chain. That’s a highly complex bit 

of replication and communication to do at speed.
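To make the chaining idea concrete, here is a toy Python sketch. It assumes SHA-256 and a plain in-memory list, and it leaves out mining, networking, and encryption entirely; the block layout and function names are hypothetical, not those of any real blockchain implementation. Each block's hash covers the previous block's hash, which in turn covers that block's data.

```python
import hashlib
import json


def block_hash(index, data, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a new block whose hash covers the prior block's hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    block = {
        "index": index,
        "data": data,
        "prev_hash": prev_hash,
        "hash": block_hash(index, data, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check each link back to its predecessor."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else "0" * 64
        if block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != recomputed:
            return False
    return True


ledger = []
append_block(ledger, "alice pays bob 5")
append_block(ledger, "bob pays carol 2")
```

Changing the data in an earlier block makes is_valid(ledger) return False, which is the immutability property in miniature. The hard part described above, getting many miners to agree on the same chain at speed, is not shown here.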


The Wright brothers, as recorded in McCullough’s book, faced a quandary during their tests in Kitty Hawk - 

they had modified their original flyer based on the work of their predecessors, and their new flyer was worse. They finally

had to admit that the leaders they admired and followed did not have all the answers, and they set out to find the real answers.

So they built the first wind tunnel. They tested wing shapes. They tested controllers. And they invented controlled flight.


For all their genius and perseverance, it was their insistence on demanding reality that led them to innovation and success -

not a pursuit of riches but a pursuit of the truth.

Having developed an application that ingests Twitter feeds via Flume into a laptop sandbox HDFS implementation, a 

few quick observations - 

First, not only are the common examples showing how to do this wrong - they have the wrong library for TwitterSource - 

but the default Flume configuration file provided with CDH5.5 Express repeats that same errant call. It’s an out of date call.

Second, no example out there references the useLocalTimestamp option within Flume, which is key to getting rid of those errors

that will show up in your logs.

And third, the volume of tweets that come through is pretty astounding. The number of them containing actual valuable 

business information is left as an exercise for the reader.
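On the second point, here is a minimal sketch of the HDFS sink portion of a Flume agent configuration, assuming the sink writes to a time-bucketed path. The agent, sink, and channel names and the HDFS path are hypothetical. In the Flume documentation the property is spelled hdfs.useLocalTimeStamp. It tells the sink to use the agent's local clock rather than a timestamp header when expanding escapes like %Y/%m/%d, and events arriving without that header are a likely source of the log errors mentioned above.

```properties
# Hypothetical agent, sink, and channel names; only the last property is the point.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
# Time escapes such as %Y/%m/%d in the path need a timestamp for every event.
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d
# Use the agent's clock instead of requiring a timestamp header on each event.
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
```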

In one of my many roles, I play the part of the Facebook Group administrator for a nonprofit. As people request access

to the group, I approve or reject them. At first, approvals outnumbered rejects 10:1. That ratio is now reversed, with most

requests being rejected. Why? 


At first, the rejects were people who were looking to advertise on the group board (MLM schemes, etc). They were easy

to spot, since a quick visit to their home page would show that they belonged to many similar groups. But the behavior 

these days is of a different sort.


It’s fraud. Or rather, it’s the attempt to set up an account that will look real at a later date. An account that will exist for a given

period of time, belong to groups, post here and there, and then one day do something that requires a fraud check to see if it 

is a real person.

What does a real person ‘look’ like online? When they apply to my group, I can see they aren’t real on Facebook - the photos 

don’t make sense, or the name is a guy’s name but the profile photos are women, or the locations don’t match, and the group 

memberships as a set are bizarre. The developers are hoping that their bots look human enough to survive and develop 

a history. Then, they can apply for credit in a country that does not have a centralized credit bureau, and which uses social 

media scanning to do fraud profiling. And they’re hoping that that scoring is not sophisticated enough to look at the combination

of life choices they’ve put together and question it - you live in CA but go to church in VA, you’re a woman who buys moustache

wax, you’re a practicing Hindu who follows Arby’s on Twitter…


Those fraud checks will be happening in the future. The disruptions to them are happening now, courtesy of LinkedIn group 

admins and Facebook admins who don’t accept every invitation request. It’s a matter of disrupting the signal in a machine learning

pattern that will be executed by a fraud analytics engine, and helping the bots stand out as bots because of their lack of standard

behavior. Humans have a tendency to see patterns - sometimes even where they do not exist - and we are adept at being 

bothered by the disruptions.

The weather forecast for the most recent storm called for this region to get 5 to 7 inches of rain. It got 1. Predictive analytics 

for weather behavior is frustratingly poor if being wrong by that much - the difference between a normal rain and a tropical 

storm basically - is acceptable. It’s interesting to me that in the last two major weather events here with divergent models,

the one labeled the “European model” has been the most accurate at the start; the others tend to over-forecast calamitous landfalls.

An article in The Atlantic focuses on the topic of introverts and how educational methods have changed to focus 

on group-based learning methods:

As this applies to personnel development in general, it’s worth a read. There is certainly value in collaborative exercises and 

collaborative learning; much of what I do involves group efforts. However, those group efforts presume that each individual in the 

group has developed certain skills and brings to the group a set of attributes that is often best developed independently. We can 

help each other along as we learn individually, but we do learn independently. It also presumes that all types of learning can be

learned collectively, and I think that is the greater error.


Let me offer an example from my first marathon - when training for it, I ran with others who were supporting the same charity. We 

each had to develop ourselves individually. We supported each other, but ultimately each person had to develop the stamina and 

the mental fortitude necessary to push through the challenges involved. You had to train your own body and brain for that.

Children learning in an academic environment need to learn similar lessons - that they can push through difficult problems, and how

to apply the consistent focus and effort required to do so. How to debug problems. How to design. How to envision a solution and 

drive it to creation. How to write (a fellow writer once told me that he’d been told that writing is the practice of applying the seat of 

the pants to the seat of the chair). Each of these is an individual effort, a personal developmental goal. And the kids need to learn that

they can do this by themselves. They have that capacity. And every group the individual is a member of then benefits from 

this individual development.


Not only are some skills best learned individually, some can only be learned individually. We cannot learn object oriented 

programming thought processes as a group; you have to each understand it. Or Lean methodologies, or domain driven design

implementation. We have to rely on each group member understanding what is going on or it is a fragile house of cards. We can 

use some group teaching methods and exercises, but anything that involves a thought process shifting involves a person.


Applying this to the article, we do a disservice to introverts in general if we force them into highly stressful situations in which 

their learnings are impacted by the environmental conditions. If the goal of the educational process is to develop children in 

the first place, what is the goal of forcing individual learning programs into a group-based approach?  Such a transformation 

undermines or eliminates opportunities for individual growth (there are no group-based GPAs or scores on the SAT, btw) when that may 

be the most appropriate educational method while also creating a more stressful environment for the very children we are 

supposed to be raising up.


Hadoop Application Architectures is listed as having four authors, but you won’t know it; the editing is tight enough that 

only a few spots stand out as a shift in writing style.  That’s unusual in a multi-author book unless one of the authors takes on the role

of being the primary author and filters all the material, which creates a serialization in the writing process.


In terms of content, the authors (Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira) provide decent overviews

of the major tools in the Hadoop ecosystem.  You’ll also find the key criteria that will help when making decisions among architecture

components and design patterns. For example, what sort of file compression algorithm should you use and why? What tools are best 

for micro batch processing and what are best for streaming needs?


The book closes with three extended examples that are excellent choices because they are so relevant and because they show the 

gamut of technologies and decisions involved. These examples are clickstream analysis, fraud detection, and data warehousing, and 

they’re not just tacked on chapters at the end - they’re 90 pages in total, and they build on the chapters about modeling, and data 

processing, and the key tools involved. These chapters really show how to deliver the value from your Hadoop investment and from 

the book. Which is to say, they are outcomes-based chapters, a wonderful thing to find in a book.  As such, they provide technical 

patterns to follow and a writing pattern to follow.


In reading the book from cover to cover I think I came across about 30 references to other books, including a few that are just now 

coming out (including the Marz & Warren Big Data book reviewed below). As such, this book serves as a hub to those other works, 

and it is a very strong starting point for any Hadoop developer.  Have every developer on your staff read it. Twice.

5 stars.


Marz & Warren’s Big Data is mis-titled; it’s also the best-written technical book I’ve read all year and the definitive work on the Lambda architecture. For those looking to use Hadoop to rethink their data processing and enable significant business transformations, Lambda and Kappa approaches offer strong possibilities. Don’t just use Hadoop to replace components in an old architecture; that works, but it won’t solve your core issues, and Marz and Warren explain the issues and solutions clearly. It’s not a perfect book, but it’s my second 5 star read of the year.


The flaws of the book come from its implementation chapters in which it favors the use of nonstandard frameworks 

such as Pail. Its strengths are the theory chapters and the overall editing and structure. Readers new to big data can 

find their way through it. Readers who have experience with Hadoop will understand the architecture - how they should 

have done things the first time if they had known better. Readers who come from a RDBMS background will be able to

draw parallels with core concepts such as MVIEWS and internal triggers, but they will also learn that they have a long learning 

road ahead of them.
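For readers who have not met the Lambda architecture, here is a toy Python sketch of its core idea, assuming a simple page-view counter. The variable and function names are hypothetical, and this is not how the book implements it: a real system runs the batch and speed layers on separate distributed tools, and the batch run takes long enough that the speed layer only discards the events the finished batch covered.

```python
from collections import Counter

# Immutable master dataset: events are only ever appended.
master_dataset = []
# Events that arrived after the last batch run (the speed layer's input).
recent_events = []
# Precomputed batch view: page -> view count.
batch_view = Counter()


def record_event(page):
    """Append a page-view event; the speed layer sees it immediately."""
    master_dataset.append(page)
    recent_events.append(page)


def run_batch():
    """Recompute the batch view from the full master dataset, then reset the speed layer."""
    global batch_view
    batch_view = Counter(master_dataset)
    recent_events.clear()


def query(page):
    """Merge the batch view with a count of events seen since the last batch run."""
    return batch_view[page] + Counter(recent_events)[page]


record_event("home")
record_event("home")
run_batch()
record_event("home")
count_before = query("home")  # two from the batch view, one from the speed layer
run_batch()
count_after = query("home")   # all three now come from the batch view
```

The batch view is always recomputed from the raw events, so a bug in the counting logic can be fixed and the view rebuilt, while queries stay current between batch runs.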





One interesting aspect of big data systems is their ability to support eventual accuracy; provide answers that are close to the

correct answer now while calculating the correct answer in the background and catching up by the time it makes a difference.

(You use these now - Google tells you your search will return about 4,000 results but by the time you get to the end 

of the list it was really 3,665.) Approaches like these allow us to quickly bypass cognitive biases in decision making. For a

quick rundown of the most common cognitive biases, see:



In late 2000 I was at Oracle Open World in San Francisco, and I went to dinner with Earl Shaffer. Earl and I had complained

for years that the Delaware Valley Oracle Users Group never met near Delaware; at this meeting over pizza, the two of us created 

the user group that would meet in Wilmington. Earl provided the drive, I had the contacts and together we had a plan.

Earl wrote the bylaws for the group.


Pete Silva and David Alpern and Dan White and others stepped in to be early leaders and contributors and on September 10, 2001, 

the First State Oracle Users Group debuted with a meeting of 60 people in a hotel ballroom.

Rather than assign himself a high board position, Earl was an At Large member of the board, able thereby to direct things and make sure

things were done right. For four years we worked on the board together before turning over the FSOUG to the next generation of 

leadership. We later worked together in the workplace as well.


This past Sunday, Earl passed away. For all the lessons he taught people at Oracle conferences or during consulting engagements

I think the one that stuck with me was his persistence in doing what he knew was ultimately the right thing to do. The FSOUG 

never directly benefitted him I think - but he knew it would benefit other people, and he was right, and he made it happen. 

There have been a number of interesting analytics news articles lately.  Among them:

1 - First, a recent study showed that only about a fifth of the 5000 hedge funds studied exhibited real “active” management that differed

from what you could get for cheaper from an index fund.  Further, from 1995-2009 the hedge funds’ returns were below those of 

the non-actively managed funds. The thing about big data is that you can verify your assumptions instead of taking behavior

for granted.

2 - After the release of the Ashley Madison data, behavioral analysis of the data reportedly revealed that 

far below 1% of the paid members there who self-identified as female ever did something non-bots do, such as check their inbox

and manually type in a reply. In this case, that verified the assumptions and the stories of former workers and bloggers.



No, I have not been drafted by the Golden State Warriors. That was Kevon Looney.


Amazon thinks I would be interested in buying the Oracle 12c DBA Handbook :) . Although my name is not on its cover,

alert readers will find the sections that remain from the earlier editions I wrote. The overall structure remains the same,

and brings to mind the arguments with editors back in 1994 over the roles of DBAs and the type of topics they would

want to see in the table of contents. As an unnamed coauthor, I won’t be reviewing this book; and even though my

name is not on the cover it’s my 20th book. Osborne McGraw-Hill sent me a box of them, so I won’t be buying them from Amazon.


When I first started writing the books, the effort was purely an exercise in professional altruism. They sold far more than I ever 

thought they would; I was mostly trying to get the implementation patterns out there because the chat groups were filled 

with people making the same basic mistakes over and over. The book form just allowed the articles I was writing to reach a 

broader audience (I had been writing for Oracle Magazine since 1990, preceding Ask Tom). This was before Oracle 

certification tests (either from Chauncey or from Oracle). And it was before people started scanning PDFs of books

and posting them online, effectively taking the intellectual property of authors and publishers and giving away for 

free what they worked so hard to write and produce. That’s why the production of technical books has slowed so 

dramatically - the rampant theft of the product delivered has eliminated much of the profit so the main benefits left are altruism,

self-education, and branding.


When doing an update of a prior version of a book I had written, for a topic I knew well, I would estimate 1 hour per page.

For a 1000 page book, that’s half a year of work time. It requires a high degree of attention to detail for months on end.

You cannot take pages off. You have an obligation to your readers to get it right. It is hard work. Remember writing a 20-page

term paper? Try writing 50 of them in a row that all go together in a coherent fashion, for a worldwide audience, and that’s sort 

of what it’s like - 

and you are sacrificing your work hours and your family time to do this -

and then someone takes what you wrote and decides they have the right to make a copy of it and give it away for free.

They decide they have the right to violate copyright laws, post scans, make copies, and deny the author compensation 

for the knowledge and effort that went into writing that work. 

The market reacted, and the authors slowly invested less time. Which is a shame, because I know how those books

helped enable career changes and personal development.

The new model I am working on is based on self-education and altruism. The new book I am working on is set up to encourage 

collaboration and contribution and constant development. The structure is simple and allows for the benefits to be more quickly 

realized - so the implementation and design patterns can be commonly shared, and we can move forward together.

The June 2015 edition of The Atlantic includes a short excerpt from The Dorito Effect relating a story of four Danish 

scientists who in 2002 examined 3.5 million grocery transactions to try to understand the health benefits of drinking wine.

What they found was that wine drinkers shop differently than beer drinkers - they eat more olives, low-fat cheeses,

fruits and vegetables, spices, and low-fat meats; those things go with wine, after all. Beer drinkers were more likely

to reach for chips, ketchup, margarine, sugar, ready-cooked meals, and soda.


So while the health benefits of drinking a glass of wine (or cup of coffee, or some other individual thing) may be cited,

exploring the data shows that that’s only an indicator of something else at work - in this case, that when you drink a beer 

you’re more likely to order a burger and fries, and when you order a glass of wine you’re more likely to order the salmon.

Red wine may get the headline, but it is the related effects that have the impact, and those are the important factors

for the data scientists to unearth - and to find other ways to influence.
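
A minimal sketch of that kind of exploration, in Python. The counts below are invented purely for illustration and are not from the Danish study; the basket labels and the rate function are hypothetical names. The point is only that a gap between wine and beer drinkers can appear in the totals while vanishing once you compare shoppers with the same kind of basket.

```python
# Invented counts for illustration only; NOT from the Danish study.
# Each row: (drink, basket type, shoppers, shoppers with a good health outcome)
rows = [
    ("wine", "produce-heavy", 80, 64),
    ("wine", "snack-heavy", 20, 8),
    ("beer", "produce-heavy", 20, 16),
    ("beer", "snack-heavy", 80, 32),
]


def rate(rows, drink, basket=None):
    """Share of shoppers with a good outcome, optionally within one basket type."""
    picked = [r for r in rows if r[0] == drink and (basket is None or r[1] == basket)]
    shoppers = sum(r[2] for r in picked)
    good = sum(r[3] for r in picked)
    return good / shoppers


# Overall, wine looks better than beer...
print("overall:", rate(rows, "wine"), rate(rows, "beer"))

# ...but within each basket type the two groups are identical.
for basket in ("produce-heavy", "snack-heavy"):
    print(basket, rate(rows, "wine", basket), rate(rows, "beer", basket))
```

In this toy data the drink is only a marker for the basket; the basket is what carries the effect.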

I spent a good part of the day listening to presentations from high school freshmen. With some practice and mentoring,

I believe all presenters can improve the value they deliver significantly. 

Next presentation: Impala: Introduction and Beyond, with Nitin Bokka and Radhika Singareddy, at the Hadoop forum.

Roughly 2/3 of the session will be hands-on demos including Hive, Impala, and Parquet.

Much of predictive analytics is based on the concept that future behavior can be inferred from past

behavior. The danger is that clustering people based on attributes (say, single working mothers in Baltimore who

shop 3x a month on Amazon) doesn’t lead to a single set of predictions that apply to every member of the set.

Past behavior qualifies you to be included in a set, to be classified appropriately.  Understanding your current position

as a point along a path, and predicting that path, is the data science role. Adding more variables (age, credit score,

number of children, health attributes, etc.) would, ideally, lead to further narrowing of the risk that you have 

mistakenly inferred something about the person’s future path. In human development, two people may start 

with the exact same genetic makeup but one may be subjected to environmental effects the other never experiences, leading

to an array of ailments or life changes; as we develop models for categories of individuals and try to predict their paths,

data science can never forget about the individuals.

Work is moving well on the new Hadoop-based site.

Current project: developing a new Hadoop-based site. Also, the momentum for the Delaware Hadoop Users Group

is growing; announcements coming soon.

Learning Spark is not a simple book for brand-new beginners to use. I’ve already worked with Spark in virtual machines, gone

through comprehensive Python scripts and Scala examples, and done basic admin for clusters. I’ve set up Spark Streaming and done 

the basics for data capture in RDDs and RDD processing. There’s a lot that goes into just getting things set up right so you can use it 

properly. Learning Spark is best for readers who have already learned Spark and played with it.


The problem is that a paper book is probably not the right delivery mechanism for teaching Spark. The reader has to be

able to install Hadoop, Spark, shells, programming languages, and configure them properly. You should know MapReduce

concepts and/or language, and the programming languages involved. The examples in the book then have

to cover the language you’d use (Java, Python, or Scala) but by default you will not use the examples from the two

other languages shown. As an author I sympathize with the authors. As a reader I think this is really two books - a reference guide

and a tutorial. As the book bounces back and forth between being a reference guide and a tutorial, the authors try to convey

the most important aspects of each option available. I think a different format would have served them better; a purely technical

reference guide on architecture and design/implementation patterns could be 90 pages and highly valuable. A walkthrough for 

newbie developers could be an online VM pre-loaded with examples in the language of their choosing. By trying to be both,

this book does not quite meet the needs of either audience. You’ll find one word count example, one CSV example, etc. The later

chapters on Spark SQL and Spark Streaming (at 30 pages, the longest chapter) provide a greater level of detail on those tools.

If you are brand new to Spark, you will need to test things out on your own for a while and come back to the book again.


This is not to say it is without merit; it is the only work out there on Spark, and one of its coauthors wrote Spark. You should see it as the 

prequel to “Design Patterns for Spark” that someone should write. Having sat in on Cloudera’s Spark 

training recently, I can say there are definite scope differences; the book covers MLlib, while the Cloudera training emphasizes 

how to apply flatMap and other functions in many coding exercises. 


Lastly, I have had to repeatedly correct people who tell me that Spark is the next Hadoop; that tells me that they do not 

understand either Spark or Hadoop. If you want to cast it as the next MapReduce - even then you are only seeing one part  

of the picture.

Cisco has decided to join the Big Data party, and has announced reseller partnerships with Hortonworks, Cloudera, and MapR.

In the UCS announcement, they refer to the “internet of everything (IoE)”. One supposes that the IoE is the successor to the 

Internet of Things (IoT). This in turn raises concerns, because it implies that there are things (in the IoE) that are in everything

that are not themselves things (or else they would have been in the IoT). The Venn diagram is askew. 

Neither marketing term is descriptive or useful. The terms are so broad that when devices are rolled out they will have to 

have names that imply their niches for the things they communicate with - a concept and scenario that is not novel.  

Cisco announcement: 

For fun, calculate the NPS for and report it to them. Use existing social media feeds where people

post complaints and praise to compare prior year NPS values to prior year feed activity, and then extrapolate that

to a predictive model.
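
For anyone who wants to try it, a minimal sketch of the NPS arithmetic in Python. It assumes the usual survey form: a 0-10 "how likely are you to recommend us" score, with 9-10 counted as promoters and 0-6 as detractors. The function name and the sample scores are hypothetical, and turning social media posts into scores is a separate modeling step that is not shown here.

```python
def net_promoter_score(scores):
    """Return the NPS (from -100 to 100) for a collection of 0-10 survey scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)   # 9s and 10s
    detractors = sum(1 for s in scores if s <= 6)  # 0 through 6
    # Passives (7s and 8s) count toward the total but toward neither group.
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey responses, invented for illustration only.
sample = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(sample))
```

Running the same function over each prior year's responses gives the series you would then line up against that year's feed activity.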

Learning Spark is now in print. Since one of the coauthors is the creator of Spark, it’ll be hard to find a more

authoritative author list.

If you haven’t seen Harrison Craig’s audition video, it’s worth watching.

From an organizational perspective, there is an interesting dynamic - the performer picks the mentor/manager. 

What would a company look like if the highest-performing individuals were allowed to pick the managers they worked for?

Quite possibly it would be skewed, with the highest-performing individuals clumped under the most innovative, best delivering,

highest integrity, best mentoring managers. The organization would self-select its best leaders. It would be unbalanced.

And that might be a good thing for some, because not all leaders are equal; but at the same time, not all leaders can effectively mentor 

their people, and not all students effectively try to develop themselves or have the raw talent. If Harrison doesn’t have these skills 

coming in, no coach can invent those skills in him. 

If they are denied access to a mentoring, high-performing manager, will high performers choose to leave the 

company in search of a better mentor elsewhere? Any reorganization that does not consider what the choices would be if 

the students picked the teachers will create risks. Because students will look at their new managers and 

wonder what they will be learning, and if they could be learning more on another team. They should be eager, and enthusiastic,

challenged, and growing. If you are a manager and that does not describe your directs, then part of your job is being left undone.

Third-grade math puzzles are the clickbait of LinkedIn. If only something more substantive could be generated.

I’m reminded of the TED talk in which the speaker was using the CAPTCHA code interpretations to decipher

text from hard-to-read manuscripts.

Every new processing engine for Hadoop feels compelled to start its demonstration of capabilities by

executing the word count MapReduce job. It’s the “Hello, World!” of big data, and just as unfortunate. There should 

be a more challenging or relevant example. Let’s start with sentiment analysis…
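
For reference, the whole word count pattern fits in a few lines, which is part of why it says so little about an engine. This is a plain-Python sketch of the map, shuffle, and reduce steps, not the code any particular engine ships; the function names are hypothetical.

```python
from collections import defaultdict
import re


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for a single word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


lines = ["Hello World", "hello big data world"]
print(word_count(lines))
```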

For the record, Doug Cutting pronounces it had-oop. I think we can consider that authoritative, since he 

wrote it. Lots of good discussions on its future this week.

Speaking of which, there’s a Delaware Hadoop Users Group I’ll be kicking off shortly.

Bart Baesens’ book Analytics in a Big Data World is a case where the title almost undersells the contents.

Following a first chapter that is completely forgettable (it’s unfortunate, since my editor will tell you it’s the most 

important chapter when it comes to selling the book to a prospective reader), he launches into 130 solid pages 

that cover descriptive, predictive, and survival analytics, followed by use cases. The material is technical

enough that analysts will be able to follow it; the technical staff working with them will be able to understand 

why they’re doing it.


The later sections, on applying these techniques to big data applications/environments, are sparser than

expected given the level of detail in the earlier chapters. You’ll see where to go, but it would have been nice

to see more concrete examples, code, finished products, and design patterns to follow.

This field as a whole suffers from incomplete product sets being used to assemble complex works of art

for which design patterns have not been vetted in production for very long. Baesens helps show what 

questions are possible to ask. Whether your design can support them is another matter. 


Recommendations on learning to code:

One of those references kids’ tools.  I would not underestimate them.  You should see what the STEM

class kids do with Raspberry Pi devices.

Also, although it is not purely a coding site/app, check out the Coursera app.


Next appearance will be in Philadelphia on 2/9, at a Cloudera panel on the future of Hadoop.


It seems kind of wrong to assign a rating to The Art of War. I read it chiefly as a comparative piece to see

how it tied to The Art of Action.  There are definitely places where they agree explicitly; on the need for clarity

in commands, for example, or in the general definition of the conduct of a good general.  But Sun Tzu’s generals

deliberately place their soldiers in harm’s way so they will fight their way out: “Place your army in deadly peril and

it will survive; plunge it into desperate straits and it will come off in safety.” (ch 11).  That presumes those soldiers have first 

been given all the advantages of terrain and timing Sun Tzu demands, but it is still a motivating factor not commonly

read in leadership books that claim a strong parallel between business and war (as The Art of Action does). 


Business may think it is war, but war does not think it is business.  Going back to Command and Control, to quote Gen

Curtis LeMay: “I’ll tell you what war is about. You’ve got to kill people and when you kill enough of them, they stop fighting.”


The folks at Grantland weigh in on women in computing, drawing mostly from The Innovators and highlighting

some of the non-white-male innovators:

Rohit Menon’s Cloudera Administration Handbook isn’t perfect, but it is as accessible a book as you’ll find

for new developers and administrators. The steps are in the proper order, and they work. The sentences are

declarative and accompanied by appropriate graphics and simple flow diagrams. It’s very much written for 

beginners but at the same time assumes you can run challenging commands. Recommended for new Hadoop admins

and for developers who wonder what is going on underneath.  

As an author, I appreciate when an author takes the time to write clear, declarative sentences. When writing

material for beginners, include diagrams to simplify visualizing how components fit together. Provide a 

logical flow to the material.

If you are looking for a great example of this, I recommend Rohit Menon’s Cloudera Administration Handbook.

Not only should this book be handed to all new Hadoop users and admins; it should be handed to people 

considering writing books for beginners. Rohit’s site is


(My assumption is that the index was automatically created by the publisher via a program that looked for 

bold terms, which is why MIT is in the index but the fair scheduler is not. That happened to me before.)

Given a new Hadoop environment to browse around and get the feel of, the quickest port-agnostic

approaches would be 

1 - going through the namenode screens on port 50070 to see what users and directories are 

out there, what is working, and what is failing.

2 - use the hadoop dfsadmin -report command (it’s deprecated in Hadoop 2 but still works) to see the 

nodes involved. The report will show the space usage and pct free by node as well.

3 - hadoop dfsadmin -printTopology (also deprecated but still works).


The rest of the approaches involve using the native toolsets for the environments, such as the 

Cloudera management tools. 


Chapters 6 through 12 of Larman & Vodde’s Scaling Lean and Agile Development provide a good level

of detail on Agile. They are significantly different than the whole first half of the book; they are much better

written, with more coherent flow to them. Of course, people who are buying the book to learn how to scale 

agile & lean probably already know the information in these chapters.


Overall, the first 5 chapters of the book should be read as bullet lists - PowerPoint slides transformed into

pages. Use them to determine what topics you want to pursue further, and use their references to go find 

further information. Don’t expect any level of detail or analysis. For example, Goldratt’s Theory of Constraints

gets this writeup, which I swear I am not making up:

“Basic TOC has appealing logic to it, such as focusing on the major bottleneck and reducing it.  Try that.”


It goes on from there - a two sentence paragraph in which the second sentence is “No problem.” followed by 

a longer paragraph in which they shared actual experiences with two separate TOC attempts that failed when they

were applied to project management. Exactly what failed is not shared, other than that it was heavy and not agile

and not lean. But TOC is not contrary to lean, and no details are given, so it’s hard to puzzle out the learnings.



The authors of this book have apparently subsequently published a Practices book. It’s a shame more

practices were not integrated into this work.


Larman & Vodde’s Scaling Lean and Agile Development is an unusual book to review. It’s more of a 

collection of paragraphs than a book. There is little consistency across chapters or even within chapters,

so as you are reading there are sudden jumps and ad hoc comments that are unrelated to the rest of

the material. It reads like a series of ppts that were written into paragraph form, with more editing needed.

For editing fun, compare Figure 3-7 to Figure 4-4.


As for content: its chief fault is that it does not include practical details for implementing the practices it describes. 

On the Amazon reviews, the book’s reviewers defend the authors and then cite a companion book that has 

just come out. As a standalone book, I think this book prompts many more questions than it answers. They’re good

questions, but they are unanswered here.


Searching for part-r-00000 returns over 2 million hits, some of which are people’s MapReduce result sets.

Google and Cloudera bring Cloud DataFlow to Apache Spark:

Entry price point for a Hadoop cluster is now $170. Yes, you can create a Hadoop cluster on Raspberry Pi 

devices. For example:


Interim book review notes:

Scaling Lean & Agile Development gives a good overview of agile and lean methodologies and tools.  In its 

print form, however, a number of the graphics are illegible - they are photos of whiteboards, and they come 

out as grey-on-grey blobs when printed. It does a disservice to otherwise worthwhile content when the 

illustrations of the concepts do not illustrate anything. Surely I’m supposed to be able to read them…

As an author or editor I would never have signed off on those as final page proofs.


Overly optimistic problem solvers tend to jump ahead to solutions before asking if the problem we’re working

on is the right problem. For example, how do I scale agile for my overseas teams?

Larman & Vodde’s Scaling Lean & Agile Development starts off with blunt advice about

large, multisite, offshore development: “don’t do it.  There are better ways to build large systems than with

many developers in many places…”

But since you’re going to do it anyway, against their advice on page 1, they then walk you through approaches

and experiments, including a theory of constraints analysis.


An excellent weekend to see Selma. 

There’s an odd connection between it and my recent reading list. Curtis LeMay, who is portrayed in

Command and Control as the tough-as-nails leader who basically creates the US Strategic Air Command

and makes it operational, once ran for Vice President of the US on the American Independent Party

ticket. The Presidential candidate on that ticket? George Wallace, the governor of Alabama during the 

Selma march. It was the 1968 election, and they were soundly defeated - a poor end for LeMay, who 

had devoted so much of his life to the country’s defense (having initiated the Berlin airlift earlier). LeMay did 

not share Wallace’s social views but joined the ticket to oppose the war approaches of the two major-ticket



The most repeatedly innovative companies are not the ones with lone geniuses. They are the ones that promote 

the interaction of interested people from all walks of technical life and customer interaction all the time. They configure

their offices and their org structures and their business practices around this. A company is a system, and the 

system is designed to generate an output. Is your company structure designed to keep things running? To 

control costs and risks? To minimize the distance between customers and technology? To keep all aspects of 

technology and customers near each other? To keep its own processes functioning? To minimize costs and 

emphasize profit margins of existing products?


Ultimately the company rewards itself for its behavior - and that behavior will be repeated. Is the rewarded 

behavior based on innovation or on process execution? 


What was the last thing you played a part in inventing? And how did your company celebrate that?



“Can a crocodile play basketball?”


Ask a computer a difficult factual question that humans have already stored in it, or to extrapolate a fact

based on known trends, and we can expect accurate results - market basket analysis results, historical

facts, geographic facts, etc. But while your young children can tell you that a crocodile can’t play basketball,

a computer can’t deduce that simple fact on its own. True AI has been a grail for decades, while the real gains 

have been in the knowledge augmentation gained from networking millions of computers, libraries, wikis,

and people. 


Isaacson’s book concludes with comparisons of AI and knowledge augmentation. That’s unfortunate, since

there are a lot of other directions he could have gone with it, and they would have been more valuable to the 

readers. For example, why were these specific innovators successful when others failed? For the innovators

profiled, some had spectacular failures as well - Jobs was forced out of Apple, NeXT drifted until it was saved

by Perot, IBM lost control of the PC market, what happened with HP printers, Compaq PCs, AOL…what

happened to the Altair, the Commodore 64, Atari? Why do market-dominating innovators lose their edge? If

you study what changed that took them out of that innovative position then you can further isolate what it was

that got them there in the first place. In many cases, it is the willingness to fully commit to the right idea. As the saying goes

(sometimes attributed to Edison, and quoted by Case), visions without execution are hallucinations. The Innovators

could spend time on failed innovations, and those could be even more useful as learnings.


Isaacson wraps up with AI, I believe, because as a writer he is trying to wrap back to the beginning of the book

(to Ada Lovelace). That’s really unnecessary here - this isn’t a novel. A stronger ending would be to take us back

into the ENIAC labs, on a cot, next to the engineers who were programming it throughout the night by switching 

wires and contacts around.  


Continued impressions from The Innovators:


A lot of times, the “light bulb” moment for an innovator is simply the realization that a lot of work is needed, and the

dedication to follow through with it. The story of the origins of Microsoft follows this pattern - the realization that the

home computing industry was starting and they needed to get in it launched Gates and Allen into intensive activity

that pretty much did not relent for years. They wrote the first BASIC for the Altair by putting in the hours, challenging

each other to code more efficiently, building software emulators for chips, and working nonstop. Innovation requires

many things - and chief among them is the willingness to fully dedicate the right people to a task they fully own

and are accountable for.


“The best way to predict the future is to invent it.” - Alan Kay, as quoted in The Innovators.


And if you don’t know who Alan Kay is, use your PC, laptop, or tablet to look him up.

Not everyone is in favor of seating arrangements intended to spur innovation:

Her focus is on her productivity, not on innovation or cross-team functions.


Signed up for the Delaware half marathon. This will be my 7th year leading this as a fundraiser, having 

run the course 5 times as a half marathon, once as a full marathon, and once when it was a 10 miler.

In further reviewing innovative companies through tech history: Bell Labs, Intel, Apple, along with others 

outside of the tech world, commonly structured their physical spaces to encourage interaction among

people of various disciplines. Isaacson’s biography on Jobs talks about this a lot; how he agonized over the 

pathways people would walk during the course of their workdays, and how they would most naturally 

interact with others outside of their work areas. Bell Labs set up long corridors. At Intel, Noyce took a small

desk in the middle of a sea of desks. 


Now think about agile work spaces. Agile work team spaces are optimized for delivery. Each work team is 

focused on one specific technology function, delivering for one small technology area, continuously refining

the backlog for that area. A typical agile work team has 6 to 8 developers and a total of 10 to 12 people

altogether including testers, scrum master, and non-development personnel. The agile developers tend

to be highly specialized because of this org design; on the Web Java team, given only 6 developer spots 

you’d expect all 6 of them to be Web Java developers. And tables of agile teams tend to be deployed 

together, in pods of similar development teams - so all of the Web Java agile teams are near each other,

for instance.


But for innovation to occur, I’d want the people designing the middleware services to be sitting next to the 

Oracle DBAs so they can understand how the databases are being used. I’d want the DBAs sitting near

the configuration management team so they can understand how deployments work and so they can talk

about the deployment issues for applications at the design stage. I’d want the performance management 

team sitting in the center, next to the architecture team, so everybody going to the elevator has to trip 

over them and has to see every monitor they have up and has to stop and ask what those lines mean. If 

each of these job functions is on a separate agile scrum or Kanban team, that incidental interaction only 

happens if their tables are co-located in the same pod. If we locate all of the Web Java teams next to each other

in an isolated pod, we lose that opportunity for unstructured innovation.

Agile development optimizes development. For it to benefit unstructured innovation, the agile teams should be 

located to force collaborative interaction, either physically (as Apple and Bell Labs did) or organizationally via 

task forces, cross-team programs, special initiatives, etc. We can then celebrate the successes of the teams, 

and at the end of the day the whole can be greater than the sum of its parts.


Interim thoughts from reading The Innovators:

The early days of the digital revolution featured a more collaborative and academic environment than we have 

now. Grace Hopper’s initial compilers were sent out to colleagues for their review - the first open source 

software. John von Neumann led seminars with a wide range of physicists and engineers and then published 

the results widely with the intent that the publication prevent patentability of key concepts by the individuals. 

It was a different time, especially between WWI and WWII, but it should not take a world war for individuals to 

collaborate without the benefit of IPOs.


The integration of the physicists and engineers occurred on campuses, and then in government projects (see

Brighter than a Thousand Suns for the history of the Manhattan Project). When the engineers’ voices are 

heard by the architects, and the architects’ visions are understood by the engineers, we can eliminate projects 

and features that we know are not needed or are not going to work. It is the role of tech leadership to facilitate 

these communications and drive them to occur. We can’t build a coherent product unless we have a coherent

understanding of what we are all trying to accomplish and why. What is not clear will not get done.


Final book completed for the year: The Art of Action by Stephen Bungay. It was mentioned in a weekly blog 

entry by an executive at Barclays, in reference to a speaker who had been in charge of UK military operations. 

Bungay is a military historian and a business consultant, and he makes it clear from the outset that he sees little 

difference between the two. His approach is geared toward developing a corporate leadership style that is 

opportunistic and direct. While you might expect a book like this to get more vague as it goes along - starting 

with a couple of historical examples, generating some concepts, and then leaving the rest to the reader - Bungay 

instead keeps providing detailed maps to follow when implementing the approaches to closing the knowledge 

gap, the alignment gap, and the effects gap.


In many companies, new leaders are promoted without education in the field. They are placed in positions of 

responsibility without guidance and we are then surprised that they continue to function as they always have. 

If we do that to new leaders, we do them a disservice. We should at a minimum start them out with 

The Art of Action and Goldratt’s The Goal so they can understand both their mission and how they need to act 

in order to accomplish it.


That is the least we can do.

Bungay’s The Art of Action is really a compelling book.  While some books run out of material and get less 

concrete as they go along (leaving the reader to figure out what to do with the general principles mentioned 

at the start), he actually gets more concrete as he goes along.  I feel like I should get copies of this for every 

manager I’ve had.


The translator’s notes starting on page 691 amount to the first computer program ever written, including 

the first nested loops and subroutines, a century before the computer existed to execute them.  The author was 

Ada Byron, Lady Lovelace:


Finished reading Command and Control.  As a follow-on to earlier comments: 

In the last sections, he also brings up the issue of the risks inherent in tightly coupled designs.  This is a known 

issue in industrial designs (chemical plants, processing plants, etc), and also shows up in software - a problem 

in one component is not isolated from impacting components elsewhere, and may in fact impact the ability to 

monitor the first impact.


The example Schlosser cites is from another book, Perrow’s Normal Accidents, in which two failures could interact 

such as to both start a fire and silence the fire alarm.  In software, as code repositories age and morph, their 

problems tend to become more complex - we didn’t know that when X and Y and Z all happened at the same 

time, that would be a bad thing, and Z had disabled the monitor for X’s fault condition…

If that sounds like a made-up scenario, try asking for a similar scenario during your next interview of a Java tech lead.
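
The failure mode described above can be sketched in a few lines. This is only an illustration, not anything taken from Schlosser or Perrow: the PowerBus, Pump, and Alarm names are hypothetical, and the assumption is simply that a component and its monitor share one dependency. When that dependency fails, the fault starts and the monitor goes quiet at the same moment.

```python
class PowerBus:
    """Hypothetical shared dependency that several components draw from."""
    def __init__(self):
        self.live = True


class Pump:
    """Hypothetical component: it overheats when its power bus is down."""
    def __init__(self, bus):
        self.bus = bus

    def overheating(self):
        return not self.bus.live


class Alarm:
    """Hypothetical monitor: it can only sound while its own bus is live."""
    def __init__(self, bus, pump):
        self.bus = bus
        self.pump = pump

    def is_sounding(self):
        return self.bus.live and self.pump.overheating()


def undetected_fault(pump, alarm):
    """True when the pump is in trouble and nothing is reporting it."""
    return pump.overheating() and not alarm.is_sounding()


shared = PowerBus()
pump = Pump(shared)
coupled_alarm = Alarm(shared, pump)       # tightly coupled: same bus as the pump
isolated_alarm = Alarm(PowerBus(), pump)  # monitor on its own supply

shared.live = False  # one failure on the shared bus

print(undetected_fault(pump, coupled_alarm))   # True: fault with no alarm
print(undetected_fault(pump, isolated_alarm))  # False: the isolated monitor still reports
```

The only difference between the two alarms is what they depend on, which is why this kind of problem tends to be found in a design review rather than in a unit test of either component.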

I started reading Command and Control for what it has to say about how organizations manage complex systems. 

Then as the story developed it became about how you monitor highly dangerous components that are now being 

used in ways they were never designed for.  But in agile terms, it is a story about outrageous technical debt, and 

the very real penalty paid for not resolving it.


Many sites I’ve heard from have created recursive data warehouses without even thinking about it.  They’ve created 

a traditional Inmon/Kimball data warehouse that feeds data through staging/integration/ADS layers with the ultimate

destination being a data mart that users query.  The data lands in multiple places along the way, with a cost at each

landing site (both in terms of tier 1 storage and, more importantly to the business, in terms of data latency as each 

of these successive data movement steps completes).


The data is then queried from the data mart (its fourth landing area) by SAS users.  They query multiple tables (creating

their own SAS-based staging area), integrate them within SAS, and do custom reporting (SAS marts). Within SAS, they

recreate the Inmon/Kimball approach, using the enterprise data warehouse as the input to their warehouse.  And in this

model, the data is now landing in at least 7 different places before it actually generates business value.  This is designed

to be a sequential process that will only get slower as the business and data volumes grow.


Big data, properly implemented, lets you solve the business issues (data latency and tier 1 storage costs).  What if, 

in your enterprise data warehouse, you move the staging and integration layers to Hadoop, then create a virtual mart

via Hive that gives the SAS users direct access to those tables?  The SAS users would bypass 5 landing zones for 

the data, and 5 steps that added data latency and storage costs.  There are certainly issues that come along with 

this configuration - security, tuning, resource sharing, workflows, new tools, training - but the payoff is a systemic fix 

and the removal of a recursive data warehouse.  The core issue with a recursive data warehouse is a design issue, 

and the core fix is a design fix.
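
As a rough sketch of the arithmetic above: the landing-zone names follow the description, but the per-hop latency is a made-up number, and treating the Hive mart as a view that adds no landing zone is my reading of the proposal, not a measured result.

```python
# Hypothetical latency per data movement step, in hours (illustrative only).
HOP_HOURS = 2

# The recursive warehouse: four landings in the enterprise warehouse,
# then three more when SAS users rebuild the same layers downstream.
RECURSIVE_PATH = [
    "EDW staging", "EDW integration", "EDW ADS", "EDW data mart",
    "SAS staging", "SAS integration", "SAS mart",
]

# Assumed reading of the Hadoop design: staging and integration land in
# Hadoop, and the Hive mart is virtual, so it adds no landing of its own.
HADOOP_PATH = ["Hadoop staging", "Hadoop integration"]


def landing_cost(path, hop_hours=HOP_HOURS):
    """Return (number of landings, total sequential latency in hours)."""
    return len(path), len(path) * hop_hours


before = landing_cost(RECURSIVE_PATH)
after = landing_cost(HADOOP_PATH)

print("recursive warehouse:", before)
print("hadoop + hive view:", after)
print("landing zones bypassed:", before[0] - after[0])
```

Because the steps run in sequence, latency grows with every landing zone added, so removing hops is what pays off, whatever the real per-hop numbers turn out to be.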

Each new study of big data projects reports more of them failing than the last.  This trend is troubling for

multiple reasons, chief among them that the overall failure rate is escalating even though the people initiating 

these efforts are among the most analytical you can find.  What is happening?  My next major talk may be on this

topic - why big data projects fail.

One of the recurring themes for these failures is that the projects are initiated without a clear ROI case - maybe you

started doing big data because your competitors were doing big data and you didn’t want to be left behind.  The initial

business case was to replicate an existing business activity with new technology…

So, think about that - in many cases, your initial project is planning to solve yesterday’s problems with tomorrow’s

technology.  With the highest priced resources out there.  With open source software.  With first generation toolsets. 

It’s difficult, and when the bills start coming in, someone finally asks about the ROI.  If you have not done a pilot that

demonstrates how you plan to use this new capability, your ROI consists of future operational savings when you replace

your existing analytics with the big data version; and that may not be enough to keep the project moving forward.  Make

the business case first, in dollars and in new business capabilities.  Make yesterday’s solutions scale while enabling

analysis you could never have performed before, differentiating you from your competition.


Taylor & Vargo's Learning Chef is now available in print. It's been available on safaribooks for some time now, and it's

a must-read for those getting up to speed on enterprise-scale infrastructure delivery. Solid introduction for novices, 

plus recipes to use as templates.




12/14: re-starting the site blog.  See my LinkedIn page for book reviews from earlier in the year.