gid

August 10, 2006

Horizontal vs. Vertical Database Design

I was at one point tasked with developing the argument against going with a vertical database design for a particular project. In my research one of the main reasons for going vertical was the null saturation percent of a database, which was caused by the number of columns added to support a very large eCommerce database. As some research projects go I never got a chance to finish, and because I have a feeling that one day in the future I will have to revisit this subject I am archiving what I had done out here on my blog.

The one neat thing that did come out of my research was this query that I wrote to get the null saturation of a MS SQL Server database. If you are curious about the null saturation of a SQL Server database then feel free to play with this query. Depending on the size of the database it could take up to 5 minutes or longer to run. I'm sure the query could have been written better, but it was written for research not as a production piece of code. Also, this query can not be run under the master database.

Download Query (the query in this file will make more sense because it still has some formatting to it)

--Written By David Gidcumb

create TABLE #NumColumnsRowsPerTable(

ID INT

, Name varchar(100)

, Columns INT

, Rows bigINT

, TotalCellCount bigint

, NonNullCellsPerTable bigINT

, NullsPerTable bigint

)

 

create TABLE #Columns(

Name varchar(100)

, NotNullCount bigint

)

 

--Number of Columns per table

INSERT INTO #NumColumnsRowsPerTable

select so.ID

, so.Name

, count(*)

, 0

, 0

, -1

, 0

from sysobjects so

left join

syscolumns sc

on so.ID = sc.ID

where so.type = 'U'

and sc.xtype not in(35,36,34,99) --none nullable types

group by so.name, so.ID

 

Declare @tablename varchar(100)

Declare @columnname varchar(100)

Declare @SelectCMD varchar(2048)

Declare @SelectCMD2 varchar(2048)

Declare @SelectCMD3 varchar(2048)

-- Declare the Cursor for getting Row Count

DECLARE GetRowCount_table_names_cursor CURSOR FOR

SELECT Name

FROM #NumColumnsRowsPerTable

 

OPEN GetRowCount_table_names_cursor

FETCH NEXT FROM GetRowCount_table_names_cursor

 

INTO @tablename

 

WHILE @@FETCH_STATUS = 0

BEGIN

 

set @SelectCMD = 'Update #NumColumnsRowsPerTable set rows = (select count(*) from xxxTABLExxx), TotalCellCount =(columns * (select count(*) from xxxTABLExxx)) where name = ''xxxTABLExxx'''

select @SelectCMD2 = REPLACE ( @SelectCMD , 'xxxTABLExxx' , @tablename )

execute (@SelectCMD2)

 

INSERT INTO #Columns

select sc.Name

, 0

from sysobjects so

left join

syscolumns sc

on so.ID = sc.ID

where so.type = 'U'

and so.Name = @tablename

and sc.xtype not in(35,36,34,99)

 

Declare @NotNullCount int

-- Declare the Cursor for getting Null Count

DECLARE GetNotNullCount_cursor CURSOR FOR

SELECT Name

FROM #Columns

 

OPEN GetNotNullCount_cursor

FETCH NEXT FROM GetNotNullCount_cursor

INTO @columnname

WHILE @@FETCH_STATUS = 0

BEGIN

set @SelectCMD = 'update #Columns set NotNullCount =(select count(xxxCOLUMNxxx) from xxxTABLExxx where xxxCOLUMNxxx is not null) where name = ''xxxCOLUMNxxx'''

select @SelectCMD2 = REPLACE ( @SelectCMD , 'xxxTABLExxx' , @tablename )

select @SelectCMD3 = REPLACE ( @SelectCMD2 , 'xxxCOLUMNxxx' , @columnname )

print @SelectCMD3

execute (@SelectCMD3)

--select @SelectCMD3

-- Get the next table name

FETCH NEXT FROM GetNotNullCount_cursor

INTO @columnname

End

CLOSE GetNotNullCount_cursor

DEALLOCATE GetNotNullCount_cursor

 

Update #NumColumnsRowsPerTable set NonNullCellsPerTable = (select sum(NotNullCount) from #columns), NullsPerTable =(TotalCellCount-(select sum(NotNullCount) from #columns)) where name = @tablename

delete from #Columns

-- Get the next table name

FETCH NEXT FROM GetRowCount_table_names_cursor

INTO @tablename

END

 

CLOSE GetRowCount_table_names_cursor

DEALLOCATE GetRowCount_table_names_cursor

GO

 

select * from #NumColumnsRowsPerTable

 

select cast(((sum(NullsPerTable)*100)/sum(TotalCellCount)) as varchar(100)) + '% Null Saturation' from #NumColumnsRowsPerTable

drop table #NumColumnsRowsPerTable

drop table #Columns

Please feel free to ignore everything below the fold. It was a work in progress that is woefully incomplete.

History
In the past horizontal databases where used exclusively in database design. But, as e-commerce gained in popularity the shortcomings of the Horizontal design began to become apparent. At first the solution was to just add new columns to support the unique attribute of each new product being added. As new columns were added problems unique to a Horizontal design were encountered.
They were:
1. Max column count exceeded (DB2\Oracle 1012 SQL Server 1024)
2. Null saturation of a table would start to exceed 90%
3. Constant Scheme changes due to adding columns
4. Performance issues are encountered when a table is very wide and only a few columns are returned in the query
Vertical Database Design
The solution to the e-commerce predicament was to change the design by taking the attributes (column names) in the horizontal design and making them keys in the vertical design. This change solved all the issues, which occurred under a horizontal design. Though there are problems that resulted which are unique to a vertical database design.
1. Queries written against the vertical design became cumbersome and error-prone.
? Multiple joins to pull back each attribute
? Maximum of 256 tables for each select statement(SQL Server) run query contained in [256 Table proof for Horizontal design.xls] as example
? See file query.sql as example of what needs to happen to pull back data.
2. Current Application development tools are designed for storing data object in a horizontal format. Data displayed in a horizontal format in a ADO.Net datagrid will make no sense to a client. Data would have to be converted from vertical to horizontal to make sense. Again see bullet 1.
3. Logical horizontal views must be written on top of vertical representation
? SQL Server view performance is poor when compared to teradata, and views execution plan must be recompiled when accessed.
4. Data is not strong typed do to the nature of the vertical design, which causes casts or converts to be needed for most attributes (columns) being returned. The biggest issue is that for a large enterprise it removes the natural typing of data at the database level, and instead moves to responsibility to each developer who inserts and updates the data-store.
The decision on weather to use a Vertical or Horizontal data design really depends on the scope of a project. The decision really needs to be made not only by the Data Analyst, but also the developers who are going to be interacting with the data store on a daily bases. Chances are that the Data Analyst left to their own doing will produce a vertical design. It is much easier to design because if a new item needs to be added all that has to be done is a simple insert into a table. But just because it is simple to design it does not mean it is the most appropriate design pattern to apply.
Reason's to Use a Vertical Database design
? Due to the constant adding of products in an e-commerce system i.e. Amazon, eBay and the like, new products with unique attributes are added by adding new columns, which describe the unique attributes of each new product a vertical architecture simplifies the issue for the data analyst. Issues that Vertical Design negate
1. Column count can exceed that 1024 for Microsoft SQL Server
2. Inserts and Stored procedure parameters can contain a maximum of 1024 columns

Reasons to Use a Horizontal Database design .
.
.
.

Views 256 tables is the max in a select statement
Posted by gid at August 10, 2006 06:53 PM
Comments

Great. This was exactly what I was looking for. Thanks for putting this together.

Posted by: kettsar at June 17, 2008 05:14 PM

I'm having a big2 performance trouble with Vertical Database Design used for product management.

Here is the background:
Product 1--* Variation 1--* Product_Information 1--* Variation (for upsell etc)

As you can see from that simple relations to retrieve a product information for one product is a nightmare.

Plus, there is a database logic to optimise the storage so for example (if a product variation doesn't have the information which is stored vertically in the product_information it will use the default value from a parent product_variation)

Any suggestion of how to optimize it from Front End or SQL Select?

Personally In performance and storage the DB is small but when you perform a query with complex logic it required a long SQL select and for the front end it required multiple call to database.

Same pain when you perform a DBEntry programming (is a nightmare)

I tried to create simple join and output an XML from SQL. I think it will reduce the SQL Server CPU consumption but, will it kill the WebServer (if it is transformed programmatically), and i'm sure it will kill the user if I spit up the XML or make it a phisical product per product XML file accesible to public.

Would (XSLT + XML) from a physical file better than Database Call?

I wonder how Amazon done it with millions of products.

Posted by: Edwin at November 11, 2008 12:06 AM

Edwin,

XML in theory should never be faster than a DB call. I know I am working with a project now that uses a vertical data model extensively, and the only time there is a performance issue is when we are trying to pull back 100s of items at one time. What data base are you using? Are you using indexes where appropriate? Have you run statistics on the db?

Posted by: gid at November 11, 2008 07:15 AM
Post a comment









Please Enter Number:



Remember personal info?