출처 : http://the-paper-trail.org/blog/columnar-storage/


Columnar Storage

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.

Caveat: This is a personal, general research summary post, and as usual doesn’t neccessarily reflect our thinking at Cloudera about columnar storage.

Disks are still the major bottleneck in query execution over large datasets. Even a machine with twelve disks running in parallel (for an aggregate bandwidth of north of 1GB/s) can’t keep all the cores busy; running a query against memory-cached data can get tens of GB/s of throughput. IO bandwidth matters. Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMs and Impala) usually is to improve the performance of reading bytes from disk. This can mean decreasing the latency (for small queries where the time to find the data to read might dominate), but most usually this means improving the effective throughput of reads from disk.

The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster. However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.

The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem. Enter columnar storage, a principle for file format design that aims to do exactly that for query engines that deal with record-based data.

Columns vs. Rows

Traditional database file format store data in rows, where each row is comprised of a contiguous collection of column values. On disk, that looks roughly like the following:

Row-major On-disk Layout

This row-major layout usually has a header for each row that describes, for example, which columns in the row are NULL. Each column value is then stored contiguously after the header, followed by another row with its own header, and so on.

Both HDDs and SSDs are at their most efficient when reading data sequentially from disk (for HDDs the benefits are particularly pronounced). In fact, even a read of a few bytes usually brings in an entire block of 4096 bytes from disk, because it is effectively the same cost to read (and the operating system usually deals with data in 4k page-sized chunks). For row-major formats it’s therefore most efficient to read entire rows at a time.

Queries that do full table-scans – i.e. those that don’t take advantage of any kind of indexing and need to visit every row – are common in analytical workloads; with row-major formats a full scan of a table will read every single byte of the table from disk. For certain queries, this is appropriate. Trivially, SELECT * FROM table requires returning every single column of every single row in the table, and so the IO costs for executing that query on a row-major format are a single-seek and a single large contiguous read (although that is likely to be broken up for pipelining purposes). The read is unavoidable, as is the single seek; therefore row-major formats allow for optimal IO usage. More generally, SELECT <col_set> FROM table WHERE <predicate_set> will be relatively efficient for row-major formats if either a) evaluating the predicate_set requires reading a large subset of the set of columns or b) col_set is a large subset of the set of columns (i.e. the projectivity is high) and the set of rows returned by the evaluation of the predicates over the table is a large proportion of the total set of rows (i.e. the selectivity is high). More simply, a query is going to be efficient if it requires reading most of the columns of most of the rows. In these cases, row-major formats allow the query execution engine to achieve good IO efficiency.

However, there is a general consensus that these SELECT * kinds of queries are not representative of typical analytical workloads; instead either a large number of columns are not projected, or they are projected only for a small subset of rows where only a few columns are required to decide which rows to return. Coupled with a general trend towards very wide tables with high column counts, the total number of bytes that are required to satisfy a query are often a relatively small fraction of the size on disk of the target table. In these cases, row-major formats often are quite wasteful in the amount of IO they require to execute a query.

Instead of a format that makes it efficient to read entire rows, it’s advantageous for analytical workloads to make it efficient to read entire columns at once. Based on our understanding of what makes disks efficient, we can see that the obvious approach is to store columns values densely and contiguously on disk. This is the basic idea behind columnar file formats. The following diagram shows what this looks like on disk:

Column-Major On-disk Layout

A row is split across several column blocks, which may even be separate files on disk. Reading an entire column now requires a single seek plus a large contiguous read, but the read length is much less than for extracting a single column from a row-major format. In this figure we have organised the columns so that they are all ordered in the same way; later we’ll see how we can relax that restriction and use different orderings to make different queries more efficient.

Query Execution

The diagram below shows what a simple query plan for SELECT col_b FROM table WHERE col_a > 5 might look like for a query engine reading from a traditional row-major file format. A scan node reads every row in turn from disk, and streams the rows to a predicate evaluation node, which looks at the value of col_a in each row. Those rows that pass the predicate are sent to a projection node which constructs result tuples containing col_b.

Row Query Plan

Compare that to the query plan below, for a query engine reading from columnar storage. Each column referenced in the query is read independently. The predicate is evaluated over col_a to produce a list of matching row IDs. col_b is then scanned with respect to that list of IDs, and each matching value is returned as a query result. This query plan performs two IO seeks (to find the beginning of both column files), instead of one, and issues two consecutive reads rather than one large read. The pattern of using IDs for each column value is very common to make reconstructing rows easier; usually columns are all sorted on the same key so the Nth value of col_a belongs to the same row as the Nth value of col_b.

Columnar Query Plan

The extra IO cost for the row-format query is therefore the time it takes to read all those extra columns. Let’s assume the table is 10 columns wide, ten million rows long and each value is 4 bytes, which are all conservative estimates. Then there is an extra 8 * 1M * 4 bytes, or 32MB of extra data read, which is ~3.20s on a query that would likely otherwise take 800ms; an overhead of 300%. When disks are less performant, or column widths wider, the effect becomes exaggerated.

This, then, is the basic idea of columnar storage: we recognise that analytical workloads rarely require full scans of all table data, but do often require full scans of a small subset of the columns, and so we arrange to make column scans cheap at the expense of extra cost reading individual rows.

The Cost of Columnar

Is this a free lunch? Should every analytical database go out and change every file format to be column-major? Obviously the story is more complicated than that. There are some query archetypes that suffer when data is stored in a columnar format.

The obvious drawback is that it is expensive to reassemble a row, since the separate values that comprise it are spread far across the disk. Every column included in a projection implies an extra disk seek, and this can add up when the projectivity of a query is high. Therefore, for highly projective queries, row-major formats can be more efficient (and therefore columnar formats are not strictly better than row-major storage even from a pure IO perspective).

There are more subtle repurcussions of each row being scattered across the disk. When a row-major format is read into memory, and ultimately into CPU cache, it is in a format that permits cheap reference to multiple columns at a time. Row-major formats have good in-memory spatial locality, and there are common operations that benefit enormously from this.

For example, a query that selects the sum of two columns can sometimes be executed (once the data is in memory) faster on row-major formats, since the columns are almost always in the same cache line for each row. Columnar representations are less well suited; each column must be brought into memory at the same time and moved through in lockstep (yet this is still not cache efficient if each column is ordered differently), or the initial column must be scanned, each value buffered and then the second column scanned separately to complete the half-finished output tuple.

The same general problem arises when preparing each tuple to write out as a result of (non-aggregating) query. Selecting several columns at once requires ‘row reconstruction’ at some point in the query lifecycle. Deciding when to do this is a complicated process, and (as we shall see) the literature has not yet developed a good rule of thumb. Many databases are row-major internally, and therefore a columnar format is transposed into a row-major one relatively early in the scanning process. As described above, this can require buffering half-constructed tuples in memory. For this reason, columnar formats are often partiioned into ‘row-groups’; each column chunk N contains rows (K*N) to ((K+1) * N). This reduces the amount of buffering required, at the cost of a few more disk seeks.

Further Aspects of Columnar Storage

Fully column-oriented execution engines

Relevant papers:
C-Store: A Column-oriented DBMS
The Vertica Analytic Database: C-Store 7 Years Later
Materialization Strategies in a Column-Oriented DBMS
Performance Tradeoffs in Read-Optimized Databases
Column-Stores vs. Row-Stores: How Different Are They Really?

In this post, I’ve talked mostly about the benefits of columnar storage for scans – query operators that read data from disk, but whose ultimate output is a batch of rows for the rest of the query plan to operate on. In fact, columnar data can be integrated into pretty much every operator in a query execution engine. C-Store, the research project precursor to Vertica, explored a lot of the consequences of keeping data in columns until later on in the query plan. Eventually, of course, the columns have to be converted to rows, since the user expects a result in row-major format. The choice of when to perform this conversion is called late or early materialisation; viewed this way column-stores and row-stores can be considered two points on a spectrum of early to late materialisation strategies. Materialisation is studied in detail in the materialisation strategies paper above. Their conclusions are that the correct time to construct a tuple depends on the query plan (two broad patterns are considered: pipelining and parallel scans) and the query selectivity. Unfortunately, supporting both strategies would involve significant implementation cost – each operator would have to support two interfaces, and two parallel execution engines would effectively be frankensteined together. In general, late materialisation can lead to significant advantages: for example, by delaying the cost of reconstructing a tuple, it can be avoided if the tuple is ultimately filtered out by a predicate.

The difference between row-based and columnar execution engines is studied in the Performance Tradeoffs… and Column-Stores vs. Row-Stores… papers. The former takes a detailed look at when each strategy is superior – coming out in favour mostly of column-stores, but only with simple queries and basic query plans. The latter tries to implement common column-store optimisations in a traditional row-store, without changing the code. This means a number of increasingly brittle hacks to emulate columnar storage.

Compression

Relevant papers:
Integrating Compression and Execution on Column-Oriented Database Systems

A column of values drawn from the same set (like item price, say) is likely to be highly amenable to compression since the values contained are similar, and often identical. Compressing a column has at least two significant advantages on IO cost: less space is required on disk, and less IO required to bring a column into memory (at the cost of some CPU to decompress which is usually going spare). Some compression formats – for example run-length encoding – allow execution engines to operate on the compressed data directly, filtering large chunks at a time without first decompressing them. This is another advantage of late materialisation – by keeping the data compressed until late in the query plan, these optimisations become available to many operators, not just the scan.

Hybrid approaches

Relevant papers:
Weaving Relations for Cache Performance

Since neither row-major nor column-major is strictly superior on every workload, it’s natural that some research has been done into hybrid approaches that can achieve the best of both worlds. The most commonly known approach is PAX – Partition Attributes Across – which splits the table into page-sized groups of rows, and inside those groups formats the rows in column-major order. This is the same approach as the row-groups used to prevent excessive buffering described earlier, but this is not the aim of PAX; with PAX the original intention was to make CPU processing more efficient by having individual columns available contiguously to perform filtering, but also to have all the columns for a particular row nearby inside a group to make tuple reconstruction cheaper. The result of this approach is that IO costs don’t go down (because each row-group is only a page long, and is therefore read in its entirety), but reconstruction and filtering is cheaper than for true columnar formats.


출처 : http://www.smartdatacollective.com/kingmesal/386160/apache-drill-vs-apache-spark-what-s-right-tool-job


11
70
81

Image

If you’re looking to implement a big data project, you’re probably deciding whether to go with Apache Spark SQL or Apache Drill. This article can help you decide which query tool you should use for the kinds of projects you’re working on.

Spark SQL

Spark SQL is simply a module that lets you work with structured data using Apache Spark. It allows you to mix SQL within your existing Spark projects. Not only do you get access to a familiar SQL query language, you also get access to powerful tools such as Spark Streaming and the MLlib machine learning library.

Spark uses a special data structure called a DataFrame that represents data as named columns, similar to relational tables. You can query the data from Scala, Python, Java, and R. This enables you to perform powerful analysis of your data rather than just retrieving it. But it’s even more powerful when extracting data for use with the machine learning library. With MLlib, you can perform sophisticated analyses, detect credit card fraud, and process data coming from servers.

As with Drill, Spark SQL is compatible with a number of data formats, including some of the same ones that Drill supports: Parquet, JSON, and Hive. Spark SQL can handle multiple data sources similar to the way Drill can, but you can funnel the data into your machine learning systems mentioned earlier. This gives you a lot of power to analyze multiple data points, especially when combined with Spark Streaming. Spark SQL serves as a way to glue together different data sources and libraries into a powerful application.

Apache Drill

Apache Drill is a powerful database engine that also lets you use SQL for queries. You can use a number of data formats, including Parquet, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS, and more.

You can use data from multiple data sources and join them without having to pull the data out, making Drill especially useful for business intelligence.

The ability to view multiple types of data, some of which have both strict and loose schema, as well as being able to allow for complex data models, might seem like a drag on performance. However, Drill uses schema discovery and a hierarchical columnar data model to treat data like a set of tables, independently of how the data is actually modeled. 

Almost all existing BI tools, including Tableau, Qlik, MicroStrategy, Spotfire, SAS, and even Excel, can use Drill’s JDBC and ODBC drivers to connect to it. This makes Drill very useful for people already using BI and SQL databases to move up to big data workloads using tools they’re already familiar with.

Drill’s JDBC driver lets BI tools access Drill. JDBC lets developers query large datasets using Java. This has a similar advantage that using ANSI SQL does: lots of developers are already familiar with Java and can transfer their skills to Drill.

Easy Data Access in Drill

One of Drill’s biggest strengths is its ability to secure databases at the file level using views and impersonation.

Views within Drill are the same as those within relational databases. They allow a simplified query to hide the complexities of the underlying tables. Impersonation allows a user to access data as another user. This enables fine-grained access to the raw data when other members of your team should not be able to view sensitive or secure data.

Views and impersonation are beyond the scope of Apache Spark.

Conclusion

So which query engine should you choose? As always, it depends. If you’re mainly looking to query data quickly, even across multiple data sources, then you should look into Drill. If you want to go beyond querying data and work with data in more algorithmic ways, then Spark SQL might be for you. You can always test both out by playing around in your own Sandbox environment, which lets you play around with these powerful systems on your own machine.



Authored by:

Jim Scott

James A. Scott (prefers to go by Jim) is Director, Enterprise Strategy & Architecture at MapR Technologies and is very active in the Hadoop community. Jim helped build the Hadoop community in Chicago as cofounder of the Chicago Hadoop Users Group. He has implemented Hadoop at three different companies, supporting a variety of enterprise use cases from managing Points of Interest for mapping ...

See complete profile


'빅데이터' 카테고리의 다른 글

Columnar Storage  (0) 2016.07.14
Hello, TensorFlow!  (0) 2016.07.08
분산 로그 수집기 Fluentd 소개  (0) 2016.06.14
람다 아키텍처(Lambda Architecture)  (0) 2016.05.18
Lambda Architecture  (0) 2016.05.18

출처 : https://www.oreilly.com/learning/hello-tensorflow


Hello, TensorFlow!

Building and training your first TensorFlow graph from the ground up.

The TensorFlow project is bigger than you might realize. The fact that it's a library for deep learning, and its connection to Google, has helped TensorFlow attract a lot of attention. But beyond the hype, there are unique elements to the project that are worthy of closer inspection:

  • The core library is suited to a broad family of machine learning techniques, not “just” deep learning.
  • Linear algebra and other internals are prominently exposed.
  • In addition to the core machine learning functionality, TensorFlow also includes its own logging system, its own interactive log visualizer, and even its own heavily engineered serving architecture.
  • The execution model for TensorFlow differs from Python's scikit-learn, or most tools in R.

Cool stuff, but—especially for someone hoping to explore machine learning for the first time—TensorFlow can be a lot to take in.

Get O'Reilly's weekly data newsletter

How does TensorFlow work? Let's break it down so we can see and understand every moving part. We'll explore the data flow graph that defines the computations your data will undergo, how to train models with gradient descent using TensorFlow, and how TensorBoard can visualize your TensorFlow work. The examples here won't solve industrial machine learning problems, but they'll help you understand the components underlying everything built with TensorFlow, including whatever you build next!

Names and execution in Python and TensorFlow

The way TensorFlow manages computation is not totally different from the way Python usually does. With both, it's important to remember, to paraphrase Hadley Wickham, that an object has no name (see Figure 1). In order to see the similarities (and differences) between how Python and TensorFlow work, let’s look at how they refer to objects and handle evaluation.

Names “have” objects, rather than the reverse
Figure 1. Names “have” objects, rather than the reverse. Image courtesy of Hadley Wickham, used with permission.

The variable names in Python code aren't what they represent; they're just pointing at objects. So, when you say in Python that foo = [] and bar = foo, it isn't just that foo equals bar; foo is bar, in the sense that they both point at the same list object.

>>> foo = []
>>> bar = foo
>>> foo == bar
## True
>>> foo is bar
## True

You can also see that id(foo) and id(bar) are the same. This identity, especially with mutable data structures like lists, can lead to surprising bugs when it's misunderstood.

Internally, Python manages all your objects and keeps track of your variable names and which objects they refer to. The TensorFlow graph represents another layer of this kind of management; as we’ll see, Python names will refer to objects that connect to more granular and managed TensorFlow graph operations.

When you enter a Python expression, for example at an interactive interpreter or Read Evaluate Print Loop (REPL), whatever is read is almost always evaluated right away. Python is eager to do what you tell it. So, if I tell Python to foo.append(bar), it appends right away, even if I never use foo again.

A lazier alternative would be to just remember that I said foo.append(bar), and if I ever evaluate foo at some point in the future, Python could do the append then. This would be closer to how TensorFlow behaves, where defining relationships is entirely separate from evaluating what the results are.

TensorFlow separates the definition of computations from their execution even further by having them happen in separate places: a graph defines the operations, but the operations only happen within a session. Graphs and sessions are created independently. A graph is like a blueprint, and a session is like a construction site.

Back to our plain Python example, recall that foo and bar refer to the same list. By appending bar into foo, we've put a list inside itself. You could think of this structure as a graph with one node, pointing to itself. Nesting lists is one way to represent a graph structure like a TensorFlow computation graph.

>>> foo.append(bar)
>>> foo
## [[...]]

Real TensorFlow graphs will be more interesting than this!

The simplest TensorFlow graph

To start getting our hands dirty, let’s create the simplest TensorFlow graph we can, from the ground up. TensorFlow is admirably easier to install than some other frameworks. The examples here work with either Python 2.7 or 3.3+, and the TensorFlow version used is 0.8.

>>> import tensorflow as tf

At this point TensorFlow has already started managing a lot of state for us. There's already an implicit default graph, for example. Internally, the default graph lives in the _default_graph_stack, but we don't have access to that directly. We use tf.get_default_graph().

>>> graph = tf.get_default_graph()

The nodes of the TensorFlow graph are called “operations,” or “ops.” We can see what operations are in the graph with graph.get_operations().

>>> graph.get_operations()
## []

Currently, there isn't anything in the graph. We’ll need to put everything we want TensorFlow to compute into that graph. Let's start with a simple constant input value of one.

>>> input_value = tf.constant(1.0)

That constant now lives as a node, an operation, in the graph. The Python variable name input_value refers indirectly to that operation, but we can also find the operation in the default graph.

>>> operations = graph.get_operations()
>>> operations
## [<tensorflow.python.framework.ops.Operation at 0x1185005d0>]
>>> operations[0].node_def
## name: "Const"
## op: "Const"
## attr {
##   key: "dtype"
##   value {
##     type: DT_FLOAT
##   }
## }
## attr {
##   key: "value"
##   value {
##     tensor {
##       dtype: DT_FLOAT
##       tensor_shape {
##       }
##       float_val: 1.0
##     }
##   }
## }

TensorFlow uses protocol buffers internally. (Protocol buffers are sort of like a Google-strength JSON.) Printing the node_def for the constant operation above shows what's in TensorFlow's protocol buffer representation for the number one.

People new to TensorFlow sometimes wonder why there's all this fuss about making “TensorFlow versions” of things. Why can't we just use a normal Python variable without also defining a TensorFlow object? One of the TensorFlow tutorials has an explanation:

To do efficient numerical computing in Python, we typically use libraries like NumPy that do expensive operations such as matrix multiplication outside Python, using highly efficient code implemented in another language. Unfortunately, there can still be a lot of overhead from switching back to Python every operation. This overhead is especially bad if you want to run computations on GPUs or in a distributed manner, where there can be a high cost to transferring data.

TensorFlow also does its heavy lifting outside Python, but it takes things a step further to avoid this overhead. Instead of running a single expensive operation independently from Python, TensorFlow lets us describe a graph of interacting operations that run entirely outside Python. This approach is similar to that used in Theano or Torch.

TensorFlow can do a lot of great things, but it can only work with what's been explicitly given to it. This is true even for a single constant.

If we inspect our input_value, we see it is a constant 32-bit float tensor of no dimension: just one number.

>>> input_value
## <tf.Tensor 'Const:0' shape=() dtype=float32>

Note that this doesn't tell us what that number is. To evaluate input_value and get a numerical value out, we need to create a “session” where graph operations can be evaluated and then explicitly ask to evaluate or “run” input_value. (The session picks up the default graph by default.)

>>> sess = tf.Session()
>>> sess.run(input_value)
## 1.0

It may feel a little strange to “run” a constant. But it isn't so different from evaluating an expression as usual in Python; it's just that TensorFlow is managing its own space of things—the computational graph—and it has its own method of evaluation.

The simplest TensorFlow neuron

Now that we have a session with a simple graph, let's build a neuron with just one parameter, or weight. Often, even simple neurons also have a bias term and a non-identity activation function, but we'll leave these out.

The neuron's weight isn't going to be constant; we expect it to change in order to learn based on the “true” input and output we use for training. The weight will be a TensorFlow variable. We'll give that variable a starting value of 0.8.

>>> weight = tf.Variable(0.8)

You might expect that adding a variable would add one operation to the graph, but in fact that one line adds four operations. We can check all the operation names:

>>> for op in graph.get_operations(): print(op.name)
## Const
## Variable/initial_value
## Variable
## Variable/Assign
## Variable/read

We won't want to follow every operation individually for long, but it will be nice to see at least one that feels like a real computation.

>>> output_value = weight * input_value

Now there are six operations in the graph, and the last one is that multiplication.

>>> op = graph.get_operations()[-1]
>>> op.name
## 'mul'
>>> for op_input in op.inputs: print(op_input)
## Tensor("Variable/read:0", shape=(), dtype=float32)
## Tensor("Const:0", shape=(), dtype=float32)

This shows how the multiplication operation tracks where its inputs come from: they come from other operations in the graph. To understand a whole graph, following references this way quickly becomes tedious for humans. TensorBoard graph visualization is designed to help.

How do we find out what the product is? We have to “run” the output_value operation. But that operation depends on a variable: weight. We told TensorFlow that the initial value of weight should be 0.8, but the value hasn't yet been set in the current session. The tf.initialize_all_variables() function generates an operation which will initialize all our variables (in this case just one) and then we can run that operation.

>>> init = tf.initialize_all_variables()
>>> sess.run(init)

The result of tf.initialize_all_variables() will include initializers for all the variables currently in the graph, so if you add more variables you'll want to use tf.initialize_all_variables() again; a stale init wouldn't include the new variables.

Now we're ready to run the output_value operation.

>>> sess.run(output_value)
## 0.80000001

Recall that's 0.8 * 1.0 with 32-bit floats, and 32-bit floats have a hard time with 0.8; 0.80000001 is as close as they can get.

See your graph in TensorBoard

Up to this point, the graph has been simple, but it would already be nice to see it represented in a diagram. We'll use TensorBoard to generate that diagram. TensorBoard reads the name field that is stored inside each operation (quite distinct from Python variable names). We can use these TensorFlow names and switch to more conventional Python variable names. Using tf.mul here is equivalent to our earlier use of just * for multiplication, but it lets us set the name for the operation.

>>> x = tf.constant(1.0, name='input')
>>> w = tf.Variable(0.8, name='weight')
>>> y = tf.mul(w, x, name='output')

TensorBoard works by looking at a directory of output created from TensorFlow sessions. We can write this output with a SummaryWriter, and if we do nothing aside from creating one with a graph, it will just write out that graph.

The first argument when creating the SummaryWriter is an output directory name, which will be created if it doesn't exist.

>>> summary_writer = tf.train.SummaryWriter('log_simple_graph', sess.graph)

Now, at the command line, we can start up TensorBoard.

$ tensorboard --logdir=log_simple_graph

TensorBoard runs as a local web app, on port 6006. (“6006” is “goog” upside-down.) If you go in a browser to localhost:6006/#graphs you should see a diagram of the graph you created in TensorFlow, which looks something like Figure 2.

A TensorBoard visualization of the simplest TensorFlow neuron
Figure 2. A TensorBoard visualization of the simplest TensorFlow neuron.

Making the neuron learn

Now that we’ve built our neuron, how does it learn? We set up an input value of 1.0. Let's say the correct output value is zero. That is, we have a very simple “training set” of just one example with one feature, which has the value one, and one label, which is zero. We want the neuron to learn the function taking one to zero.

Currently, the system takes the input one and returns 0.8, which is not correct. We need a way to measure how wrong the system is. We'll call that measure of wrongness the “loss” and give our system the goal of minimizing the loss. If the loss can be negative, then minimizing it could be silly, so let's make the loss the square of the difference between the current output and the desired output.

>>> y_ = tf.constant(0.0)
>>> loss = (y - y_)**2

So far, nothing in the graph does any learning. For that, we need an optimizer. We'll use a gradient descent optimizer so that we can update the weight based on the derivative of the loss. The optimizer takes a learning rate to moderate the size of the updates, which we'll set at 0.025.

>>> optim = tf.train.GradientDescentOptimizer(learning_rate=0.025)

The optimizer is remarkably clever. It can automatically work out and apply the appropriate gradients through a whole network, carrying out the backward step for learning.

Let's see what the gradient looks like for our simple example.

>>> grads_and_vars = optim.compute_gradients(loss)
>>> sess.run(tf.initialize_all_variables())
>>> sess.run(grads_and_vars[1][0])
## 1.6

Why is the value of the gradient 1.6? Our loss is error squared, and the derivative of that is two times the error. Currently the system says 0.8 instead of 0, so the error is 0.8, and two times 0.8 is 1.6. It's working!

For more complex systems, it will be very nice indeed that TensorFlow calculates and then applies these gradients for us automatically.

Let's apply the gradient, finishing the backpropagation.

>>> sess.run(optim.apply_gradients(grads_and_vars))
>>> sess.run(w)
## 0.75999999  # about 0.76

The weight decreased by 0.04 because the optimizer subtracted the gradient times the learning rate, 1.6 * 0.025, pushing the weight in the right direction.

Instead of hand-holding the optimizer like this, we can make one operation that calculates and applies the gradients: the train_step.

>>> train_step = tf.train.GradientDescentOptimizer(0.025).minimize(loss)
>>> for i in range(100):
>>>     sess.run(train_step)
>>> 
>>> sess.run(y)
## 0.0044996012

Running the training step many times, the weight and the output value are now very close to zero. The neuron has learned!

Training diagnostics in TensorBoard

We may be interested in what's happening during training. Say we want to follow what our system is predicting at every training step. We could print from inside the training loop.

>>> sess.run(tf.initialize_all_variables())
>>> for i in range(100):
>>>     print('before step {}, y is {}'.format(i, sess.run(y)))
>>>     sess.run(train_step)
>>> 
## before step 0, y is 0.800000011921
## before step 1, y is 0.759999990463
## ...
## before step 98, y is 0.00524811353534
## before step 99, y is 0.00498570781201

This works, but there are some problems. It's hard to understand a list of numbers. A plot would be better. And even with only one value to monitor, there's too much output to read. We're likely to want to monitor many things. It would be nice to record everything in some organized way.

Luckily, the same system that we used earlier to visualize the graph also has just the mechanisms we need.

We instrument the computation graph by adding operations that summarize its state. Here, we'll create an operation that reports the current value of y, the neuron's current output.

>>> summary_y = tf.scalar_summary('output', y)

When you run a summary operation, it returns a string of protocol buffer text that can be written to a log directory with a SummaryWriter.

>>> summary_writer = tf.train.SummaryWriter('log_simple_stats')
>>> sess.run(tf.initialize_all_variables())
>>> for i in range(100):
>>>     summary_str = sess.run(summary_y)
>>>     summary_writer.add_summary(summary_str, i)
>>>     sess.run(train_step)
>>> 

Now after running tensorboard --logdir=log_simple_stats, you get an interactive plot at localhost:6006/#events (Figure 3).

A TensorBoard visualization of a neuron’s output against training iteration number
Figure 3. A TensorBoard visualization of a neuron’s output against training iteration number.

Flowing onward

Here's a final version of the code. It's fairly minimal, with every part showing useful (and understandable) TensorFlow functionality.

import tensorflow as tf

x = tf.constant(1.0, name='input')
w = tf.Variable(0.8, name='weight')
y = tf.mul(w, x, name='output')
y_ = tf.constant(0.0, name='correct_value')
loss = tf.pow(y - y_, 2, name='loss')
train_step = tf.train.GradientDescentOptimizer(0.025).minimize(loss)

for value in [x, w, y, y_, loss]:
    tf.scalar_summary(value.op.name, value)

summaries = tf.merge_all_summaries()

sess = tf.Session()
summary_writer = tf.train.SummaryWriter('log_simple_stats', sess.graph)

sess.run(tf.initialize_all_variables())
for i in range(100):
    summary_writer.add_summary(sess.run(summaries), i)
    sess.run(train_step)

The example we just ran through is even simpler than the ones that inspired it in Michael Nielsen's Neural Networks and Deep Learning. For myself, seeing details like these helps with understanding and building more complex systems that use and extend from simple building blocks. Part of the beauty of TensorFlow is how flexibly you can build complex systems from simpler components.

If you want to continue experimenting with TensorFlow, it might be fun to start making more interesting neurons, perhaps with different activation functions. You could train with more interesting data. You could add more neurons. You could add more layers. You could dive into more complex pre-built models, or spend more time with TensorFlow's own tutorials and how-to guides. Go for it!


Article image: Braided river. (source: National Park Service, Alaska Region on Flickr).

Aaron Schumacher

Aaron Schumacher is a data scientist and software engineer for Deep Learning Analytics. He has taught with Python and R for General Assembly and the Metis data science bootcamp. Aaron has also worked with data at Booz Allen Hamilton, New York University, and the New York City Department of Education. Aaron’s career-best breakdancing result was advancing to the semi-finals of the R16 Korea 2009 individual footwork battle. He is honored to now be the least significant contributor to TensorFlow 0.9. 



+ Recent posts