Comparison of Apache Stream Processing Frameworks: Part 1

Posted by Petr Zapletal on Tue, Feb 2, 2016

 

A couple of months ago we discussed the reasons behind the increasing demand for distributed stream processing. I also stated that there is a number of frameworks available to address it. Now it’s time to have a look at them and discuss their similarities and differences as well as, in my opinion, their recommended use cases.

As you probably suspect, distributed stream processing is the continuous processing, aggregation and analysis of unbounded data. It is a general computation model, like MapReduce, but we expect latencies in milliseconds or seconds. These systems are usually modeled as Directed Acyclic Graphs (DAGs).

A DAG is a graphical representation of a chain of tasks, and we use it to describe the topology of a streaming job. I’ll borrow a little terminology from Akka Streams. As you can see in the picture below, data flows through a chain of processors from sources to sinks, and this chain represents the streaming job. And speaking about Akka Streams, I think it’s very important to emphasize the word distributed: even local solutions can create and run a DAG, but we are going to focus only on solutions running on multiple machines.

[Figure: topology of a streaming job as a DAG, with data flowing from sources through processors to sinks]

Points of Interest

When choosing between different systems, there are a couple of points we should take care of. Let’s start with the Runtime and Programming Model. The programming model provided by a platform determines a lot of its features, and it should be sufficient to handle all possible use cases of the application. This is a really crucial topic and I’ll come back to it very soon.

The Functional Primitives exposed by a processing platform should provide rich functionality at the individual message level, like map or filter, which is pretty easy to implement even at scale. But the platform should also provide functionality across messages, like aggregations, and across streams, like joins, which are much harder to scale.

State Management - most applications have stateful processing logic that requires maintaining state. The platform should allow us to maintain, access and update this state information.

For Message Delivery Guarantees, we have a couple of classes: at most once, at least once and exactly once. At-most-once delivery means that for each message handed to the mechanism, the message is delivered zero or one times, so messages may be lost. At-least-once delivery means that for each message handed to the mechanism, potentially multiple delivery attempts are made, such that at least one succeeds; messages may be duplicated but not lost. And finally, exactly-once delivery means that for each message handed to the mechanism exactly one delivery is made to the recipient; the message can neither be lost nor duplicated. So this is another important thing to consider.
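
To make the at-least-once case a bit more concrete, here is a minimal, framework-agnostic Java sketch of the common trick of deduplicating redelivered messages by id so that processing becomes effectively once; the message id, payload and handle method are hypothetical placeholders, not part of any framework discussed below.

import java.util.HashSet;
import java.util.Set;

// Minimal sketch: at-least-once delivery turned into effectively-once processing
// by remembering which message ids have already been handled.
public class DedupingConsumer {

  private final Set<String> processedIds = new HashSet<>();

  // Called once per delivery; redelivered duplicates are simply dropped.
  public void onMessage(String messageId, String payload) {
    if (!processedIds.add(messageId)) {
      return; // already processed
    }
    handle(payload);
  }

  private void handle(String payload) {
    System.out.println("processing " + payload);
  }
}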

Failures can and will happen at various levels - for example network partitions, disk failures or nodes going down. The platform should be able to recover from all such failures and resume from its last successful state without harming the result.

And then we have more performance-related requirements like Latency, Throughput and Scalability, which are extremely important in streaming applications.

We should also take care of Maturity and Adoption Level; this information can give us a clue about potential support, available libraries or even Stack Overflow answers. It may help us a lot when choosing the right platform.

And last but not least, the Ease of Development and Ease of Operability. It is great to have a super fancy system which covers all our use cases, but if we cannot write a program for it or we cannot deploy it, we are done anyway.

Runtime and Programming Model

The Runtime and Programming Model is probably the most important trait of a system, because it defines its expressiveness, possible operations and future limitations. Therefore it defines the system's capabilities and its use cases.

There are two distinct approaches to implementing a streaming system. The first one is called native streaming: all incoming records, or events if you prefer, are processed as they arrive, one by one.


[Figure: native streaming - records are processed one by one as they arrive]

The second approach is called micro-batching. Short batches are created from the incoming records and go through the system. These batches are created according to a pre-defined time constant, typically every couple of seconds.

[Figure: micro-batching - incoming records are grouped into short batches]


Both approaches have inherent advantages and disadvantages. Let’s start with native streaming. The great advantage of native streaming is its expressiveness. Because it takes the stream as it is, it is not limited by any unnatural abstraction over it. Also, as records are processed immediately upon arrival, the achievable latencies of these systems are always better than those of their micro-batching companions. Apart from that, stateful operations are much easier to implement, as you’ll see later in this post. On the other hand, native streaming systems usually have lower throughput, and fault tolerance is much more expensive as the system has to take care of (roughly, persist and replay) every single record. Load balancing is also an issue. For example, let’s say we have data partitioned by a key and we want to process it. If the processing of some key is more resource intensive for any reason, its partition quickly becomes the job’s bottleneck.

Secondly, micro-batching. Splitting the stream into micro-batches inevitably reduces the system's expressiveness. Some operations, especially state management, joins and splits, are much harder to implement because the system has to manipulate the whole batch. Moreover, the batch interval connects two things which should never be connected: an infrastructure property and business logic. On the other hand, fault tolerance and load balancing are much simpler, as the system just sends every batch to a worker node, and if something goes wrong it simply uses a different one. Lastly, it is good to remark that a micro-batching system can be built atop native streaming quite easily.
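
As a toy illustration of that last remark, here is a minimal, framework-agnostic Java sketch of micro-batching layered on top of a record-at-a-time pipeline; the flush interval and the downstream consumer are illustrative placeholders.

import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.Consumer;

// Minimal sketch: records arrive one by one (as in native streaming) and are
// buffered, then flushed downstream as a micro-batch on a fixed interval.
public class MicroBatcher<T> {

  private final List<T> buffer = new ArrayList<>();

  public MicroBatcher(long intervalMillis, Consumer<List<T>> downstream) {
    new Timer(true).scheduleAtFixedRate(new TimerTask() {
      @Override
      public void run() {
        List<T> batch;
        synchronized (buffer) {
          batch = new ArrayList<>(buffer);
          buffer.clear();
        }
        if (!batch.isEmpty()) {
          downstream.accept(batch); // hand the finished micro-batch downstream
        }
      }
    }, intervalMillis, intervalMillis);
  }

  // Called for every incoming record, exactly as a native streaming operator would be.
  public void onRecord(T record) {
    synchronized (buffer) {
      buffer.add(record);
    }
  }
}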

Programming models can be classified as compositional and declarative. The compositional approach provides basic building blocks, like sources or operators, and they must be tied together to create the expected topology. New components can usually be defined by implementing some kind of interface. In contrast, operators in a declarative API are defined as higher-order functions. This allows us to write functional code with abstract types and all its fancy stuff, and the system creates and optimizes the topology itself. Declarative APIs also usually provide more advanced operations like windowing or state management out of the box. We are going to have a look at some code samples very soon.

Apache Streaming Landscape

There is a number of diverse frameworks available and it is practically impossible to cover all of them. So I’m forced to limit the scope somehow, and I’ll go for the popular streaming solutions from the Apache landscape which also provide a Scala API. Therefore we are going to focus on Apache Storm and its sibling Trident, and on the streaming module of the very popular Spark. We’re also going to talk about Samza, the streaming system behind LinkedIn, and finally we’re going to discuss the promising Apache project Flink. I believe this is a great selection, because even though all of them are streaming systems, they approach various challenges very differently. Unfortunately I won’t talk about proprietary systems like Google MillWheel or Amazon Kinesis, and we are also going to skip interesting but still sparsely adopted systems like Intel GearPump or Apache Apex. That may be for next time.

[Figure: the Apache streaming landscape covered in this post - Storm, Trident, Spark Streaming, Samza and Flink]
Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. Later it was acquired and open-sourced by Twitter, and it became an Apache top-level project in 2014. Without any doubt, Storm was a pioneer in large-scale stream processing and became the de facto industry standard. Storm is a native streaming system and provides a low-level API. Storm uses Thrift for topology definition and it also implements the Storm multi-language protocol, which basically allows us to implement our solutions in a large number of languages, which is pretty unique, and Scala is of course one of them.

Trident is a higher-level micro-batching system built atop Storm. It simplifies the topology building process and also adds higher-level operations like windowing, aggregations or state management which are not natively supported in Storm. Trident provides exactly-once delivery, in contrast to Storm's at-least-once guarantee. Trident has Java, Clojure and Scala APIs.

As we all know, Spark is a very popular batch processing framework these days, with a couple of built-in libraries like Spark SQL or MLlib and, of course, Spark Streaming. Spark’s runtime is built for batch processing and therefore Spark Streaming, which was added a little bit later, does micro-batching. The stream of input data is ingested by receivers which create micro-batches, and these micro-batches are processed in a similar way to other Spark jobs. Spark Streaming provides a high-level declarative API in Scala, Java and Python.

Samza was originally developed at LinkedIn as a proprietary streaming solution, and together with Kafka, another great LinkedIn contribution to our community, it became a key part of their infrastructure. As you’re going to see a little bit later, Samza builds heavily on Kafka’s log-based philosophy and the two integrate very well. Samza provides a compositional API and, of course, Scala is supported.

And last but not least, Flink. Flink is a pretty old project, with its origins in 2008, but right now it is getting quite a lot of attention. Flink is a native streaming system and provides a high-level API. Like Spark, Flink also provides an API for batch processing, but there is a fundamental distinction between the two: Flink handles batch as a special case of streaming. Everything is a stream, and this is definitely a better abstraction, because this is how the world really looks.

That was a quick introduction to the systems and, as you can see in the table below, they do have pretty different traits.

[Figure: comparison table of the frameworks' basic traits]

Counting Words

Word count is something like the hello world of stream processing, and it nicely shows the main differences between our frameworks. Let’s start with Storm; please note the example has been simplified significantly.

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

...

Map<String, Integer> counts = new HashMap<String, Integer>();

public void execute(Tuple tuple, BasicOutputCollector collector) {
  String word = tuple.getString(0);
  Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
  counts.put(word, count);
  collector.emit(new Values(word, count));
}

First, let’s have a look at the topology definition. As you can see at line 2, we have to define a spout, or a source if you prefer. Then there is a bolt, a processing component, which splits the text into words. Another bolt for the actual word count calculation is defined at line 4. Also have a look at the magic numbers 5, 8 and 12: these are the parallelism hints and they define how many independent threads across the cluster will be used to execute each component. As you can see, everything is very manual and low-level. Now, let’s have a look (lines 8 - 15) at how the actual WordCount bolt is implemented. Since Storm does not have built-in support for managed state, I have defined a local state, which is far from ideal but fine as a sample. Apart from that it is not very interesting, so let’s move on and have a look at Trident.

As I mentioned before, Trident is Storm’s micro-batching extension and, apart from many other goodies, it provides state management, which is pretty useful when implementing word count.

public static StormTopology buildTopology(LocalDRPC drpc) {
  FixedBatchSpout spout = ...

  TridentTopology topology = new TridentTopology();
  TridentState wordCounts =
    topology.newStream("spout1", spout)
      .each(new Fields("sentence"), new Split(), new Fields("word"))
      .groupBy(new Fields("word"))
      .persistentAggregate(new MemoryMapState.Factory(),
                           new Count(), new Fields("count"));

  ...

}

As you can see, I could use higher-level operations like each (line 7) and groupBy (line 8), so it’s a little bit better. And I was also able to use Trident’s managed state for storing the counts at line 9.


Now it is time for the pretty declarative API provided by Apache Spark. Also keep in mind that, in contrast to the previous examples, which were significantly simplified, this is nearly all the code you need to run this simple streaming word count.

val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))

val text = ...

val counts = text.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()
ssc.awaitTermination()

Every Spark Streaming job requires a StreamingContext, which is basically the entry point to the streaming functionality. The StreamingContext takes a configuration which is, as you can see at line 1, very limited in our case, but more importantly, it defines the batch interval (line 2), which is set to 1 second. Now you can see the whole word count computation (lines 6 - 8); quite a difference, isn’t it? That’s the reason why Spark is sometimes called a distributed Scala. As you can see, it’s quite standard functional code and Spark takes care of the topology definition and its distributed execution. And at line 12 comes the last part of every Spark Streaming job: starting the computation. Just keep in mind that once started, a job cannot be modified.

Now let’s have a look at Apache Samza, another representative of a compositional API.

class WordCountTask extends StreamTask {

  override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector,
                       coordinator: TaskCoordinator) {

    val text = envelope.getMessage.asInstanceOf[String]

    val counts = text.split(" ").foldLeft(Map.empty[String, Int]) {
      (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
    }

    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts))

  }
}

The topology is defined in Samza’s properties file, but for the sake of clarity you won’t find it here. What matters for us is that the task has defined input and output channels and the communication goes through Kafka topics. In our case the whole topology is the WordCountTask, which does all the work. In Samza, components are defined by implementing particular interfaces; in this case it’s a StreamTask, and I’ve just overridden the process method at line 3. Its parameter list contains everything needed to connect with the rest of the system. The computation itself at lines 8 - 10 is just simple Scala.

Now let’s have a look at Flink. As you can see, the API is pretty similar to Spark Streaming’s, but notice that we are not setting any batch interval.

val env = ExecutionEnvironment.getExecutionEnvironment

val text = env.fromElements(...)
val counts = text.flatMap ( _.split(" ") )
  .map ( (_, 1) )
  .groupBy(0)
  .sum(1)

counts.print()

env.execute("wordcount")

The computation itself is pretty straightforward. As you can see, it’s just a couple of functional calls and Flink takes care of the distributed computation.

Conclusion

That’s all folks for today. We went through the necessary theory and also introduced the popular streaming frameworks from the Apache landscape. Next time, we’re going to dig a little bit deeper and go through the rest of the points of interest. I hope you found it interesting, and stay tuned. Also, if you have any questions, don’t hesitate to contact me, as I’m always happy to discuss the topic.


R - for Advanced Data Analysis and Statistical Programming

1. What is 'R'?

An open-source 'statistical analysis' language. The precise project name is 'GNU S'.

It is said to be used to get from a mountain of data to more meaningful data through statistical analysis such as estimated coefficients(?), standard errors(?), residuals(?) and so on.

It was influenced by S, a statistical language developed at AT&T.

'S' presumably comes from Statistics, and R... perhaps because R comes just before S in the alphabet(?)

There is also 'S-PLUS', a commercial version of 'S' with a GUI added, which is the de facto standard among professional statisticians.

It requires a good deal of 'statistical knowledge' (the hurdle is fairly high).

Project homepage

The concept of the language: in the big-data era, leave the standard role of data access to me (R), and do the data processing that follows in whatever language you have been using.

 

2. Characteristics of 'R'

Package updates come faster than for the commercial analysis tools SAS or SPSS, and connectivity to a wide variety of data sources is a strength.

The data set is saved on the system between sessions, so there is no need to reload the data every time.

It seems to pair well with 'parallel computing'.

 

3. A Quick Taste of 'R'

Installation

Go to CRAN (the Comprehensive R Archive Network), click the link for your OS and follow the instructions.

I downloaded 'Download R for Windows' -> base (the base distribution binaries) -> R-3.2.2 for Windows (32/64 bit) -> R-3.2.2-win.exe.

Run R-3.2.2-win.exe and install in x64 mode.

Running

Launch the desktop shortcut "R x64 3.2.2".

When the R Console appears, enter a command at the prompt:

demo()

DeepMind moves to TensorFlow

Today Google announced on its blog that DeepMind (Deepmind.com) is switching its research and development framework from Torch to TensorFlow.

Torch has been used by DeepMind and Facebook, both of which contributed a considerable amount of code, but with DeepMind no longer using it, the Torch camp is left with Facebook and Twitter. In a way this may have been an expected step: as Jeff Dean mentioned around the Go match between Lee Sedol and AlphaGo, TensorFlow was already being used in parts of AlphaGo.

A few days ago, in the Founder’s letter sent by Google CEO Sundar Pichai, he said that the world will move from mobile first to AI first.

“We will move from mobile first to an AI first world.”

It is only natural for Google to want TensorFlow, the first bridgehead of this vision, to be as widely used as possible. DeepMind has spent the past few months porting its existing research to TensorFlow and testing it, and from now on all research carried out at DeepMind is said to use TensorFlow.

Of course, some cynics on the internet sneer that this is just PR for TensorFlow.

First Contact with TensorFlow


‘First Contact with TensorFlow’ book cover. Source: http://www.jorditorres.org

Professor Jordi Torres of the Polytechnic University of Catalonia in Spain has released ‘First Contact with TensorFlow’, a book that introduces TensorFlow. As stated in its opening pages, rather than focusing on the theoretical background of machine learning the way books like ‘Fundamentals of Deep Learning’ do, it was written as a primer for people encountering TensorFlow for the first time.

The book is roughly 140 pages long; it can be bought as a PDF or in print, but the whole book has also been made freely readable online. In addition, the slides Professor Jordi Torres used in his lectures can be downloaded here.

In accordance with the original book’s license (CC BY-NC-SA 3.0), I will translate and share the entire contents.

Foreword, Preface, A Practical Approach

1. TensorFlow Basics

2. Linear Regression in TensorFlow

3. Clustering in TensorFlow

4. Single-Layer Neural Network

5. Multi-Layer Neural Network

6. Parallelism

Closing Remarks

 

Gradientzoo - Sharing Machine Learning Models


Among the several ways of saving and restoring a trained machine learning model in Python (that is, serialization and deserialization), pickle or dill are the usual choices. Lately JSON also seems to be used quite a lot. This topic would be worth a proper write-up of its own at some point.

Gradientzoo is an open-source project that has released, as open source, not only a Python library for serializing trained models but also the site for sharing them. The idea is that you can download a model someone else has built and use it as a starting point for your own modeling. Gradientzoo currently supports Keras, Lasagne and TensorFlow.

Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio

Posted by Brandon Rohrer. Updated: 02-10-2016

The Microsoft Azure Machine Learning algorithm cheat sheet helps you choose the right machine learning algorithm for your predictive analytics solution from the Microsoft Azure Machine Learning library of algorithms.

Azure Machine Learning Studio ships with a large number of machine learning algorithms for predictive analytics solutions. These algorithms fall into the general machine learning categories of regression, classification, clustering and anomaly detection, and each one solves a different type of machine learning problem.

Note:

For more detail on how to use this cheat sheet, see the article on how to choose algorithms for Microsoft Azure Machine Learning.

Download the machine learning algorithm cheat sheet

Download the machine learning algorithm cheat sheet and get help finding the right machine learning algorithm for your solution. To keep it close at hand, you can print the cheat sheet at tabloid size (11 x 17 inches).

Download the cheat sheet from the Microsoft Azure Machine Learning algorithm reference.

Machine learning algorithm cheat sheet: learn how to choose a machine learning algorithm.

More help with algorithms


 

5 Ways Machine Learning Is Reshaping Our World

Who here remembers taking computer programming in school? Whether you learned programming by punching holes in a never ending series of cards, or by writing simple DOS or other computer language commands, the fact remained that computers needed an incredibly precise set of instructions to accomplish a task.

The more complicated the task, the more complicated your instructions had to be. 

Machine learning is inherently different. Rather than telling a computer exactly how to solve a problem, the programmer instead tells it how to go about learning to solve the problem for itself.

Machine learning is really just the very advanced application of statistics to learning to identify patterns in data and then make predictions from those patterns.  This website has a gorgeous visualized walkthrough of how machine learning works, if you are interested. 

Machine learning started as far back as the 1950s, when computer scientists figured out how to teach a computer to play checkers. From there, as computational power has increased, so has the complexity of the patterns a computer can recognize, and therefore the predictions it can make and problems it can solve. 

1. Machines can see. 

Because computers are able to look at a large data set and use machine learning algorithms to classify images, it’s relatively easy to write an algorithm that can recognize characteristics in a group of images and categorize them appropriately.  


For example, it takes four highly trained medical pathologists to review a breast cancer scan, decide what they’re seeing, and then make a decision about a diagnosis. Now, an algorithm has been written that can detect the cancer more accurately than the best pathologists, freeing the doctors up to make the treatment decisions more quickly and accurately. 


The fact that computers can see is also how we get driverless cars. A computer that can recognize the difference between a tree and a pedestrian, a stop sign and a yield sign, and a road and a field is the key to unlocking the promise of the driverless car. And this innovation alone could revolutionize many different business models, from supply chain and delivery to personal transportation.

2. Machines can read.


Google long ago proved the value of a program that can read text. Their search engine algorithm revolutionized Internet search, and continues to do so with every advancement. 


But it’s one thing to be able to say whether or not a document contains a certain word or phrase; it’s something else entirely to understand context.


New algorithms are being developed that can determine whether a sentence is positive or negative, context within a document, and more.  


In fact, using Google’s street view and its ability to read street numbers, the company was able to map all the addresses in France in just a few hours — a feat that would have taken many talented mapmakers weeks, if not months in the past.

3. Machines can listen.


One of the biggest innovations in recent years is probably in your pocket right now. Siri, Cortana, and Google Now represent a huge leap in machine understanding of human speech.  


How many times have you been frustrated trying to get a computer at the other end of a telephone help line to understand you? (I’m sorry, I didn’t catch that… Please repeat your account number…)


Now, virtual personal assistants can recognize a dizzying and ever growing array of commands and respond in kind.  More importantly, however, Google and its competitors are moving towards keying their search algorithms to understand natural speech as well, in anticipation of more and more voice search.


In the old days, you would have to type something like, coffee shop + London + a postal code to find a listing of coffee shops in an area. Today, you can type — or speak — a natural sentence like, “Where’s the nearest coffee shop that’s open right now?” and Google understands not only what you mean, but where you are, what time it is, and how to respond. 

4. Machines can talk. 

Yes, Siri can tell you a knock-knock joke, but that’s not really the kind of talking I’m talking about. 


Computer language translations are something of a running joke, and for good reason. There are so many nuances to language — slang, idioms, cultural meaning — that simply running a piece of text through translation software can produce some amusing and ultimately incorrect results.


But new machine learning algorithms are making more accurate, real-time translations possible.


Late last year, Microsoft unveiled real time translations for Skype video conferencing in English and Spanish, with plans to support more than 40 languages. 


While the advance in the translation ability is impressive, it’s the combination of listening to the user speak, understanding the words, and translating them all in real time that’s the impressive breakthrough. And because the program is machine learning-based, it will only get better with practice. 

5. Machines can write. 
While it may take a million monkeys typing to produce the works of Shakespeare, computers are getting a lot better at creative writing.
 

In one project, a computer was taught to write photo captions describing the pictures. In its first iteration, human readers thought the computer-generated description was better than the human-generated one about one out of four times.


This has broad implications for all kinds of data entry and classification tasks that previously required human intervention. If a computer can recognize something — an image, a document, a file, etc. — and describe it accurately, there could be many uses for such automation. 

Another example I have covered before is how, during the 2015 Wimbledon tennis championships, machine learning algorithms were used to automatically turn match statistics and sensor data collected during the matches into reports which read as if they were written by sports journalists.

These skills are beginning to show that computers can now boldly go into realms that were once considered solidly the domain of humans. While the technology still isn't perfect in many cases, the very concept of machine learning means that machines can continuously and tirelessly improve, so they will only get better.

 

OAUTH2 Authentication with ADFS 3.0

By , on the 9th Mar 2015

A quick run through of the steps involved in integrating a Node.js client with Active Directory Federation Services for authentication using OAUTH2.

I recently had the dubious pleasure of proving the feasibility of authenticating apps against ADFS using its OAUTH2 endpoints. In short, whilst it is possible to securely prove identity and other claims, I’m left thinking there must be a better way.

Configuring ADFS for a new OAUTH2 client

I started with an Azure Windows Server 2012 R2 VM pre-configured with an ADFS instance integrated with existing SAML 2.0 clients (or Relying Parties in identity-speak). As I was only interested in proving the OAUTH2 functionality I could piggy-back on one of the existing Trusts. If you need to set one up, this guide might be useful.

To register a new client, from an Administrative PowerShell prompt, run the following -

Add-ADFSClient -Name "OAUTH2 Test Client" -ClientId "some-uid-or-other" -RedirectUri "http://localhost:3000/getAToken"

This registers a client called OAUTH2 Test Client which will identify itself as some-uid-or-other and provide http://localhost:3000/getAToken as the redirect location when performing the authorization request (A) to the Authorization Server (in this case ADFS).

The Authorization Code Flow

+----------+
| Resource |
|   Owner  |
|          |
+----------+
     ^
     |
    (B)
+----|-----+          Client Identifier      +---------------+
|         -+----(A)-- & Redirection URI ---->|               |
|  User-   |                                 | Authorization |
|  Agent  -+----(B)-- User authenticates --->|     Server    |
|          |                                 |               |
|         -+----(C)-- Authorization Code ---<|               |
+-|----|---+                                 +---------------+
  |    |                                         ^      v
 (A)  (C)                                        |      |
  |    |                                         |      |
  ^    v                                         |      |
+---------+                                      |      |
|         |>---(D)-- Authorization Code ---------'      |
| Client  |          & Redirection URI                  |
|         |                                             |
|         |<---(E)----- Access Token -------------------'
+---------+       (w/ Optional Refresh Token)

The diagram above, taken from the OAUTH2 RFC, represents the Authorization Code Flow, which is the only flow implemented by ADFS 3.0. This is the exchange that's going to end up taking place to grant a user access. It's pretty easy to understand, but it's worth pointing out that:

- Some of the requests and responses go via the User-Agent, i.e. they're HTTP redirects.
- (B) is a double-headed arrow because it represents an arbitrary exchange between the Authorization Server (ADFS) and the Resource Owner (user), e.g. login form -> submit -> wrong password -> submit.

The ADFS 3.0 Authorization Code Flow

The OAUTH2 specification isn't any more specific than that; I'll come back to this. So now you need to know what this translates to on the wire. Luckily, someone's already done a great job of capturing this (in more detail than reproduced below).

A. Authorization Request

GET /adfs/oauth2/authorize?response_type=code&client_id=some-uid-or-other&resource=urn%3Arelying%3Aparty%3Atrust%3Aidentifier&redirect_uri=http%3A%2F%2Flocalhost%3A3000%2FgetAToken HTTP/1.1
Host: your.adfs.server

In this request the app asks the ADFS server (via the user agent) for an authorization code with the client_id and redirect_uri we registered earlier and a resource identifier associated with a Relying Party Trust.

B. The Actual Login Bit…

This is the bit where the sign-in is handed off to the standard ADFS login screen if you don’t have a session or you’re implicitly signed in if you do. Speaking of that login screen, if you were hoping to meaningfully customise it, forget it.

C. Authorization Grant

HTTP 302 Found
Location: http://localhost:3000/getAToken?code=<the code>

D. Access Token Request

POST /adfs/oauth2/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: your.adfs.server
Content-Length: <some number>

grant_type=authorization_code&client_id=some-uid-or-other&redirect_uri=http%3A%2F%2Flocalhost%3A3000%2FgetAToken&code=thecode

E. Access Token Response

HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8

{ 
    "access_token":"<access_token>",
    "token_type":"bearer",
    "expires_in":3600
}

Establishing the user’s identity and other grants

The interesting bit is the <access_token> itself: it is in fact a JSON Web Token (JWT), that is to say a signed representation of the user's identity and other grants. You can either opt to trust it if you retrieved it over a secure channel from the ADFS server, or validate it using the public key of the configured Token Signing Certificate.

Here’s the example Node.js implementation I created, which opts to validate the token. The validation itself is performed by the following snippet -

var fs = require('fs');
var jwt = require('jsonwebtoken'); // assuming the standard 'jsonwebtoken' module is used for verification

var adfsSigningPublicKey = fs.readFileSync('ADFS-Signing.cer'); // Exported from ADFS
function validateAccessToken(accessToken) {
    var payload = null;
    try {
        payload = jwt.verify(accessToken, adfsSigningPublicKey);
    }
    catch(e) {
        console.warn('Dropping unverified accessToken', e);
    }
    return payload;
}

Obtaining refresh tokens from ADFS 3.0

Refresh tokens are available from the ADFS implementation but you need to be aware of the settings detailed in this blog post. To set them you’d run the following from an Administrative PowerShell prompt -

Set-AdfsRelyingPartyTrust -TargetName "RPT Name" -IssueOAuthRefreshTokensTo AllDevices
Set-AdfsRelyingPartyTrust -TargetName "RPT Name" -TokenLifetime 10
Set-AdfsProperties -SSOLifetime 480

This would issue access tokens with a lifetime of 10 minutes and refresh tokens to all clients with a lifetime of 8 hours.

Conclusion

Whilst I did get the OAUTH2 integration to work, I was left a bit underwhelmed by it, especially when compared to the features touted by AzureAD. Encouraged by TechNet library docs, I'd initially considered ADFS to be compatible with AzureAD and tried to get ADAL to work with ADFS. However, I quickly discovered that it expects an OpenID Connect compatible implementation, and that's something ADFS does not currently offer.

It might be my lack of Google-fu, but this became typical of the problems I had finding definitive documentation. I think this is just one of the problems associated with the non-standardised OAUTH2 standard. Another is the vast amount of customisation you must do to make an OAUTH2 library work with a given implementation. OpenID Connect looks like a promising solution to this, but only time will tell if it gains significant adoption.

When things go wrong…

Whilst trying to work out the correct configuration, I ran into a number of errors along the way. Most of them pop out in the ADFS event log but occasionally you might also get a helpful error response to an HTTP request. Here’s a brief summary of some of the ones I encountered and how to fix them -

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthInvalidClientException: MSIS9223: Received invalid OAuth authorization request. The received ‘client_id’ is invalid as no registered client was found with this client identifier. Make sure that the client is registered. Received client_id: ‘…’.

When making the authorize request, you either need to follow the process above for registering a new OAUTH2 client or you’ve mistyped the identifier (n.b. not the name).

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthInvalidResourceException: MSIS9329: Received invalid OAuth authorization request. The ‘resource’ parameter’s value does not correspond to any valid registered relying party. Received resource: ‘…’.

When making the authorize request you’ve either got a typo in your RPT identifier, you need to create an RPT with the given identifier or you need to register it against an existing RPT.

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthAuthorizationMissingResourceException: MSIS9226: Received invalid OAuth authorization request. The ‘resource’ parameter is missing or found empty. The ‘resource’ parameter must be provided specifying the relying party identifier for which the access is requested.

When making the authorize request, you’ve not specified a resource parameter, see previous. I found that most OAUTH2 libraries expect to pass a scope but not a resource parameter.

HTTP error 503

This normally meant I had a typo in the /adfs/oauth2/authorize or /adfs/oauth2/token URLs (don’t forget the 2).

IDINCU Engineering

Building an In-House Data Distribution System with Open Source (feat. Thrift, Zookeeper)

1. Why We Needed It

These days, what service doesn't? Our service, too, accumulates anywhere from hundreds of thousands to millions of log entries a day in various places, and a comparable amount of survey responses and other data is stored in a central RDBMS. Recently, however, as we gradually migrated to a service-oriented architecture (SOA) and the individual distributed servers stopped accessing the central RDBMS directly, we needed a system that could efficiently manage and distribute the enormous variety of logs and data each of them produces. The best-known option was Suro, an open-source project released by Netflix, but we wanted to shape everything, from the transport layer that moves the data down to the sinks that store it, to fit our service exactly (support for non-Java clients, sinks integrated with our systems that actively read metadata, and so on), and building it ourselves did not look like a huge amount of work, so we went ahead and built it. Which is what I wrote, but should be read as "I was told to build it" /ㅁ/

2. Open Source Used

1) Thrift
Since this system has to handle more traffic than anything else we run, we needed a transport-layer library that has been thoroughly proven in terms of speed and stability. We also considered Protocol Buffers and Avro, but in the end, swayed by the Facebook name, we decided on Thrift, which is comparatively the most widely used.

2) Zookeeper
We use it to manage things like the rules describing how data should be distributed and the state of the servers currently running. We could simply have used an RDBMS, but looking up the rules on every request would be far too heavy, and with a cache the rule changes would not take effect immediately; worse, the caches would expire at different times on different servers, which can cause problems. With Zookeeper, on the other hand, rule changes can be pushed out immediately, and there is an option that makes data disappear when the connection is lost, which also makes it very easy to tell whether a server is up.

3) Curator
Zookeeper is already quite simple, but Curator is a library that makes using it even more convenient.

3. Project Layout

The project consists of four sub-projects: thrift-lib, server, coordinator and client. Along with the role of each sub-project, I will also show how we make use of Thrift and Zookeeper.

1) thrift-lib
As the name suggests, this sub-project defines the data types and call specifications used for Thrift-based communication across the whole service. All the other sub-projects import it.

After creating a [filename].thrift file in the format shown above, running

thrift --gen java [filename].thrift

automatically generates a communication library for Java inside the gen-java directory. Using that generated code, communication becomes very straightforward. And since many languages besides Java are supported, it is also very easy to build servers or clients in other languages later.

2) server
The server receives large volumes of data from clients and pushes them onto a queue; a separate thread reads the queue and distributes the data to various sinks according to the configured rules. All kinds of sinks can be built and plugged in: a sink that writes to local files, one that forwards to Graylog, one that sends email, one that pushes to our internally managed storage service, and so on. The Zookeeper mentioned earlier is used as follows.

You can watch a particular node and, through a listener, react immediately when it changes; you can also create nodes in Zookeeper, and with CreateMode.EPHEMERAL the node is deleted automatically when its connection is closed (a rough sketch of both appears below). For reference, while the node watch worked with almost no latency, it took several tens of seconds for a node to disappear after its connection was closed, so we had to add retry handling to prevent exceptions on server restart caused by the old node still being present.
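
Since the original post showed its Zookeeper code only as images, here is a minimal Curator-based sketch of the two ideas just described, watching a node for rule changes and registering a server as an EPHEMERAL node; the connection string, paths and retry settings are illustrative assumptions rather than the actual in-house values.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class ZookeeperUsageSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection string and retry policy.
    CuratorFramework client = CuratorFrameworkFactory
        .newClient("zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Watch the (hypothetical) rules node and react as soon as it changes.
    NodeCache ruleCache = new NodeCache(client, "/distribution/rules");
    ruleCache.getListenable().addListener(() ->
        System.out.println("rules changed: " + new String(ruleCache.getCurrentData().getData())));
    ruleCache.start();

    // Register this server as an EPHEMERAL node; it disappears when the connection closes.
    client.create()
          .creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL)
          .forPath("/servers/server-1", "host:9090".getBytes());
  }
}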

3) coordinator
The coordinator has a very simple job: based on the nodes the servers created in Zookeeper as described above, it tells clients the current state of the servers. We could have used a load balancer such as HAProxy, but in a system where large amounts of data such as logs pour in in real time, the load balancer itself can become a single point of failure. So instead, a client fetches the server list from the coordinator only on its first connection or when something fails, and from then on connects directly to the individual servers. The following is the kind of code that starts a server using the thrift-lib described in 1). As you can see, it is very simple to apply, and besides the Nonblocking server, ThreadPool and Simple servers are also provided.
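
The server-startup snippet referenced above was an image in the original post, so here is a minimal sketch of spinning up a Thrift TNonblockingServer; LogService and LogServiceHandler stand in for whatever service and handler were actually defined in thrift-lib, so those names are assumptions.

import org.apache.thrift.server.TNonblockingServer;
import org.apache.thrift.transport.TNonblockingServerSocket;

public class ServerMain {
  public static void main(String[] args) throws Exception {
    // "LogService" is a placeholder for the service generated from the .thrift file,
    // and LogServiceHandler is the (hypothetical) implementation of its interface.
    LogService.Processor<LogServiceHandler> processor =
        new LogService.Processor<>(new LogServiceHandler());

    TNonblockingServerSocket transport = new TNonblockingServerSocket(9090);
    TNonblockingServer server = new TNonblockingServer(
        new TNonblockingServer.Args(transport).processor(processor));

    server.serve(); // blocks and serves client requests
  }
}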

 

4) client
On startup, the client connects to the coordinator once to receive the server list, then connects to one of the servers and starts sending data to it.
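
Roughly, the client side can look like the following minimal sketch; the framed transport is needed because the server side is nonblocking, and LogService together with its log method are again hypothetical names standing in for whatever is defined in thrift-lib.

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ClientMain {
  public static void main(String[] args) throws Exception {
    // A nonblocking Thrift server requires a framed transport on the client side.
    TTransport transport = new TFramedTransport(new TSocket("server-host", 9090));
    transport.open();

    // "LogService" is a placeholder for the service generated from thrift-lib.
    LogService.Client client = new LogService.Client(new TBinaryProtocol(transport));
    client.log("app-log", "some log line"); // hypothetical RPC defined in the IDL

    transport.close();
  }
}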

The client-side code itself can thus be written quite simply with Thrift, and we also wrote a new logback appender so that logs can be shipped just as easily.
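
The appender was likewise only shown as an image, so here is a minimal sketch of what a logback appender that forwards events to the distribution system can look like; ThriftLogClient is a hypothetical wrapper around the generated Thrift client, not an actual class from the post.

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

public class ThriftAppender extends AppenderBase<ILoggingEvent> {

  private ThriftLogClient client; // hypothetical wrapper around the generated client

  @Override
  public void start() {
    client = new ThriftLogClient(); // e.g. fetch the server list via the coordinator
    super.start();
  }

  @Override
  protected void append(ILoggingEvent event) {
    // Forward every log event to the data distribution server.
    client.log(event.getLoggerName(), event.getFormattedMessage());
  }

  @Override
  public void stop() {
    client.close();
    super.stop();
  }
}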

Once an appender like this is written and added to logback.xml, logs can be sent conveniently through the @Slf4j annotation.

 

4. Closing

[Figure: data distribution system architecture]

And this is the system we ended up with! We are currently distributing our data with it, and beyond that we have also built and operate our in-house cloud system on open source. Some of us even contribute back when we run into problems while using open-source projects. Most of you probably do already, but if you don't yet, let's all enjoy a convenient and rewarding open-source life :D

 

As a programmer, you have faced or will face this question on the path to career progression in the field of data science. While drawing definitive distinctions between Scala programming and Python programming is a daunting task, it is important to clearly understand the differences and choose a language according to your career goals as well as your areas of interest in big data analytics.

If you are planning to learn both eventually, I'd recommend starting with Python programming. It is the closest to what you already know in data science (mostly imperative and object-oriented, it seems), yet it has functional programming features that will help you get used to the new paradigm. After you get used to functional programming, Scala may become easier to learn.

You may notice that changing paradigms is a lot harder than a simple change of syntax. But hybrid languages like Python can help, as you don't have to change your way of thinking and coding all at once, and can instead do it at your own pace.

On the other hand, if you only plan to pick one, take your future programming plans into account. There are a few questions that you must answer yourself before picking one language.

Do you want to leverage the Java ecosystem or the Python ecosystem?

If this were a different comparison, there might be a more definitive answer to that question, but with Java and Python you'll pretty much have access to everything you need. If you know what you might be building, look in PyPI and Maven (or use Google) and see what libraries are available. Compare the syntax, commit history, GitHub stars, people using them, etc. Which ecosystem better supports your potential use case(s)?

Both Scala programming and Python programming are in demand, but as a Scala developer you should probably have some familiarity with Java and the Java ecosystem. Learning the Java ecosystem is a much bigger task than learning the Python ecosystem, and you're likely going to be in the market competing against developers with a solid Java background.

Both the JVM and Python interpreter are fairly ubiquitous, but you may have better support on your target platform for one over the other.

Scala is faster than Python in the vast majority of use cases. PyPy is a Python interpreter with a built-in JIT compiler. PyPy is very fast, but doesn't support most Python C extensions, so, depending on the libraries you're using, the CPython interpreter with C extensions for your libraries may outperform PyPy. Where performance is critical, Python often has fast modules written in C, so the particular libraries you're intending to leverage make a difference.

For data work, doing anything matrix-related in a JVM language without spending the time to write your own versions of BLAS/LAPACK is horrible. Cleaning data is a pain compared to Python as well. You are also much more likely to get hands-on with new technology faster in Python than in Java due to its open-source nature. PyCUDA for Nvidia existed before any Java counterpart. Good Scala code takes time and is admittedly more difficult to write than quality Python code, which will discourage many researchers who need quick results and have little experience producing quality code. On the upside for Scala, the threading is better and it is picking up speed, just not as much as a language such as Python, which benefits from use by scientists and academia as much as by programmers and others.

If the explanation above has not yet guided you to a choice between Scala training and Python training, here is a brief review that can bring about a definitive answer.

Scala vs Python.

Python

  1. Easy to learn for Java developers
  2. Has a great community
  3. You could do almost anything in Python
  4.  There's nothing really new in Python
  5. Slow in runtime

Scala

  1. Many new features for Java/PHP/C++/JS developers
  2. As fast as Java thanks to the JVM
  3. Not as verbose as Java code
  4. Shares libraries with the Java community
  5. The Scala community is a little bit lukewarm
  6. Lots of syntactic sugar - be careful when learning and using it
  7. The language itself is still evolving

 
