OAUTH2 Authentication with ADFS 3.0

9th Mar 2015

A quick run-through of the steps involved in integrating a Node.js client with Active Directory Federation Services for authentication using OAUTH2.

I recently had the dubious pleasure of proving the feasibility of authenticating apps against ADFS using its OAUTH2 endpoints. In short, whilst it is possible to securely prove identity and other claims, I’m left thinking there must be a better way.

Configuring ADFS for a new OAUTH2 client

I started with an Azure Windows Server 2012 R2 VM pre-configured with an ADFS instance integrated with existing SAML 2.0 clients (or Relying Parties in identity-speak). As I was only interested in proving the OAUTH2 functionality I could piggy-back on one of the existing Trusts. If you need to set one up, this guide might be useful.

To register a new client, from an Administrative PowerShell prompt, run the following -

Add-ADFSClient -Name "OAUTH2 Test Client" -ClientId "some-uid-or-other" -RedirectUri "http://localhost:3000/getAToken"

This registers a client called OAUTH2 Test Client which will identify itself as some-uid-or-other and provide http://localhost:3000/getAToken as the redirect location when performing the authorization request (A) to the Authorization Server (in this case ADFS).

The Authorization Code Flow

+----------+
| Resource |
|   Owner  |
|          |
+----------+
     ^
     |
    (B)
+----|-----+          Client Identifier      +---------------+
|         -+----(A)-- & Redirection URI ---->|               |
|  User-   |                                 | Authorization |
|  Agent  -+----(B)-- User authenticates --->|     Server    |
|          |                                 |               |
|         -+----(C)-- Authorization Code ---<|               |
+-|----|---+                                 +---------------+
  |    |                                         ^      v
 (A)  (C)                                        |      |
  |    |                                         |      |
  ^    v                                         |      |
+---------+                                      |      |
|         |>---(D)-- Authorization Code ---------'      |
| Client  |          & Redirection URI                  |
|         |                                             |
|         |<---(E)----- Access Token -------------------'
+---------+       (w/ Optional Refresh Token)

The diagram above, taken from the OAUTH2 RFC, represents the Authorization Code Flow, which is the only flow implemented by ADFS 3.0. This is the exchange that's going to take place to grant a user access. It's pretty easy to follow, but it's worth pointing out that -

- Some of the requests and responses go via the User-Agent, i.e. they're HTTP redirects.
- (B) is a double-headed arrow because it represents an arbitrary exchange between the Authorization Server (ADFS) and the Resource Owner (the user), e.g. login form -> submit -> wrong password -> submit again.

The ADFS 3.0 Authorization Code Flow

The OAUTH2 specification isn't any more specific than that - I'll come back to this later. So now you need to know what this translates to on the wire. Luckily, someone has already done a great job of capturing this (in more detail than is reproduced below).

A. Authorization Request

GET /adfs/oauth2/authorize?response_type=code&client_id=some-uid-or-other&resource=urn%3Arelying%3Aparty%3Atrust%3Aidentifier&redirect_uri=http%3A%2F%2Flocalhost%3A3000%2FgetAToken HTTP/1.1
Host: your.adfs.server

In this request the app asks the ADFS server (via the user agent) for an authorization code with the client_id and redirect_uri we registered earlier and a resource identifier associated with a Relying Party Trust.
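
To make this concrete, here's a rough sketch of how a Node.js/Express app might kick off the flow by redirecting the user to the authorize endpoint. The host name, the /login route and the config values below are illustrative - substitute the client_id and redirect_uri you registered above and the identifier of your own Relying Party Trust.

var express = require('express');
var querystring = require('querystring');

var app = express();

// Assumed values - use the ones you registered with Add-ADFSClient and your own RPT identifier
var config = {
    adfsHost: 'your.adfs.server',
    clientId: 'some-uid-or-other',
    redirectUri: 'http://localhost:3000/getAToken',
    resource: 'urn:relying:party:trust:identifier'
};

// Send the user agent off to the ADFS authorize endpoint (request A)
app.get('/login', function (req, res) {
    res.redirect('https://' + config.adfsHost + '/adfs/oauth2/authorize?' + querystring.stringify({
        response_type: 'code',
        client_id: config.clientId,
        resource: config.resource,
        redirect_uri: config.redirectUri
    }));
});

app.listen(3000);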

B. The Actual Login Bit…

This is the bit where the sign-in is handed off to the standard ADFS login screen if you don’t have a session or you’re implicitly signed in if you do. Speaking of that login screen, if you were hoping to meaningfully customise it, forget it.

C. Authorization Grant

HTTP/1.1 302 Found
Location: http://localhost:3000/getAToken?code=<the code>

D. Access Token Request

POST /adfs/oauth2/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Host: your.adfs.server
Content-Length: <some number>

grant_type=authorization_code&client_id=some-uid-or-other&redirect_uri=http%3A%2F%2Flocalhost%3A3000%2FgetAToken&code=thecode
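
From Node.js, that request might look something like the sketch below. It reuses the hypothetical config object from the earlier snippet and the built-in https and querystring modules; a real client would also want to handle non-200 responses.

var https = require('https');
var querystring = require('querystring');

// POST a form-encoded body to the ADFS token endpoint and parse the JSON response
function postToTokenEndpoint(params, callback) {
    var body = querystring.stringify(params);
    var req = https.request({
        host: config.adfsHost,
        path: '/adfs/oauth2/token',
        method: 'POST',
        headers: {
            'Content-Type': 'application/x-www-form-urlencoded',
            'Content-Length': Buffer.byteLength(body)
        }
    }, function (res) {
        var data = '';
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () { callback(null, JSON.parse(data)); });
    });
    req.on('error', callback);
    req.write(body);
    req.end();
}

// Exchange the authorization code for an access token (request D)
function requestAccessToken(code, callback) {
    postToTokenEndpoint({
        grant_type: 'authorization_code',
        client_id: config.clientId,
        redirect_uri: config.redirectUri,
        code: code
    }, callback);
}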

E. Access Token Response

HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8

{ 
    "access_token":"<access_token>",
    "token_type":"bearer",
    "expires_in":3600
}

Establishing the user’s identity and other grants

The interesting bit is the <access_token> itself: it is in fact a JSON Web Token (JWT), that is to say a signed representation of the user's identity and other grants. You can either opt to trust it, if you retrieved it over a secure channel from the ADFS server, or validate it using the public key of the configured Token Signing Certificate.

Here’s the example Node.js implementation I created, which opts to validate the token. The validation itself is performed by the following snippet -

var fs = require('fs');
var jwt = require('jsonwebtoken'); // assuming the jsonwebtoken npm module
var adfsSigningPublicKey = fs.readFileSync('ADFS-Signing.cer'); // Exported from ADFS
function validateAccessToken(accessToken) {
    var payload = null;
    try {
        payload = jwt.verify(accessToken, adfsSigningPublicKey);
    }
    catch(e) {
        console.warn('Dropping unverified accessToken', e);
    }
    return payload;
}
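
Putting the pieces together, a handler for the registered redirect URI might look like the following sketch, continuing the hypothetical Express app and the requestAccessToken helper from earlier. Exactly which claims come back (upn is used here) depends on the claim rules configured on the Relying Party Trust.

// Handle the redirect back from ADFS (C), swap the code for a token (D/E) and verify it
app.get('/getAToken', function (req, res) {
    requestAccessToken(req.query.code, function (err, tokenResponse) {
        if (err) {
            return res.status(500).send('Token request failed');
        }
        var claims = validateAccessToken(tokenResponse.access_token);
        if (!claims) {
            return res.status(401).send('Could not verify access token');
        }
        // The verified payload carries the user's identity claims
        res.send('Authenticated as ' + claims.upn);
    });
});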

Obtaining refresh tokens from ADFS 3.0

Refresh tokens are available from the ADFS implementation but you need to be aware of the settings detailed in this blog post. To set them you’d run the following from an Administrative PowerShell prompt -

Set-AdfsRelyingPartyTrust -TargetName "RPT Name" -IssueOAuthRefreshTokensTo AllDevices
Set-AdfsRelyingPartyTrust -TargetName "RPT Name" -TokenLifetime 10
Set-AdfsProperties -SSOLifetime 480

This would issue access tokens with a lifetime of 10 minutes and refresh tokens to all clients with a lifetime of 8 hours.
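
With those settings in place the token response should also carry a refresh_token field alongside the access_token. It can be redeemed at the same /adfs/oauth2/token endpoint using the standard OAUTH2 refresh_token grant - a minimal sketch, reusing the hypothetical postToTokenEndpoint helper from step D above -

// Exchange a refresh token for a new access token; the response again contains access_token and expires_in
function refreshAccessToken(refreshToken, callback) {
    postToTokenEndpoint({
        grant_type: 'refresh_token',
        client_id: config.clientId,
        refresh_token: refreshToken
    }, callback);
}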

Conclusion

Whilst I did get the OAUTH2 integration to work, I was left a bit underwhelmed by it, especially when compared to the features touted by AzureAD. Encouraged by the TechNet library docs, I'd initially assumed ADFS would be compatible with AzureAD and tried to get ADAL to work against it. However, I quickly discovered that ADAL expects an OpenID Connect-compatible implementation, which is something ADFS does not currently offer.

It might be my lack of Google-fu, but this became typical of the problems I had finding definitive documentation. I think it's just one of the problems that comes with the non-standardised OAUTH2 standard. Another is the amount of customisation you have to do to make an OAUTH2 library work with a given implementation. OpenID Connect looks like a promising solution to this, but only time will tell whether it gains significant adoption.

When things go wrong…

Whilst trying to work out the correct configuration, I ran into a number of errors along the way. Most of them pop out in the ADFS event log but occasionally you might also get a helpful error response to an HTTP request. Here’s a brief summary of some of the ones I encountered and how to fix them -

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthInvalidClientException: MSIS9223: Received invalid OAuth authorization request. The received ‘client_id’ is invalid as no registered client was found with this client identifier. Make sure that the client is registered. Received client_id: ‘…’.

When making the authorize request, you either need to follow the process above for registering a new OAUTH2 client or you’ve mistyped the identifier (n.b. not the name).

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthInvalidResourceException: MSIS9329: Received invalid OAuth authorization request. The ‘resource’ parameter’s value does not correspond to any valid registered relying party. Received resource: ‘…’.

When making the authorize request, you've either got a typo in your RPT identifier, or you need to create an RPT with the given identifier or register it against an existing RPT.

Microsoft.IdentityServer.Web.Protocols.OAuth.Exceptions.OAuthAuthorizationMissingResourceException: MSIS9226: Received invalid OAuth authorization request. The ‘resource’ parameter is missing or found empty. The ‘resource’ parameter must be provided specifying the relying party identifier for which the access is requested.

When making the authorize request, you've not specified a resource parameter - see the previous errors. I found that most OAUTH2 libraries expect to pass a scope parameter but not a resource.

HTTP error 503

This normally meant I had a typo in the /adfs/oauth2/authorize or /adfs/oauth2/token URLs (don’t forget the 2).

IDINCU Engineering

Building an In-House Data Distribution System with Open Source (feat. Thrift, Zookeeper)

1. Why we needed it

Hardly any service these days is different, and ours is no exception: every day hundreds of thousands to millions of log entries pile up here and there, and a similar volume of survey responses and other data is stored in a central RDBMS. As we have gradually been moving to a service-oriented architecture (SOA), the individual distributed servers no longer access the central RDBMS directly, so we needed a system that could efficiently manage and distribute the hugely varied logs and data each of them produces. The best-known existing option was Suro, an open-source project released by Netflix, but we wanted to shape everything from the transport layer to the sinks that store the data to fit our service exactly (non-Java client support, sinks integrated with our systems that can actively read metadata, and so on), and building it ourselves didn't look like a huge amount of work, so we decided to write our own. Or so I write - read that as "I was told to build it" /ㅁ/

2. The open source we used

1) Thrift
Since this system has to handle more traffic than almost anything else we run, the transport layer had to be a library that has been thoroughly proven in terms of speed and stability. We also considered Protocol Buffers and Avro, but in the end - swayed, as ever, by the Facebook name - we went with Thrift, which is comparatively the most widely used.

2) Zookeeper
We use Zookeeper to manage the rules that decide how data is distributed and the state of the servers that are currently running. A plain RDBMS would have worked, but looking up the rules on every request is far too much load, and with a cache, rule changes don't take effect immediately and each server's cache expires at a different time, which can cause problems. With Zookeeper, rule changes can be pushed out the moment they happen, and there is an option that makes data disappear when the connection that created it closes, which also makes it very easy to tell whether a server is alive.

3) Curator
Zookeeper is already quite simple to use, but Curator is a library that makes it even easier.

3. Project structure

The project is made up of four sub-projects: thrift-lib, server, coordinator and client. Along with the role of each sub-project, I'll also show how we use Thrift and Zookeeper.

1) thrift-lib
As the name suggests, this sub-project defines the data types and call specifications used for Thrift-based communication across the whole service. Every other sub-project imports it.

Create a [filename].thrift file defining those data types and calls, then run

thrift --gen java [filename].thrift

and the compiler generates a communication library for Java in the gen-java directory. With that generated code, communicating between processes becomes very simple. Thrift supports a large number of languages besides Java, so it will also be easy to write servers or clients in other languages later.

2) server
The server receives large volumes of data from clients and queues it; a separate thread reads the queue and distributes the data to various sinks according to the configured rules. All sorts of sinks can be written and plugged in: one that writes to local files, one that forwards to Graylog, one that sends email, one that pushes into our own storage service, and so on. Zookeeper, mentioned above, is used here in two ways.

First, by putting a watch on a particular node, a listener lets us pick up and apply any change the moment it happens. Second, we can create nodes in Zookeeper ourselves, and with CreateMode.EPHEMERAL a node is deleted automatically when the connection that created it closes. One thing to note: while the node watches fired with almost no latency, it took tens of seconds for a node to disappear after its connection was closed, so we had to add retry handling to stop a restarting server from hitting an exception because its old node was still present.

3) coordinator
The coordinator has a very simple job: based on the nodes the servers create in Zookeeper, it tells clients the current state of the servers. We could have used a load balancer such as HAProxy, but in a system where large volumes of data such as logs pour in in real time, the load balancer itself can become a single point of failure. Instead, a client fetches the server list from the coordinator only on first connection or when something fails, and from then on connects to the servers directly. The coordinator's Thrift server is started using the thrift-lib described in 1); it takes very little code, and besides the non-blocking server Thrift also provides thread-pool and simple servers.

4) client
On startup, the client connects to the coordinator once to fetch the server list, then connects to one of the servers and sends its data.

The client-side code is also easy to write with Thrift, and we built a new logback appender so that logs can be shipped just as conveniently.

Once the appender is added to logback.xml, logs can be sent simply through the @Slf4j annotation.

4. Wrapping up

Data distribution system architecture diagram

And this is the system we ended up with! We are now using it to distribute our data, and our in-house cloud system is likewise built and run on open source. Some of us even contribute fixes upstream when we spot a problem in an open-source project we use. Most of you probably do this already, but if you don't yet, I hope you'll join in and enjoy the convenient and rewarding open-source life :D

 

As a programmer, you have probably faced, or will face, this question on the path to career progression in the field of data science. While drawing definitive distinctions between Scala and Python is a daunting task, it is important to understand the differences clearly and choose a language according to your career path and your areas of interest in big data analytics.

If you are planning to learn both eventually, I'd recommend starting with Python. It is the closest to what you already know (mostly imperative and object-oriented, it seems), yet it has functional programming features that will help you get used to the new paradigm. After you get used to functional programming, Scala may become easier to learn.

You may notice that changing paradigms is a lot harder than a simple change of syntax. But a hybrid language like Python can help, as you don't have to change your way of thinking and coding all at once and can instead do so at your own pace.

On the other hand, if you only plan to pick one, take your future programming plans into account. There are a few questions you should answer for yourself before picking a language.

Do you want to leverage the Java ecosystem or the Python ecosystem?

If this were a different comparison there might be a more definitive answer to that question, but with Java and Python you'll pretty much have access to everything you need. If you know what you might be building, look in PyPI and Maven Central (or just search) and see what libraries are available. Compare the syntax, commit history, GitHub stars, who is using them, and so on. Which ecosystem better supports your potential use case(s)?

Both Scala and Python skills are in demand, but as a Scala developer you should probably have some familiarity with Java and the Java ecosystem. Learning the Java ecosystem is a much bigger task than learning the Python ecosystem, and you're likely to be competing in the market against developers with a solid Java background.

Both the JVM and Python interpreter are fairly ubiquitous, but you may have better support on your target platform for one over the other.

Scala is faster than Python in the vast majority of use cases. PyPy is a Python interpreter with a built-in JIT compiler; it is very fast, but it doesn't support most Python C extensions, so, depending on the libraries you're using, the CPython interpreter with C extensions for your libraries may outperform PyPy. Where performance is critical, Python often has fast modules written in C, so the particular libraries you intend to use make a difference.

For data work, anything matrix-related in a JVM language is horrible unless you spend the time writing your own bindings to BLAS/LAPACK. Cleaning data is a pain compared to Python as well. You are also likely to get hands-on with new technology faster in Python than in Java, thanks in part to its open-source culture; PyCUDA for NVIDIA GPUs existed before any Java counterpart. Good Scala code takes time and is admittedly more difficult to write than quality Python code, which will discourage many researchers who need quick results and have little experience producing quality code. On the upside for Scala, the threading is better and the language is picking up speed, just not as much as Python, which benefits from use by scientists and academia as much as by programmers.

If the explanation above has not yet helped you choose between Scala and Python, here is a brief summary that may bring you to a definitive answer.

Scala vs Python.

Python

  1. Easy to learn for Java developers
  2. Has a great community
  3. You can do almost anything in Python
  4. There's nothing really new in Python
  5. Slow at runtime

Scala

  1. Many new features for Java/PHP/C++/JS developers
  2. As fast as Java thanks to the JVM
  3. Not as verbose as Java code
  4. Shares libraries with the Java community
  5. The Scala community is a little bit lukewarm
  6. Lots of syntactic sugar - learn and use it with caution
  7. The language itself is still evolving

 
