Thinking@Scale Yan Qi

Career Planning

Long View Approach - Career Planning

Over the past 20 years, human life expectancy has improved significantly, and the retirement age has been rising. In other words, retirement starts later but lasts longer. People used to think that a career would be over by the time they reached their 40s, yet that may not even be the halfway point. In fact, people tend to underestimate the length of a career. It is therefore necessary to plan for a long career journey, especially for anyone who wants that career to be successful.

Generally careers can be divided into three stages:

  1. Start strong in the first 15 years of the career;
  2. Reach high in the middle;
  3. Go far near or even beyond retirement.

The book The Long View introduces a career mindset, a framework, and a set of tools to help us learn how to collect the ‘fuel’ needed to achieve our career goals at the different stages. After reading it, I made a presentation based on the book that will hopefully highlight its main points.

Clean Architecture

Built with simple rules: water, air, sun, gravity

Software development has many similarities with building construction. Both rest on a few rules that seem simple, like gravity in the physical world or the single responsibility principle (SRP) in programming. However, not every developer can apply them well, especially in complex scenarios. An architect should first understand those principles well and develop a sharp eye for seeing through complexity, so that she can apply the rules to achieve a clean architecture.

In his book Clean Architecture: A Craftsman’s Guide to Software Structure and Design, Uncle Bob describes those principles in detail. More importantly, he explains how they help achieve a clean architecture. After reading it, I made a presentation based on the book that will hopefully highlight its main points.

Clean Agile

Teamwork is everywhere and is especially important in human society. In a software project, for example, it cannot be emphasized too much once more than one person is involved. Many aspects can affect how well a team works; the key ones are communication and collaboration.

In my not-so-long career as a software engineer, I have found that one of the biggest obstacles preventing developers from delivering a successful software product is the communication gap between them and their business partners. Many failures could be avoided if both parties synced up earlier. However, timing is not the only factor. Communication may lead nowhere if a common language is absent. Business people often use a human language such as English to describe what they need, that is, the specifications, whereas developers prefer more formal languages and typically think in terms of translating the business specifications into code (e.g., acceptance tests). This difference clearly poses a challenge.

Agile addresses the challenge faced by a small group of software developers with a feedback-driven approach. A software project is therefore composed of many small cycles, each of which aims to deliver a working product that the business partners can review, after which both sides discuss and decide what to do next. Instead of prescribing particular rules or steps, Agile emphasizes a set of principles and values and encourages teams to cultivate a culture out of them. Robert C. Martin’s book Clean Agile: Back to Basics explains these values and principles very clearly. It also provides quite a few guides for applying Agile in practice. After reading it, I made a presentation based on the book that will hopefully highlight its main points.

RecordBuffer - A Data Serialization Approach in DataMine

Data serialization is a basic problem that every data system has to deal with. An efficient data serialization approach should arrange the data into a compact binary format that is independent of any particular application. Several open-source data serialization systems exist today, such as Avro, Protocol Buffers and Thrift. These projects have reasonably large communities and are rather successful in different application scenarios. They are general-purpose, follow broadly similar ideas when serializing data and providing APIs for message exchange, and usually work well. They can also work with other data formats such as Parquet to provide a variety of options.

However, they could do better when applied to data with a nested structure. For example, the in-memory representation of a record may consume a similar amount of memory even when only a few columns have meaningful values. In addition, deserialization performance can be improved with the help of an index.

DataMine, the data warehouse of Turn, uses a flexible, efficient and automated mechanism to manage data storage and access. It describes the data structure in DataMine IDL and follows a code generation approach to define the APIs for data access and schema reading. A data encoding scheme, RecordBuffer, is applied to data serialization and deserialization. RecordBuffer represents the content of a table record as a byte array. Specifically, RecordBuffer has the following structure.

  • Version No. specifies what version of schema this record uses; it is required and takes 2 bytes.
  • The number of attributes in the table schema is required and takes 2 bytes.
  • Reference section length is the number of bytes used for the reference section; it is required and takes 2 bytes.
  • Sort-key reference stores the offset of the sort key column if it exists; it is optional and takes 4 bytes when present.
  • The number of collection-type attributes uses 1 byte for the number of collections in the table, and it is required.
  • Collection-type field references store the offsets of the collections in the table sequentially; note that the offset of an empty collection is -1.
  • The number of non-collection-type field references uses 1 byte for the number of non-collection-type columns which have the hasRef annotation.
  • Non-collection-type field references sequentially store the ID and offset pairs of columns with the hasRef annotation, if any exist.
  • Bit mask of attributes is a series of bytes indicating the availability of each attribute in the table.
  • Attribute values store the values of available attributes in sequence; note that the sequence should be the same as that defined in the schema.
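The layout above can be made a little more concrete with a small sketch. The Python below packs and parses only the three fixed 2-byte header fields and the attribute bit mask. The big-endian byte order, the bit ordering inside the mask, the opaque treatment of the reference section, and all function names are assumptions made for illustration; they are not taken from DataMine itself.

```python
import struct

# Assumed encoding: three unsigned 2-byte fields (version, attribute count,
# reference-section length) in big-endian order. The byte order is a guess.
HEADER = struct.Struct(">HHH")


def build_bitmask(num_attributes, present):
    """Return one bit per attribute; bit i is set when attribute i has a value.

    Assumption: attribute i lives in byte i // 8 at bit position i % 8
    (least-significant bit first).
    """
    mask = bytearray((num_attributes + 7) // 8)
    for i in present:
        mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def is_present(mask, i):
    """Check whether attribute i is marked as available in the bit mask."""
    return bool(mask[i // 8] & (1 << (i % 8)))


def pack_header(version, num_attributes, ref_section):
    """Write the fixed header followed by an already-encoded reference section."""
    return HEADER.pack(version, num_attributes, len(ref_section)) + ref_section


def unpack_header(buf):
    """Split a record into (version, attribute count, reference section, rest)."""
    version, num_attributes, ref_len = HEADER.unpack_from(buf, 0)
    ref_end = HEADER.size + ref_len
    return version, num_attributes, buf[HEADER.size:ref_end], buf[ref_end:]
```

Because the reference-section length is stored up front, a reader can skip the whole section in one step and land on the bit mask, which then tells it which attribute values follow.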

Unlike other encoding schemes, RecordBuffer has a reference section that can hold an index or any record-specific information. With an index in the reference section, a field (such as the sort key) can be located directly, which simplifies deserialization significantly. In addition, frequently accessed derived values can be stored in the reference section to speed up data analytics. This is particularly useful when nested data are allowed. For example, a summary of the nested attribute values can be derived and stored in the reference section, so that the usually very costly deserialization of the nested table can be avoided when applying an aggregation to the attribute.
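To illustrate why a stored offset helps, here is a toy comparison in Python. The layout is deliberately simplified and is not the real RecordBuffer format: a 4-byte offset, then some length-prefixed fields, then an 8-byte integer standing in for the sort key. All names and field widths are hypothetical.

```python
import struct


def encode(fields, sort_key):
    """Toy layout: [4-byte sort-key offset][length-prefixed fields][8-byte key]."""
    body = b"".join(struct.pack(">H", len(f)) + f for f in fields)
    offset = 4 + len(body)  # position of the sort key from the record start
    return struct.pack(">i", offset) + body + struct.pack(">q", sort_key)


def read_sort_key_direct(buf):
    """Jump straight to the sort key using the stored offset."""
    (offset,) = struct.unpack_from(">i", buf, 0)
    (key,) = struct.unpack_from(">q", buf, offset)
    return key


def read_sort_key_sequential(buf, num_fields):
    """Without the offset, every preceding field must be walked first."""
    pos = 4
    for _ in range(num_fields):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2 + length
    (key,) = struct.unpack_from(">q", buf, pos)
    return key
```

Both readers return the same value, but the direct one does a constant amount of work regardless of how many fields precede the key. A derived summary of a nested table could be stored and read in the same way.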

Deploy Application As Data (DaaD)

The distributed computing stack commonly uses a layered structure. Each layer defines a functionally independent component, and the layers are connected through APIs. This structure makes it quite easy for a system to scale. Such a system may be composed of, for example, the local OS/FS, a distributed FS, a resource management system, distributed computing frameworks, and applications. Nowadays, HDFS is often used as the distributed file system. YARN is one example of a resource management system, and Spark is one of the promising computing frameworks.

The key idea of distributed computing is to run the same code on different parts of the data and then aggregate the results into the final solution. Specifically, the data are first partitioned, replicated and distributed across the cluster nodes. When an application is submitted, the resource management system decides how many resources to allocate and where the code can run (usually on the nodes where the input data are stored, which is known as data locality). The computing framework devises a job or work plan for the application, which may be made up of tasks. More often than not, a driver is launched on the client side (e.g., lines in green) or by a worker (e.g., lines in orange). The driver initializes the job and coordinates the job and task execution. Each task is executed by a worker running on a node, and the result can be shuffled, sorted and copied to another node where further execution takes place.
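The partition, execute and aggregate pattern can be sketched on a single machine. In the Python below, threads stand in for worker nodes and the calling function plays the role of the driver; the function names and the sum-of-squares task are made up for illustration, and nothing here models replication, data locality or shuffling.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_parts):
    """Split the input into num_parts roughly equal slices."""
    return [data[i::num_parts] for i in range(num_parts)]


def task(part):
    """The same code runs on every partition; here, a partial sum of squares."""
    return sum(x * x for x in part)


def run_job(data, num_workers=4):
    """Driver: partition the data, run one task per partition, aggregate."""
    parts = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(task, parts))
    return sum(partials)
```

Calling `run_job(list(range(10)))` gives the same answer as computing the sum of squares in one pass, since the aggregation step only adds up the partial results.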

There are two popular ways to deploy the application code for execution.

  1. Deploy the code in the cluster nodes - This approach distributes the application to every node of the cluster and to the client server. In other words, every node involved in the system has a copy of the application code. It is not common, but it is sometimes necessary when the application depends on code running on the node. The disadvantages of this approach are obvious. First, the application and the computing system are strongly coupled, so any change on either side could cause issues for the other. Second, code deployment becomes very tedious and error prone. Consider the case where some nodes in the distributed environment fail during the code deployment: the state of the cluster becomes unpredictable when those failed nodes come back with the old version of the code.
  2. Deploy the code in the client only - A more common strategy is to deploy the application code to the client server only. When the application runs, the code is first distributed to the cluster nodes through some caching mechanism, such as the distributed cache in Hadoop. This simple but effective approach decouples the application from its underlying computing framework very well. However, when the number of clients is large, the deployment can become nontrivial. Also, if the application is very large, the job may have a long initialization phase because the code must be distributed across the cluster.

DaaD: Deploy Application As Data

In distributed computing, code and data are traditionally treated differently. Data can be uploaded to the cloud and then copied and distributed by means of the file system utilities. Code deployment, however, is usually more complex. For example, the network topology of the application nodes must be well defined beforehand. A sophisticated sync-up process is often required to ensure consistency and efficiency, especially when the number of application nodes is large.

Therefore, if the code can be deployed as data (i.e., DaaD), code deployment becomes much simpler. DaaD is a two-phase process.

  1. The application code is uploaded to the distributed file system just like ordinary data files.
  2. When the application runs, a launcher loads the code from the distributed file system, stores it in the distributed cache accessible to all nodes, and starts the execution on the nodes.

Clearly, the launcher must be deployed to the client where the execution request is submitted. It should be independent of any specific application. An example can be found at DaaD1.
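As a rough, single-machine illustration of the two phases, the Python sketch below treats a local directory as the distributed file system and another directory as the distributed cache. The directory layout (`<app>/<version>/app.py`), the `launch` function and its parameters are all hypothetical, and this is not the example referenced above; a real launcher would use the file system and cache APIs of the cluster.

```python
import importlib.util
import shutil
from pathlib import Path


def launch(dfs_root, app_name, version, cache_dir, entry_point="main", **kwargs):
    """Fetch application code stored as a plain file, cache it, and run it.

    dfs_root stands in for the distributed file system and cache_dir for the
    distributed cache; both are ordinary local directories in this sketch.
    """
    source = Path(dfs_root) / app_name / version / "app.py"
    cached = Path(cache_dir) / app_name / version / "app.py"
    cached.parent.mkdir(parents=True, exist_ok=True)
    # Copy on every request, so whatever was uploaded last is what runs.
    shutil.copyfile(source, cached)

    # Load the cached file as a module and call its entry point.
    spec = importlib.util.spec_from_file_location(f"{app_name}_{version}", cached)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, entry_point)(**kwargs)
```

The launcher knows nothing about the application beyond where its file lives and which function to call, which is what keeps it independent of any specific application. Passing a different `version` selects a different copy of the code without touching the launcher.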

The improvement brought by DaaD can be significant.

  • The deployment becomes much simpler. Often the code can be uploaded to the distributed file system with a single command. The file system utilities then replicate and distribute the code automatically.

  • The launcher can be defined as a simple class or executable file, which is quite stable. It is trivial to distribute it to the application nodes.

  • The application code is loaded for execution only when an execution request is issued. That is, the code is copied and distributed lazily. Importantly, the latest version of the code is always used.

  • Having no local copy avoids the code inconsistency problem.

  • It becomes much easier for different code versions to coexist in production. Imagine a scenario where the same application must run with different code versions for an A/B test.