<h1><a href="https://yan-qi.github.io">Thinking@Scale</a></h1>

<h2><a href="https://yan-qi.github.io/2021/12/03/DDIA-2">Understand Data Models</a> (2021-12-03)</h2>
<p><img src="/public/DDIA-ch2.jpg" alt="Designing Data-Intensive Applications - Chapter 2" /></p>
<p>A data model describes data in terms of objects and relationships.
The <em>hierarchical model</em>, a close ancestor of today’s <em>document model</em>, was used in IBM’s Information Management System (IMS), the popular database for business data processing in the 1970s. It worked well when the relationships among objects were one-to-many. When more complex relationships such as many-to-many are involved, normalization is needed to eliminate duplicates and JOINs are required to combine the data. However, the document model supports neither normalization nor JOIN efficiently. Two models were introduced to address the challenge posed by many-to-many relationships: the <em>network model</em> and the <em>relational model</em>.</p>
<p>The <em>network model</em> is a natural generalization of the hierarchical or tree structure of the document model. In the network model, an object can link to more than one child and be linked from multiple parents; therefore many-to-one and many-to-many relationships are allowed. To access a record or object, <em>an access path</em> must be provided, following a path from a root record along these chains of links. If many-to-many relationships exist in the network, many paths may reach the same object, and the developer has to keep all of them in mind while retrieving records. The complexity of querying and updating the data thus became a huge concern, and the idea died down after the 1980s.</p>
<p>The <em>relational model</em>, on the other hand, flattens the data into a set of tables (or relations), each of which holds a collection of rows (or tuples). Keys and foreign keys are introduced as attributes (fields) of the tables to build the relationships, and JOIN is supported to combine relations through the foreign keys during query execution. Conceptually a query still needs an access path to reach any record, but the path is chosen implicitly: the query optimizer automatically picks an efficient access path to execute the query. The user doesn’t have to worry about how the optimizer works and can mostly focus on the business logic. Remarkably, the relational model has thrived for decades and still dominates the database world.</p>
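<p>To make the contrast concrete, here is a toy Java sketch (my own illustration, not from the book) of a many-to-many relationship kept normalized: each course title is stored exactly once, and an enrollment list plays the role of the join table that a JOIN would traverse.</p>

<pre><code class="language-java">
import java.util.List;
import java.util.Map;

// A toy many-to-many example: students and courses.
// Normalization keeps each course title in one place; the
// "enrollments" list plays the role of a join table.
public class ManyToManyJoin {
    public static void main(String[] args) {
        Map<Integer, String> students = Map.of(1, "Alice", 2, "Bob");
        Map<Integer, String> courses  = Map.of(10, "Databases", 20, "Networks");
        // (studentId, courseId) pairs, i.e. the join table
        List<int[]> enrollments = List.of(
                new int[]{1, 10}, new int[]{1, 20}, new int[]{2, 10});

        // A hand-written JOIN: resolve each foreign-key pair to names.
        for (int[] e : enrollments) {
            System.out.println(students.get(e[0]) + " takes " + courses.get(e[1]));
        }
    }
}
</code></pre>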
<p>The book <em>Designing Data-Intensive Applications</em> devotes a chapter to an overview of data models and query languages. I presented <a href="/public/presentation/DDIA/ch2">the chapter</a>, giving more details about the above.</p>
<h2><a href="https://yan-qi.github.io/2021/11/20/DDIA-1">Data System Design - Reliability, Scalability and Maintainability</a> (2021-11-20)</h2>
<p><img src="/public/DDIA-ch1.jpg" alt="Designing Data-Intensive Applications - Chapter 1" /></p>
<p>A successful data system should be able to meet various requirements while solving its data problems, including both functional and nonfunctional requirements.
The functional requirements are often application specific, describing what should be done with the data and how the results are achieved.
Among the nonfunctional requirements, many factors affect the design and implementation, but three aspects are so important that they should be considered throughout the development cycle: <em>reliability, scalability and maintainability</em>.</p>
<p>Any data system is <em>software</em> developed by <em>humans</em>, deployed and run in an environment composed of <em>hardware</em>. It is important to make the system work correctly even when faults occur. Problems can be caused by hardware faults (e.g., network interruptions, disk failures, power outages), software issues (e.g., bugs) and human errors (e.g., misconfiguration, mistakes in operation). <strong>Reliability</strong> captures this concern and guides the use of fault-tolerance techniques.</p>
<p>In the real world, a data system grows as its input carries a larger data or traffic volume and often becomes more complex. We need precise measurements of load and performance, based on which strategies can be applied to keep performance steady as load grows, thereby achieving good <strong>scalability</strong>.</p>
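<p>As a concrete example of measuring performance, response-time percentiles usually say more than an average, because tail latencies dominate the user experience. Below is a minimal Java sketch (with made-up sample values) that computes a few percentiles with the nearest-rank method.</p>

<pre><code class="language-java">
import java.util.Arrays;

// A simple sketch: response-time percentiles describe user experience
// better than the average, because the tail dominates perceived latency.
public class LatencyPercentiles {
    static double percentile(long[] sortedMillis, double p) {
        // Nearest-rank method: the value below which p percent of samples fall.
        int idx = (int) Math.ceil(p / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        long[] responseTimes = {12, 15, 11, 300, 14, 13, 900, 16, 12, 14}; // made-up samples
        Arrays.sort(responseTimes);
        System.out.printf("p50=%.0f ms, p95=%.0f ms, p99=%.0f ms%n",
                percentile(responseTimes, 50),
                percentile(responseTimes, 95),
                percentile(responseTimes, 99));
    }
}
</code></pre>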
<p>Additionally, a data system often has a long life cycle, so its <strong>maintainability</strong> plays a critical role in the course of its evolution. As the book suggests, “<em>good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations</em>”. Engineering and operations teams ought to work together, and sometimes grow with the system.</p>
<p>The book <em>Designing Data-Intensive Applications</em> gives a good discussion and offers guidance on designing data systems with reliability, scalability and maintainability in mind. I presented <a href="/public/presentation/DDIA/ch1">the first chapter</a> as the start of a long journey.</p>
<h2><a href="https://yan-qi.github.io/2021/07/24/Career_Planning">Career Planning</a> (2021-07-24)</h2>
<p><img src="/public/sunset-view.jpg" alt="Long View Approach - Career Planning" /></p>
<p>In the past 20 years, human life expectancy has improved significantly and the retirement age has been rising. In other words, retirement starts later but lasts longer. People used to think a career would be over by their 40s; in fact, that may not even be the halfway point. People tend to underestimate the length of a career. It is therefore necessary to plan for a long career journey, especially if a successful career is the goal.</p>
<p>Generally careers can be divided into three stages:</p>
<ol>
<li>Start strong in the first 15 years of the career;</li>
<li>Reach high in the middle;</li>
<li>Go far near or even beyond retirement.</li>
</ol>
<p>The book <strong>The Long View</strong> introduces a set of career mindsets, frameworks and tools to help us learn how to collect the ‘fuel’ needed to achieve our career goals at the different stages. As a result of reading and learning, I made a presentation based on the book, hopefully highlighting the main points.</p>
<ul>
<li>
<h3 id="a-summary-on-the-long-view-career-stategies-to-start-strong-reach-high-and-go-far"><a href="/public/presentation/LongView">A Summary on <em>The Long View: Career Stategies to Start Strong, Reach High, and Go Far</em>.</a></h3>
</li>
</ul>
<h2><a href="https://yan-qi.github.io/2021/06/22/Clean_Architecture">Clean Architecture</a> (2021-06-22)</h2>
<p><img src="/public/nature-1.jpeg" alt="Built with simple rules: water, air, sun, gravity" /></p>
<p>Software development has many similarities with building construction. There are a few rules that seem simple, like the physics of gravity in the physical world or the single responsibility principle (SRP) in programming. However, not all developers can use them well, especially in complex scenarios. An architect should first have a good understanding of those principles and grow a sharp pair of eyes to see through the complexity, so that she can apply those rules to achieve a clean architecture.</p>
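<p>As a toy illustration of the SRP (my own example, not one from the book): a class that both computes pay and formats a report has two reasons to change; splitting it gives each class a single responsibility.</p>

<pre><code class="language-java">
// A toy illustration of the Single Responsibility Principle:
// one class per reason to change.

// Changes only when the pay calculation rules change.
class PayCalculator {
    double calculatePay(double hours, double rate) {
        return hours * rate;
    }
}

// Changes only when the report format changes.
class PayReportFormatter {
    String format(String employee, double pay) {
        return String.format("%s earned $%.2f", employee, pay);
    }
}
</code></pre>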
<p>Uncle Bob, in his book <strong>Clean Architecture: A Craftsman’s Guide to Software Structure and Design</strong>, gives a detailed description of those principles. More importantly, he tries to explain how a <em>clean architecture</em> can be achieved with their help. As a result of reading and learning, I made a presentation based on the book, hopefully highlighting the main points.</p>
<ul>
<li>
<h3 id="a-summary-on-clean-architecture-a-craftsmans-guide-to-software-structure-and-design"><a href="/public/presentation/CleanArchitecture">A Summary on <em>Clean Architecture: A Craftsman’s Guide to Software Structure and Design</em>.</a></h3>
</li>
</ul>
<h2><a href="https://yan-qi.github.io/2020/10/12/Clean_Agile">Clean Agile</a> (2020-10-12)</h2>
<p>Teamwork is everywhere and is especially important in human society. In a software project, its importance cannot be overstated as soon as more than one person is involved. Many aspects affect the performance of teamwork, but the keys are <strong>communication</strong> and <strong>collaboration</strong>.</p>
<p>In my not-so-long career as a software engineer, I have found that one of the biggest challenges preventing developers from delivering a successful software product is the communication gap between them and their business partners. Many failures could be avoided if both parties were able to sync up earlier. However, timing is not the only factor: the communication may lead nowhere if a common language is absent. Business people often use a human language, like English, to describe what they need, i.e., the specifications; developers prefer more formal languages and typically think of translating the business specifications into code (e.g., acceptance tests). This difference clearly creates a challenge.</p>
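<p>For instance, a business rule stated in English, say “orders over $100 ship for free” (a hypothetical rule of my own, for illustration), can be written down as an executable acceptance test that both sides can read and agree on. A minimal JUnit-style sketch:</p>

<pre><code class="language-java">
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical example: the business rule "orders over $100 ship for
// free" written as an executable acceptance test both sides can read.
class ShippingAcceptanceTest {

    // A toy implementation standing in for the real service.
    static class OrderService {
        double shippingFee(double orderTotal) {
            return orderTotal > 100.0 ? 0.0 : 5.99;
        }
    }

    @Test
    void ordersOverOneHundredDollarsShipForFree() {
        OrderService service = new OrderService();
        assertEquals(0.0, service.shippingFee(120.00), 0.001);  // free shipping
        assertEquals(5.99, service.shippingFee(80.00), 0.001);  // standard fee
    }
}
</code></pre>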
<p>Agile tries to address the challenges faced by a small group of software developers with a feedback-driven approach. A software project is therefore composed of many small cycles, each of which aims to produce a working, deliverable product that the business partners can review, so that both sides can discuss and decide what to do next. Instead of particular rules or steps, Agile emphasizes a set of principles and values, and encourages teams to cultivate a culture out of them. The book by Robert C. Martin, <strong>Clean Agile: Back to Basics</strong>, gives a very clear explanation of these values and principles. Furthermore, it provides quite a few guides for applying Agile in practice. As a result of reading and learning, I made a presentation based on the book, hopefully highlighting the main points.</p>
<ul>
<li>
<h3 id="a-summary-on-clean-agile-back-to-basics"><a href="/public/presentation/CleanAgile">A Summary on <em>Clean Agile: Back to Basics</em>.</a></h3>
</li>
</ul>
<h2><a href="https://yan-qi.github.io/2016/01/03/RecordBuffers">RecordBuffer - A Data Serialization Approach in DataMine</a> (2016-01-03)</h2>
<p>Data serialization is a basic problem that every data system has to deal with. To provide an efficient solution, a data serialization approach should be able to arrange the data into a compact binary format that is independent of any particular application. Nowadays there are several open-source data serialization systems, such as <a href="http://avro.apache.org/">Avro</a>, <a href="https://developers.google.com/protocol-buffers/">Protocol Buffer</a> and <a href="https://thrift.apache.org/">Thrift</a>. These projects have reasonably large communities and have been rather successful in different application scenarios. They are general purpose, follow broadly similar ideas when serializing data, provide APIs for message exchange, and usually work well. Additionally, they can work with other data formats such as <a href="https://parquet.apache.org/">Parquet</a> to provide a variety of options.</p>
<p>However, they could do better when applied to data with nested structures. For example, the in-memory record representation may consume a similar amount of memory even when only a few columns have meaningful values. Deserialization performance could also be improved with the help of an index.</p>
<p><a href="http://www.turn.com/digital-hub/product-suites#analytics">DataMine</a>, the data warehouse of Turn, exploits a flexible, efficient and automated mechanism to manage the data storage and access. It describes the data structure in <a href="https://github.com/turn/DataMine/blob/master/doc/DataMine_IDL.md">DataMine IDL</a>, follows a code generation approach to define the APIs for data access and schema reading. A data encoding scheme, <a href="https://github.com/turn/DataMine/tree/master/recordbuffers">RecordBuffer</a> is applied to the data serialization/de-serialization. RecordBuffer depicts the content of a table record as a byte array. Particularly RecordBuffer has the following structure.</p>
<p><img src="https://github.com/turn/DataMine/raw/master/recordbuffers/doc/res/record_buf.png" width="1100" /></p>
<ul>
<li><em>Version No.</em> specifies what version of schema this record uses; it is required and takes 2 bytes.</li>
<li>The number of attributes in the table schema is required and takes 2 bytes.</li>
<li><em>Reference section length</em> is the number of bytes used for the reference section; it is required and takes 2 bytes.</li>
  <li><em>Sort-key reference</em> stores the offset of the sort-key column; it is optional and takes 4 bytes if present.</li>
  <li>The number of collection-type attributes in the table is required and takes 1 byte.</li>
  <li><em>Collection-type field references</em> store the offsets of the collections in the table sequentially; note that the offset of an empty collection is -1.</li>
  <li>The number of non-collection-type field references takes 1 byte, counting the non-collection-type columns that have the hasRef annotation.</li>
  <li><em>Non-collection-type field references</em> sequentially store the ID and offset pairs of columns with the hasRef annotation, if any exist.</li>
<li><em>Bit mask of attributes</em> is a series of bytes to indicate the availability of any attributes in the table.</li>
<li><em>Attribute values</em> store the values of available attributes in sequence; note that the sequence should be the same as that defined in the schema.</li>
</ul>
<p>Different from other encoding schemes, RecordBuffer has a reference section that can hold an index or any record-specific information. An index in the reference section can locate a field (such as the sort key) directly, simplifying data deserialization significantly. Frequently accessed derived values can also be stored in the reference section to speed up data analytics. This is quite useful when nested data are allowed: for example, a summary of the nested attribute values can be derived and stored in the reference section, so that deserializing the nested table (usually very costly) can be avoided when applying aggregation to that attribute.</p>
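<p>As a rough sketch of what reading such a layout involves (not DataMine’s actual code; the byte order and sample values are my own assumptions), the fixed header fields described above can be pulled out of the byte array like this:</p>

<pre><code class="language-java">
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A rough sketch (not DataMine's actual implementation) of reading the
// fixed header of a RecordBuffer-style byte array, following the layout
// above: 2-byte version, 2-byte attribute count, 2-byte reference-section length.
public class RecordHeaderReader {
    public static void main(String[] args) {
        // A fabricated record: version 3, 12 attributes, 6-byte reference section.
        ByteBuffer buf = ByteBuffer.allocate(64).order(ByteOrder.BIG_ENDIAN);
        buf.putShort((short) 3).putShort((short) 12).putShort((short) 6);
        buf.flip();

        short version = buf.getShort();          // schema version used by this record
        short numAttributes = buf.getShort();    // attributes defined in the schema
        short refSectionLength = buf.getShort(); // bytes used by the reference section

        System.out.printf("version=%d, attributes=%d, refSectionBytes=%d%n",
                version, numAttributes, refSectionLength);
        // The reference section (sort-key offset, collection offsets, ...)
        // would be read from the next refSectionLength bytes.
    }
}
</code></pre>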
<h2><a href="https://yan-qi.github.io/2015/11/11/daad">Deploy Application As Data (DaaD)</a> (2015-11-11)</h2>
<p>The distributed computing stack commonly uses a layered structure. A functionally independent component is defined on each layer, and different layers are connected through APIs. This structure makes it quite easy for the system to scale.
One example of such a system is composed of the local OS/FS, a distributed FS, a resource management system, distributed computing frameworks, and applications. Nowadays HDFS is often used as the distributed file system, YARN is one example of a resource management system, and Spark is one of the promising computing frameworks.</p>
<p><img float="center" src="http://thinkingscale.com/public/stack.png" width="600" height="" border="0" alt="" /></p>
<p>The key idea of distributed computing is to run the same code on different parts of the data and then aggregate the results into the final solution. In particular, the data are first partitioned, replicated and distributed across the cluster nodes. When an application is submitted, the resource management system decides how much resource is allocated and where the code can run (usually on the nodes where the input data are stored, so-called <em>data locality</em>). The computing framework devises a job or work plan for the application, which may be made up of tasks. More often than not, a driver is issued on the client side (e.g., lines in green) or by a worker (e.g., lines in orange). The driver initializes the job and coordinates the job and task execution. Each task is executed by a worker running on a node, and the result can be shuffled, sorted and copied to another node where further execution is done.</p>
<p><img float="center" src="http://thinkingscale.com/public/distributed_computing.png" width="600" height="" border="0" alt="" /></p>
<p>There are two popular ways to deploy the application code for execution.</p>
<ol>
  <li><strong>Deploy the code on the cluster nodes</strong> - This approach distributes the application to every node of the cluster and to the client server. In other words, all involved nodes in the system have a copy of the application code. It is not common, but in some cases it is necessary, e.g., when the application depends on code running on the node. The disadvantages of this approach are obvious. First, the application and the computing system are strongly coupled, so any change on either side could cause issues for the other. Second, code deployment becomes very tedious and error prone. Think about the case where some nodes in the distributed environment fail during the code deployment: the state of the cluster becomes unpredictable when those failed nodes come back alive with an old version of the code.</li>
  <li><strong>Deploy the code on the client only</strong> - A more common strategy is to deploy the application code to the client server only. When running the application, the code is first distributed to the cluster nodes with some caching mechanism, such as the distributed cache in Hadoop. This simple but effective approach decouples the application from its underlying computing framework very well. However, when the number of clients is large, the deployment can become nontrivial. Also, if the application is very large, the job may have a long initialization process, as the code needs to be distributed across the cluster.</li>
</ol>
<h3 id="daad-deploy-application-as-data">DaaD: Deploy Application As Data</h3>
<p>In distributed computing, code and data are traditionally treated differently. The data can be uploaded to the cloud and then copied and distributed by means of the file system utilities. Code deployment, however, is usually more complex. For example, the network topology of application nodes must be well defined beforehand, and a sophisticated sync-up process is often required to ensure consistency and efficiency, especially when the number of application nodes is large.</p>
<p>Therefore, if the code can be deployed as data (i.e., DaaD), code deployment becomes much simpler. DaaD is a two-phase process.</p>
<ol>
  <li>The application code is uploaded to the distributed file system just like ordinary data files.</li>
  <li>When running the application, a launcher loads the code from the distributed file system, stores it in the distributed cache accessible to all nodes, and issues the execution on the nodes.</li>
</ol>
<p>Clearly, the launcher must be deployed to the client where the execution request is submitted, and it should be independent of any specific application. An example can be found at <a href="https://github.com/turn/DaaD">DaaD</a><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
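<p>The following is a minimal sketch of what such a launcher might do with the standard Hadoop APIs (my own illustration, not the actual DaaD code; the HDFS path is hypothetical): the application jar already sits in HDFS as ordinary data, and the launcher merely puts it on the job’s classpath via the distributed cache.</p>

<pre><code class="language-java">
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// A minimal launcher sketch (not the actual DaaD implementation):
// the application jar is treated as data already sitting in HDFS,
// and is pulled onto the task classpath via the distributed cache.
public class DaadLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "daad-launched-app");

        // Hypothetical HDFS location where the application code was
        // uploaded as an ordinary file (phase 1 of DaaD).
        Path appJar = new Path("hdfs:///apps/my-app/current/my-app.jar");
        job.addFileToClassPath(appJar);   // phase 2: distributed lazily at launch time

        // ... the rest of the job setup (mapper, input/output paths, etc.)
        // would be driven by the application's own configuration.
    }
}
</code></pre>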
<p>The improvements brought by DaaD can be significant.</p>
<ul>
<li>
    <p>The deployment becomes much simpler. Often the code can be uploaded to the distributed file system with a simple command. The replication and distribution of the code are then achieved <em>automatically</em> through the file system utilities.</p>
</li>
<li>
    <p>The launcher can be defined as a simple class or executable file, which is quite stable. It is trivial to distribute it to the application nodes.</p>
</li>
<li>
    <p>The application code is loaded for execution only when an execution request is issued. In other words, the code is copied and distributed lazily. Importantly, the latest version of the code is always used.</p>
</li>
<li>
    <p>Having no local copy avoids the code inconsistency problem.</p>
</li>
<li>
    <p>It makes it much easier for different code versions to coexist in production. Imagine the scenario where it is necessary to run the same application with different code versions for an A/B test.</p>
</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>DaaD - <a href="https://github.com/turn/DaaD">https://github.com/turn/DaaD</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2><a href="https://yan-qi.github.io/2015/10/10/turn_db">Highly Efficient User Profile Management in Petabyte-Scale Hadoop-based Data Warehouse</a> (2015-10-10)</h2>
<p>Traditionally, user profile data are stored and managed in the data warehouse. A user profile can be updated frequently to reflect changes in user attributes and behavior. Moreover, it is critical to support fast query processing for profile analytics. These basic requirements become challenging in a big data system, as the scale of profiles has reached a point (over petabytes) that traditional data warehousing technology can barely handle.</p>
<p>The challenges mainly come from two aspects: the updating process is costly because the daily change can be huge, and query performance must keep improving as the input grows rapidly. For example, the daily changes in a big data system can be more than terabytes, which makes data updating very expensive and often slow (high latency).</p>
<p>The traditional data warehouse provides a standard solution to user profile management and analytics; however, the data scale it normally deals with is not satisfactory. The data size in a traditional data warehouse is usually up to tens of terabytes. In a big data system, a petabyte-scale data store is quite normal, so it is difficult for a traditional data warehouse to store and manage so many user profiles efficiently.</p>
<p>At Turn, we came up with an integrated solution for highly efficient user profile management. Basically, we build a data warehouse on top of the Hadoop file system. The size of the data stored in the warehouse can be on a petabyte scale. The data can be stored in row-based or columnar layouts, so the warehouse provides both efficient updating performance and fast profile analytics.</p>
<h3 id="architecture">Architecture</h3>
<p>The architecture of the profile management system is mainly composed of the following components:</p>
<ol>
<li>Operational Data Store (ODS)</li>
<li>Analytics Data Store (ADS)</li>
<li>Parallel ETL</li>
<li>Cluster Monitor</li>
<li>Database</li>
<li>Query Dispatcher</li>
<li>Analytics Engine</li>
</ol>
<p><img style="float: center" src="http://thinkingscale.com/public/OA.png" width="700x" /></p>
<p>Every day a huge amount of data is generated by the front-end systems, like ad servers. The Parallel ETL (PETL) is a process running in the cluster to collect and process these data in parallel and store them in the Operational Data Store (ODS). The status of the PETL, like what data have been processed, is collected by the cluster monitor and kept in the Database.</p>
<p>The ODS is row-based storage, as it must support fast ingestion of the incoming updates. A key-value store can be faster for data updating, but it is much slower for data analytics. In the case of profile management, the updates can be merged into the profile store (i.e., the ODS) regularly in batch mode. The cluster monitor collects the status of the ODS, like what data have been stored, what queries have been executed, etc., and stores it in the Database.</p>
<p>The Analytical Data Store (ADS) provides a better solution for data analytics. In the ADS, data are stored in columns. Compared with its row-based counterpart (i.e., the ODS), the columnar store has a better compression ratio and therefore a smaller data size. More importantly, only the data of interest are loaded and read when executing a user query on the columnar store. The disk I/O saving is almost optimal, so in I/O-intensive applications it can achieve execution times that are orders of magnitude faster than the ODS. The cluster monitor collects the status of the ADS, like what data have been stored, what queries have been executed, etc., and stores it in the Database.</p>
<p>As the data have different layouts in the ODS and the ADS, there is a data conversion from the ODS to the ADS, and the conversion result is merged with the data in the ADS. Note that the ADS may not have the latest updates from the ODS, because the conversion is done in batch mode. The cluster monitor collects the status of the conversion job and stores it in the Database.</p>
<p>When a user query is submitted, it is first stored in the query table of the Database. The query dispatcher keeps scanning the Database to (1) decide which query to execute next based on factors such as the waiting time and the query priority, (2) decide which data store (i.e., ODS or ADS) to use for the query execution based on data availability and cluster resource availability, and (3) send the query job to the analytics engine for execution.</p>
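<p>Decision (2) can be illustrated with a simplified sketch (my own illustration, not the production dispatcher): prefer the columnar ADS when its batch conversion has already caught up with the data the query needs and it has capacity, otherwise fall back to the ODS.</p>

<pre><code class="language-java">
// A simplified illustration (not the production dispatcher) of decision (2):
// prefer the columnar ADS when it already holds the data the query needs,
// otherwise fall back to the row-based ODS.
public class StoreChooser {
    enum Store { ODS, ADS }

    static Store chooseStore(long queryDataVersion,
                             long adsLatestVersion,
                             boolean adsHasFreeSlots) {
        boolean adsHasData = adsLatestVersion >= queryDataVersion;
        return (adsHasData && adsHasFreeSlots) ? Store.ADS : Store.ODS;
    }

    public static void main(String[] args) {
        // The query needs data converted up to version 42; the ADS is at 40.
        System.out.println(chooseStore(42, 40, true));   // ODS: conversion lags behind
        System.out.println(chooseStore(38, 40, true));   // ADS: data and slots available
    }
}
</code></pre>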
<p>Both the ODS and the ADS have an analytics engine to execute the query job from the query dispatcher.</p>
<p>The Database stores all the status information, including (1) the submitted query jobs, (2) the status of the cluster, (3) the status of the data stores (the ADS and the ODS), and (4) the status of jobs running in the cluster, including the PETL, the converter, etc.</p>
<h2><a href="https://yan-qi.github.io/2015/06/04/Multi_Site_Data_Warehouse">Disaster Recovery across Data Centers</a> (2015-06-04)</h2>
<p>In the era of big data, data storage becomes so large that recovery from a disaster, such as the power outage of a data center, becomes very difficult. The traditional transaction-oriented data management system relies on a write-ahead commit log to record the system state, and the recovery process works only if the log processing is faster than the incoming change requests. In other words, a commit-log-based approach hardly works for a big data system where terabytes of non-transactional daily changes are the norm.</p>
<p>At Turn, we exploit a geographically separated master-slave architecture to support high availability (HA) and disaster recovery (DR) in the large-scale Hadoop-based DWS (Data Warehouse System).</p>
<p>The master and the slave are located in different data centers that are geographically apart. Functionally, the slave is like a mirror of the master. Each of them is composed of a Hadoop cluster, a relational database, an analytics engine, a cluster monitor, a query dispatcher, a parallel ETL component, and a console. Not all of these components in the slave are active. For instance, the <em>cluster monitor</em> in the slave is standing by, whereas its <em>analytics engine</em> is active to accept query jobs.</p>
<p>Data replication happens from the master to the slave to ensure data consistency between the Hadoop clusters, and database replication propagates any change on the master database to the slave database. The master and slave are connected with a dedicated high-speed WAN (Wide Area Network).</p>
<p>The master-slave architecture makes DR and HA possible when one of the data centers fails. Additionally, workload balancing between the master and the slave improves the query throughput<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
<h3 id="failover-and-disaster-recovery">Failover and Disaster Recovery</h3>
<p>Failure is common in large storage systems. It can be due to hardware failures, software bugs, human errors, etc. The master-slave architecture makes HA and DR possible and simple in the petabyte-scale data warehouse system at Turn.</p>
<h4 id="network-failure">Network Failure</h4>
<p>When the WAN fails completely, performance degradation can occur, as all queries will be dispatched to the master only. The data and database replications also stop. Fortunately, nothing special has to be done with the Hadoop cluster when the WAN comes back to normal, because the data replication process will catch up with the data the slave missed. A sync-up operation is triggered to synchronize the slave database with the master database.</p>
<h4 id="hadoop-cluster-failure">Hadoop Cluster Failure</h4>
<p>In case of a Hadoop cluster failure in the slave, the DWS keeps working, as the master has everything untouched. It is a little more complex when the master loses its Hadoop cluster, as that implies a master-slave swap. In particular, all components of the slave become active and take over the interrupted processes; for instance, the <em>parallel ETL</em> becomes active first and starts to ingest data. Importantly, during the swap the DWS keeps accepting and running user query submissions.</p>
<p>When the failed Hadoop cluster comes back, the data replication process helps to identify and copy the difference from the current master to the slave. Depending upon the downtime and data loss caused by the failure, it can take up to days to complete the entire recovery. However, certain queries can already be executed on the recovering cluster. Furthermore, based on the query history in the relational database, the data hot spots can be recovered first.</p>
<h4 id="data-center-failure">Data Center Failure</h4>
<p>It is rare but severe if one of the data centers fails. At Turn, the DWS tolerates a failure of either the master or the slave. When the slave data center is down, performance is degraded, because only one of the Hadoop clusters is available and all workloads are moved to the working one. Data recovery is trivial, because the replication processes will catch up with the difference once the failure is fixed.</p>
<p>If the master data center fails, the stand-by services in the slave become active right away. There is a chance that data or queries may be lost if the failure happens in the middle of data replication. However, the loss can be mitigated if the replication is scheduled to run more frequently. After all services are active, the slave becomes the master. When the failed data center is recovered, it runs as the slave: a data replication is issued to transfer the difference over, and before the database replication starts, a database sync-up process is required so that the new slave has the same content in its relational database.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
      <p>Any query job can be executed in either Hadoop cluster, as long as its input is available. Therefore, it is possible to balance the workload between clusters. Specifically, given a query submission, the <em>query dispatcher</em> first checks the input availability on both clusters. If the input is available in both, the <em>query dispatcher</em> assigns the query to the cluster that is less busy. In most cases, the query result is small and the cost of reading it back from the slave is negligible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2><a href="https://yan-qi.github.io/2015/05/20/DistCPPlus">Efficient Distributed Copy across Data Centers</a> (2015-05-20)</h2>
<blockquote>
<p>DistCp is a tool used for large inter/intra-cluster data transfer. It uses Map-Reduce to effect its distribution, error handling and recovery, and reporting. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</blockquote>
<p>Currently DistCp is mainly used for transferring data across different HDFS (Hadoop Distributed File System) instances. The clusters can sit in the same data center, where the data flow through a LAN (Local Area Network), or in different data centers connected by a WAN (Wide Area Network). Basically, DistCp issues Remote Procedure Calls (RPCs) to the name nodes of both the source and the destination to fetch and compare file statuses and to build the list of files to copy. <img style="float: right" src="http://thinkingscale.com/public/hadoop-mirror.png" width="500x" /> An RPC is often very expensive if the name nodes are located in different data centers; in our experience it can be up to 200x slower than within the same data center. DistCp may issue the same RPCs more than once, dragging the overall performance down even further. DistCp also does not support regular expressions as input: if the user wants to filter and copy files from different folders, she has to either compute a list of file paths beforehand or execute multiple DistCp jobs. Moreover, DistCp allows preserving file attributes during the transfer, but it does not preserve the time stamp of a file, which is quite important in some applications.</p>
<p>To address these problems with DistCp, we introduced an enhanced version of the distributed copy tool, DistCp+<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup>. In particular, DistCp+ makes it easier and faster to transfer a large amount of data across data centers. Compared with DistCp, DistCp+ introduces improvements in the following aspects:</p>
<h3 id="support-regular-expression">Support Regular Expression</h3>
<p>A regular expression is a sequence of characters that forms a search pattern. It has been widely used in text processing utilities, for example the <em>grep</em> command in Unix. The regular expression used by DistCp+ is based on the syntax of regular expressions in Java<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>, with minor changes. To use the regular expression option with DistCp+ you must specify two parameters: a root URI and a path filter. The path filter follows normal regular expression rules but treats all ‘/’ tokens in a special way. The ‘/’ token is used as a delimiter, and the regular expression provided is split into multiple sub-expressions around this token. Each sub-expression is used as a separate path filter for a specific depth relative to the URI, with the leftmost sub-expression being used first.</p>
<blockquote>
  <p>For example, assume you specify “/logs/” as the root URI and provide a regular expression of “server1|server2/today|yesterday”. The regular expression will be split into the two sub-expressions “server1|server2” and “today|yesterday”. DistCp+ will then traverse the file system starting at the root (“/logs/”) and use any file that matches the first sub-expression (“server1|server2”). Folders are recursively expanded, but at each new depth in the file system the next sub-expression is used as the path filter. With this example, you can match files such as “/logs/server1/yesterday” and “/logs/server2/today”, but it will not match something like “/logs/yesterday”. Also note that if a folder matches the last path filter, the entire folder is used as input instead of being recursively traversed.</p>
</blockquote>
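<p>A minimal Java sketch of this splitting idea (my own illustration, not DistCp+’s actual code) shows how each sub-expression filters one depth of the path:</p>

<pre><code class="language-java">
import java.util.regex.Pattern;

// A minimal illustration (not DistCp+'s actual code) of the '/'-delimited
// regular expression: each sub-expression filters one depth of the path
// relative to the root URI.
public class DepthWisePathFilter {
    static boolean matches(String rootUri, String[] subExpressions, String path) {
        String relative = path.substring(rootUri.length());
        String[] components = relative.split("/");
        // Each path component must match the sub-expression at its depth.
        for (int depth = 0; depth < components.length && depth < subExpressions.length; depth++) {
            if (!Pattern.matches(subExpressions[depth], components[depth])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] subs = "server1|server2/today|yesterday".split("/");
        System.out.println(matches("/logs/", subs, "/logs/server1/yesterday")); // true
        System.out.println(matches("/logs/", subs, "/logs/yesterday"));         // false
    }
}
</code></pre>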
<h3 id="cache-file-status">Cache File Status</h3>
<p>When a DistCp job copies a large number of files, especially across geographically distant data centers, it usually has a very long setup time, as several RPCs are issued to collect the file statuses from both sides. The cost of an RPC is very high, especially when it goes over the WAN. DistCp repetitively issues RPCs to get individual directory or file status objects; these RPCs either overlap with previous ones or could be combined into fewer calls. To reduce the cost, a cache of file statuses is created in an early stage, where a directory-level RPC fetches all file statuses under that directory in one call. An RPC is then necessary only if a cache miss occurs in the following stages. For transfers of tens of thousands of files, we observed a significant improvement in the end-to-end time.</p>
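<p>Conceptually, the cache behaves like the following simplified sketch built on the standard Hadoop FileSystem API (not the actual DistCp+ code):</p>

<pre><code class="language-java">
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A simplified sketch (not the actual DistCp+ code) of the idea: one
// directory-level listStatus() RPC warms a cache, so later per-file
// status lookups avoid extra round trips over the WAN.
public class FileStatusCache {
    private final FileSystem fs;
    private final Map<Path, FileStatus> cache = new HashMap<>();

    public FileStatusCache(FileSystem fs) {
        this.fs = fs;
    }

    /** Prefetch all statuses under a directory in a single RPC. */
    public void warm(Path dir) throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            cache.put(status.getPath(), status);
        }
    }

    /** Serve from the cache; fall back to an RPC only on a miss. */
    public FileStatus get(Path file) throws IOException {
        FileStatus status = cache.get(file);
        return (status != null) ? status : fs.getFileStatus(file);
    }
}
</code></pre>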
<h3 id="keep-time-stamp">Keep Time Stamp</h3>
<p>DistCp supports preserving the file status, including block size, replication factor, user, group and permission. However, it does not keep the time stamp (the last-modified time) of the file, which is important especially when the “-update” option is used to skip files without any change. Checking the CRC (Cyclic Redundancy Check) of each file is an alternative, but the cost of computing CRCs is too high to be practical for large data transfers. Comparing the file size may not be accurate, as some changes do not change the size of a file. Therefore the time stamp is a better way to decide whether an update is necessary. In particular, a copied file has its time stamp preserved when required, and when the “-update” option is specified, DistCp+ compares the time stamps of files on the different clusters to decide whether a file is included.</p>
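<p>The decision boils down to something like the following simplified sketch on top of the Hadoop FileSystem API (my own illustration, not the actual DistCp+ code):</p>

<pre><code class="language-java">
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A simplified sketch of the "-update" decision: skip a file whose size
// and modification time already match on the destination, and preserve
// the source time stamp after copying.
public class TimestampAwareCopy {
    static boolean needsCopy(FileStatus src, FileStatus dst) {
        return src.getLen() != dst.getLen()
                || src.getModificationTime() != dst.getModificationTime();
    }

    static void preserveTimes(FileSystem dstFs, Path dstPath, FileStatus src)
            throws IOException {
        // setTimes(path, mtime, atime); -1 leaves the access time unchanged.
        dstFs.setTimes(dstPath, src.getModificationTime(), -1);
    }
}
</code></pre>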
<p>In <a href="http://www.turn.com">Turn</a>, DistCp+ has been used to transfer data among different data centers regularly. A DistCp+ job can usually copy thousands of files from different folders and the data volume can be terabytes.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Hadoop DistCp: <a href="http://hadoop.apache.org/docs/current1/distcp.html">http://hadoop.apache.org/docs/current1/distcp.html</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>DistCp+: <a href="https://github.com/turn/DistCPPlus">https://github.com/turn/DistCPPlus</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Lesson: Regular Expressions, <a href="http://docs.oracle.com/javase/tutorial/essential/regex/">http://docs.oracle.com/javase/tutorial/essential/regex/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2><a href="https://yan-qi.github.io/2014/11/03/Nested_Data_in_DataMine">Nested Data in DataMine</a> (2014-11-03)</h2>
<p>After joining <a href="http://www.turn.com">Turn</a>, I started to work on <a href="http://www.turn.com/digital-hub/product-suites#analytics">DataMine</a>, a petabyte-scale data warehouse built upon Hadoop. One of the most important features of DataMine is that it can effectively support nested data structures.</p>
<p><img style="float: left" src="http://thinkingscale.com/public/tables.jpeg" width="200x" />
Comparing with the traditional relational data model, the nested relational data model allows the value in a table to be a set or a hierarchical structure. While stored in the database it cannot be simply normalized, instead it is depicted in the <a href="http://en.wikipedia.org/wiki/Database_normalization#Non-first_normal_form_.28NF.C2.B2_or_N1NF.29">non-first normal form</a> (i.e., non-1NF). In other words, the constraint that <em>all domains must be atomic</em> is not satisfied. Clearly it is a drawback if the data needs updating frequently. Whereas the nested relational data model makes the data representation more natural and efficient, and importantly it can eliminate join operations while reading. From this point of view the nested data structure can work well with data warehouse, where <a href="http://en.wikipedia.org/wiki/Online_analytical_processing">OLAP (OnLine Analytical Processing)</a> is more common than <a href="http://en.wikipedia.org/wiki/Online_transaction_processing">OLTP (OnLine Transaction Processing)</a>.</p>
<p>DataMine exploits a nested relational data model. In particular, it allows the domain of one attribute of a table to be another table. One typical use case of DataMine is to store online user profiles in a table with nested tables. Each record is composed of many user attributes, such as ID, time stamp, campaign information, etc. Some attributes, like the campaign information, can themselves be nested tables. An example can be found below.</p>
<p><img src="http://thinkingscale.com/public/nested_data_1.png" width="650x" /></p>
<p>To enable efficient data access and query processing, DataMine implements an unnesting operation that flattens a record into a set of records, so that existing relational query execution techniques can be applied. In effect, the unnesting operation transforms a table with nested data from non-1NF to 1NF. For example, the table above can be unnested into the following result.</p>
<p><img src="http://thinkingscale.com/public/nested_data_2.png" width="650x" /></p>
<p>When tables become very large, it is not efficient to support JOINs between them. This may be one reason why fewer tables are strictly normalized as their sizes increase. Keeping everything within a single table can eliminate some JOINs, and correlation analytics at the record level becomes possible. DataMine allows JOINs between nested tables within a query by implementing special <em>LIST</em> functions.</p>
<p>A table in DataMine can have billions of records, and the nested table of a record can have millions of nested records. Scalability is always the first consideration in the design and development. Currently, DataMine stores its data in HDFS (the Hadoop Distributed File System). Depending upon the requirements, the data can be stored row-based or column-based. A columnar store is a good fit for use cases where partial deserialization is common, whereas a row-based store keeps a balance between read and write performance.</p>
<p>From my experience, many applications in the big data era share some common features:</p>
<ul>
  <li>Data normalization is not necessary. A nested data model is a natural choice when a hierarchy is involved in the data structure.</li>
  <li>The data are written once and read multiple times. In other words, data updating is not often a requirement.</li>
  <li>Complex data analytics can be implemented efficiently by applying JOIN operations among the nested tables inside a record.</li>
</ul>
<p>Certainly DataMine is a good fit in these applications.</p>
<h2><a href="https://yan-qi.github.io/2014/09/19/Spirits_in_Action_Never_Die">Spirits in Action Never Die</a> (2014-09-19)</h2>
<p>Recently an <a href="http://readwrite.com/2014/08/11/why-learn-php">article</a> in <a href="http://readwrite.com">readwrite.com</a> drew my attention. It talks about PHP, a programming language created in 1994. It reminds me of those days when I studied PHP, and makes me think a lot about programming languages and the ideas underneath them. (Interestingly, both PHP and Hadoop use an elephant in their logos.)</p>
<p>The first programming language that I learned was Pascal. Since then, there has been a long list of programming languages that I have used. A few of them were taught in my college years, such as Pascal, C and assembly. Others were mostly self-taught when they were regarded as necessary. For instance, when I did research on high-availability systems, I wrote some programs in Erlang, a programming language not commonly recognized.</p>
<p>The story of PHP might sound a little funny. PHP, as you may know, is a scripting language that allows programmers to build dynamic Web pages. Clearly, it doesn’t seem right to apply it in the field of, let’s say, machine learning. <img style="float: left" src="http://thinkingscale.com/public/php-1.png" width="250x" /> When I was a senior in college, one of the post-docs in the lab where I volunteered assigned me a job: to implement a machine learning algorithm in PHP. According to her, the program would be easily deployed as a Web service if it were implemented in PHP. I was too green to question it before taking action. In the following weeks, I tried my best to learn and use PHP to implement the algorithm. However, no matter how hard I tried, the program did not work: it was too slow to ever complete. In the end I had to re-implement the algorithm in C, which proved to be the right choice.</p>
<p>The bright side of this story is that the effort I put into PHP was not a waste. I learned how to build a website with PHP, and I was impressed by its simplicity and flexibility. PHP may not be perfect, but it is simple and powerful enough in most cases. Furthermore, learning how PHP works deepened my understanding of web programming. Other lessons that I learned about programming include:</p>
<ul>
  <li>Every programming language has its pros and cons. There is no such thing as a ‘silver bullet’.</li>
  <li>It is necessary to have a thorough understanding of the problems in a field in order to choose the right programming language.</li>
  <li>There is a huge gap between knowing a language and using it well. There is no shortcut, but continued practice always brings us closer to mastery.</li>
</ul>
<p>A programming language is quite different from a human language. Rather than a vehicle of communication, a programming language is more of a tool for solving computational problems. Obviously, as there can be more than one solution or different ways to attack a problem, no programming language would ever be the <em>ONE</em>. In some scenarios one may be better than another, but not necessarily in others. Sometimes it is amusing to see people arguing with each other about which programming language is the best. In many cases, people simply ignore the problems the languages try to solve and pay attention only to the features or functionality. I was one of them once upon a time, when I started to learn Java: I thought Java would replace C++ or C some day, as it is a write-once-run-anywhere language. However, that day never came, because Java and C++ each excel only in particular fields.</p>
<p>Instead of focusing on the programming languages, I believe it is more helpful to think about the problem. Why is there more than one programming language attacking the same problem? Is it a tricky problem? What is the challenge behind it? Is there anything we can do to improve an existing tool? Not only does thinking this way pull us out of pointless arguments, it also leads us to better ourselves.</p>
<h2><a href="https://yan-qi.github.io/2014/09/02/Open_Source_Open_Minds">Open Source, Open Minds</a> (2014-09-02)</h2>
<p>When I was in graduate school at ASU, I worked with my Ph.D. adviser, <a href="http://aria.asu.edu/candan/">Dr. K. Selçuk Candan</a>, on data integration. One goal of my work was to develop efficient algorithms to capture conflicts in data integration and provide effective schemes to resolve them. In our proposal<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>, we represent the integration result as a loop-less weighted directed graph, so a set of resolutions can be collected by searching for the top-k shortest paths.</p>
<p>Finding the top-k shortest paths is a classical graph problem, and <a href="http://en.wikipedia.org/wiki/Yen's_algorithm">Yen’s algorithm</a> provides the commonly known solution; a more recent and elegant treatment is given in <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">2</a></sup>.
However, I needed an implementation of this algorithm in C++ or Java, and after searching Google I could find neither. So I decided to <em>do it myself</em>. The initial implementation was in Java, used in our <a href="http://sigmod07.riit.tsinghua.edu.cn/acceptedPaperForSIGMOD.shtml">SIGMOD</a> demo<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">3</a></sup>.</p>
<p>Afterwards, I created a Google project<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">4</a></sup> to share my implementation, as I believe it is great if anyone can benefit from my work. Later on I added a C++ implementation at people’s request. I was actually surprised that so many people were interested in the project. Many of them really did apply the code to their work. More importantly, they gave me quite a lot of good feedback.
<img style="float: right" src="http://thinkingscale.com/public/open_source.jpg" width="110x" /></p>
<ul>
  <li>My implementation had bugs that were not caught by my tests. Some feedback helped me <strong>identify and fix</strong> most of them. One <a href="https://github.com/yan-qi/k-shortest-paths-java-version#a-note-about-a-bug">bug</a> was so tricky, because it was hidden in the algorithm<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">2</a></sup>, that it would have been almost impossible to dig out without users’ feedback.</li>
  <li>Some developers wanted to <strong>contribute back</strong>, joining me in bug fixing, code refactoring, and even new implementations. For instance, Vinh Bui added the C# implementation.</li>
  <li>A small <strong>community</strong> was built up around this project, where people help each other and encourage me to work harder toward better software.</li>
</ul>
<p>At this point, the project had become not only my own work, but one involving <em>many minds</em>.
I realized that the core of open source is <em>sharing</em>, but it should not be limited to the code only. It is more about <em>the human minds</em>, such as the developer experience and the user feedback.
Moreover, the <em>sharing</em> should be bi-directional, so the roles of the user and the contributor are interchangeable.
Additionally, <em>sharing</em> draws people together into a <em>community</em>, which in turn lifts <em>sharing</em> to a higher level.
In this sense, the philosophy advocated by open source can, I believe, be summarized as <em>open minds</em>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p><a href="http://dl.acm.org/citation.cfm?doid=1247480.1247499">FICSR: feedback-based inconsistency resolution and query processing on misaligned data sources. SIGMOD Conference 2007</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p><a href="http://link.springer.com/article/10.1007%2Fs10288-002-0010-2">A new implementation of Yen’s ranking loopless paths algorithm</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p><a href="http://dl.acm.org/citation.cfm?doid=1247480.1247639">Integrating and querying taxonomies with quest in the presence of conflicts. SIGMOD Conference 2007</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="http://thinkingscale.com/k-shortest-paths-cpp-version/">GitHub Project: An implementation of K-Shortest Path Algorithm</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2><a href="https://yan-qi.github.io/2014/08/21/When_Hadoop_met_Teradata">When Hadoop met Teradata</a> (2014-08-21)</h2>
<p>In 2009, I got my first full-time job with <a href="http://www.teradata.com">Teradata</a>. It was an R&D position. Since then, I have shifted my focus to parallel DBMSs and distributed computation.</p>
<p><a href="http://hadoop.apache.org/">Hadoop</a> was rising as a super-star in the big data movement.
<img style="float: right" src="http://thinkingscale.com/public/hadoop-logo.jpg" />
Comparing with proprietary software, Hadoop is open source (i.e., <em>FREE</em>) and has an very active community, getting favors from small or middle business.</p>
<p>Teradata, however, sells its products and services, in most cases, at a high price, and most of its clients are big companies.
As one of the important players in the field, Teradata did not want to be left behind in the race, even though it actually leads in many aspects. One big move to address the challenge occurred in 2011, when Teradata acquired <a href="http://en.wikipedia.org/wiki/Aster_Data_Systems">Aster Data Systems</a>.</p>
<p>The common idea shared by Hadoop and the Teradata EDW (Enterprise Data Warehouse) is that data are partitioned across multiple nodes so that the computation can be done in parallel.
<img style="float: right" src="http://thinkingscale.com/public/teradata-logo.jpg" width="280x" />
Therefore, one of my tasks at the time was to explore collaboration opportunities between them.</p>
<ul>
<li>
    <p>One application was to load data from HDFS into the Teradata data warehouse<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. It enables a co-existing Hadoop-Teradata EDW environment: the Teradata EDW is the first tier for data storage and the main data analytics, whereas Hadoop is the second tier for intermediate data storage and processing. Here Hadoop is mainly used as part of an ETL process rather than for data analytics.</p>
</li>
<li>
    <p>On the other hand, it is possible to run Hadoop MapReduce jobs on data stored in the Teradata EDW through customized input format classes<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Thus Hadoop provides a more flexible, though perhaps less efficient, solution for complex data analytics compared with UDFs (User Defined Functions).</p>
</li>
<li>
    <p>One interesting technique we developed is to optimize the data assignment from HDFS (the Hadoop Distributed File System) to the Teradata EDW<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, when Hadoop and the Teradata EDW are located on the same hardware cluster. It was exciting to see that the problem reduces to a <a href="http://en.wikipedia.org/wiki/Minimum-cost_flow_problem">min-cost flow</a> problem in a bipartite network.</p>
</li>
</ul>
<p>These projects had a long-lasting influence on my career in the following years. (1) They led me into a field where <strong>thinking at scale</strong> is necessary.
(2) Another important lesson I learned is that some ideas may come back and forth, like partitioning data for parallel computing, but people pay more attention to the problems they solve. Therefore, being problem-driven can be a better strategy in this field.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="http://dl.acm.org/citation.cfm?id=1989323.1989440&coll=DL&dl=GUIDE&CFID=537999572&CFTOKEN=72178896">A Hadoop based distributed loading approach to parallel data warehouses</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw">Hadoop MapReduce Connector to Teradata EDW</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="http://www.google.com/patents/US20130173666">Techniques for data assignment from an external distributed file system to a database management system</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2><a href="https://yan-qi.github.io/2014/08/16/met_the_neural_network">Met Neural Network</a> (2014-08-16)</h2>
<p>When I was in <a href="http://en.cs.ustc.edu.cn/">college</a>, I had a chance to work with a post-doc in our department on content-based image retrieval (CBIR)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. It was clearly a hard problem. For example, unlike text, it is challenging to capture the user’s real intention, and object recognition in images has always been an open problem. Therefore, we proposed an approach different from the traditional submit-question-return-answer search strategy: an interactive retrieval. In other words, it might not return a satisfactory result at first; instead it invites the user to give feedback if the result is not perfect. An improved result is then produced, taking the user feedback into account, until the user, conceptually, gets the right image.
The tools serving our purpose included <a href="http://en.wikipedia.org/wiki/Backpropagation">the BP neural network</a> and the interactive genetic algorithm (<a href="http://en.wikipedia.org/wiki/Interactive_evolutionary_computation#IGA">IGA</a>)<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
<p>Then <a href="http://www.image-net.org/">ImageNet</a> didn’t exist in those years, so in our experiments, we had to crawl
many images from the Internet. Whereas the variety was a problem as most
of the images we got were landscape photos. You can image that the
quality of our work was kind of limited.
However, working on this project was really an inspiring experience to me, as it opened a door to an unknown world where I had never been.</p>
<p><img src="http://thinkingscale.com/public/tongling_bridge.jpg" alt="Alt text" title="bridge@tongling" /></p>
<p>Another interesting project that I got involved in during graduate school at <a href="http://en.ustc.edu.cn/">USTC</a> was to create a bridge health monitoring system. My advisor, Professor <a href="http://dsxt.ustc.edu.cn/zj_ywjs.asp?zzid=322">Lu</a>, led the software development effort.
As an experiment, I created a BP neural network to predict the bridge health<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. However, the performance was not good enough for real-life application. On one hand, if the training data were not chosen properly, the results would not be right, and in my experiments there were often not enough data for training. On the other hand, the training process was always time-consuming and not very effective.</p>
<p>I started realizing that the neural network might not be as effective or efficient as it sounds. It tries to simulate the way people think, but clearly there is still a long way to go before it can <em>think</em> like a human.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="http://www.cqvip.com/Read/Read.aspx?id=5868569">Content-based Interactive Emotional Image Retrieval</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="http://www.cqvip.com/qk/90287x/200401/9625006.html">The Application of the IGA in the CBIR</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="http://wenku.baidu.com/link?url=neJaqED7eN9S7jK37wbvWv53bj_ZI5JFMWPStqPCCJt1nqZVoVjzoJ3SXg34Kh_9eNpj80EBccVdCa-Ivpdrmobt5W-MHNj9H7vryy4KDEa">The Application of ANN to the Bridge Survey</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>