(综述论文)Near Real Time ETL翻译 Part 9 - 数据处理区与数据仓库间的流量调节,数据仓库的技术问题

3.5.4 Flow regulation between the Data Processing Area and the data warehouse
Once data pass through the transformation and cleaning process in the data processing area they have to be loaded at the data warehouse. In the traditional ETL setting, this problem had a straightforward solution: all data would pass through the appropriate data loader, which is a vendor specific tool that takes care of the loading of data in a manner that is more efficient than any other alternative. Apart from the dimension and fact tables of the warehouse, special care is taken for any indexes or materialized views, too. In the case of near real time ETL, the luxury of the off-line idle warehouse that is loaded with data is no longer available and a new architecture must be investigated mainly due to the fact that the warehouse server is used by the end users for querying purposes at the same time that data are loaded to the queried tables or materialized views.
3.5.4 数据处理区与数据仓库之间的流量调节
数据在数据处理区完成清洗转换后,就得往数据仓库里装了。传统ETL环境下这根本不是问题——所有数据都走专门的数据加载器(各家厂商都有自己的工具,加载效率比别的方法都高)。除了维度表和事实表,索引和物化视图也会被妥善处理。但到了准实时ETL这儿就行不通了,因为仓库服务器现在得一边让终端用户查数据,一边往被查询的表和物化视图里灌新数据,再想像以前那样等仓库闲下来慢慢加载根本不可能,所以必须研究新架构。
翻译说明:
- \"off-line idle warehouse\"意译为\"等仓库闲下来\"更生动
We envision an architecture where the flow of data from the DPA to the warehouse is regulated by a data Warehouse Flow Regulator (WFlowR) so that (a) the warehouse receives data according to the demands of users for freshness and completeness, (b) the querying processes that are initiated by the users are not significantly delayed and (c) no data are lost at the DPA side due to the overflow of produced tuples as compared to the availability of loading at the warehouse side. The ultimate desideratum is to maximize the satisfaction of the above antagonizing goals while spending as few system resources as possible in terms of main memory, hard disk and CPU cycles.
我们设想这样一种架构:在数据处理区(DPA)到数据仓库之间加个仓库流量调节器(WFlowR),让它来管三件事:(a)按用户对数据新鲜度和完整性的要求往仓库送数据;(b)确保用户发起的查询不会被明显拖慢;(c)防止因为仓库来不及加载,导致DPA这边处理好的数据积压太多最后丢失。终极目标就是用最少的内存、硬盘和CPU资源,在这几个互相矛盾的需求之间找到最佳平衡点。
翻译说明:
- \"antagonizing goals\"译为\"互相矛盾的需求\"
The technical issues that arise are similar to the case of flow regulation between sources and the DPA and can be listed as follows:
• Once again, we need to take care of communication protocols, compression and encryption as well as a decision on the granularity of transmitted data.
• A scheduling policy concerning the order of data transmission is important. More important than that, though, is the issue of modus operandi, which offers the following options: (a) periodic, (b) push-based, propagating blocks of produced records to the warehouse as they come and (c) pull-based, where the scheduler decides to push data towards the warehouse given a request by the users or a predicted decrease of the warehouse load. As already mentioned, we anticipate the DPA to be the means to alleviate the possible overload of the sources and the warehouse and therefore, we find the second alternative less likely to be of practical use in warehouses with high load. On the contrary, this is a really useful solution for underused warehouses.
• Again, since the warehouse load, the user requests, and the tuple production are antagonizing goals, it is possible to perform a certain amount of tuple reservation in an attempt to produce a solution that respects all the possible constraints and maximizes a satisfaction function that combines the above goals.
这里遇到的技术问题和数据源到DPA之间的流量调节很像,具体包括:
首先,通信协议、数据压缩和加密这些老问题又来了,还得决定传输数据的粒度大小。
其次,数据发送顺序的调度策略很重要,但更重要的是传输模式的选择:(a) 周期性发送;(b) 主动推送——数据记录一处理好就打包发往仓库;(c) 被动拉取——根据用户请求或预测到仓库负载降低时才发送。前面说过,DPA本来就是为了缓解数据源和仓库的负载压力,所以第二种方式在高负载仓库里不太实用,反而更适合那些闲着没事干的仓库。
最后,由于仓库负载、用户请求和数据生产这几个需求互相掐架,可能又得搞数据暂存,在满足所有限制条件的前提下,尽量让这几个目标的综合满意度达到最高。
翻译说明:
- \"modus operandi\"译为\"传输模式\"
- \"互相掐架\"替代\"antagonizing\"
---------------------------------------------------- 休息会儿 -------------------------------------------------------
3.5.5 Technical issues concerning the Warehouse
The data warehouse side of the ETL process is responsible for refreshing the contents of the warehouse with newly produced data that come all the way from the sources. This task is called Loading (the ’L’ in the E-T-L triplet) and comprises the following sub-tasks, in the following order:
• Loading of the dimension tables with lookup, reference values
• Loading of the fact tables with factual data
• Maintenance of indexes and materialized views
3.5.5 关于数据仓库的技术问题
ETL流程中的数据仓库端负责用从数据源一路传来的新数据更新仓库内容。这个任务叫做加载(就是ETL里那个\"L\"),按顺序包含以下子任务:
先往维度表里装查找值和参考值,
再往事实表里塞实际数据,
最后维护索引和物化视图。
In a traditional ETL environment, this task is performed via vendor specific tools called loaders, when the data warehouse is idle -or even off-line. To this day, vendor-specific loaders are the fastest way to implement the loading of data to a relational database. Also, dropping and recreating the indexes is sometimes of comparable time and efficiency with respect to storage and querying than incrementally maintaining them. Unfortunately, the luxury of an idle warehouse is not present in a near real time warehouse, where the loading has to take place concurrently with the answering of queries posed by the end users. Therefore, the near real time warehouse has to find an equilibrium between two antagonizing goals, the near real time refreshment of its contents and the on-line answering of queries posed by the end users. This antagonism has at least two facets:
• A contention for server resources, such as main memory and CPU cycles, that has a clear impact on the performance (at least as far as the end users perceive it.)
• A contention for database resources, such as transaction locks, in the case of strict isolation levels (although, we anticipate that the user requests for consistency of the data they receive is low - and in any case, contradicting the requirement for as fresh data as possible.)
在传统ETL环境下,这个任务是通过厂商专用的加载工具完成的,而且都是趁数据仓库闲着甚至离线的时候操作。直到今天,这些专用加载器仍然是往关系数据库里灌数据最快的方式。有时候,把索引删了重建反而比增量维护更省存储空间、查得更快。但到了准实时数据仓库这儿就没这么舒服了——这边得一边给终端用户查数据,一边往仓库里装新数据。所以准实时仓库必须在两个互相掐架的目标之间找平衡:既要准实时刷新数据,又要在线响应用户查询。这种矛盾至少体现在两个方面:
首先是服务器资源争抢,比如内存和CPU,这直接影响到性能表现(至少用户能明显感觉到变卡)。
其次是数据库资源争抢,比如在严格隔离级别下的事务锁(不过我们估计用户对数据一致性的要求应该不高——再说了,要求数据绝对一致本来就和\"越新鲜越好\"的需求自相矛盾)。
翻译说明:
- \"变卡\"替代\"performance degradation\"
The technical topics that result from this problem concern the hardware configuration of the warehouse, the design of the data warehouse, and the implementation of a loading mechanism that maximizes data freshness without delaying user 27 queries above a certain tolerance level. Specifically, these topics can be detailed as follows.
这个问题引发的技术议题主要涉及三方面:仓库的硬件配置、数据仓库的设计,以及如何在不超过用户查询延迟容忍度的前提下,实现数据新鲜度最大化的加载机制。具体来说包括以下要点:
---------------------------------------------------- 休息会儿 -------------------------------------------------------
A first concern involves the hardware configuration of a data warehouse. Although warehouses are not as rigid hardware environments as sources are, still, it is not straightforward how to migrate an existing warehouse to a highly parallel and partitioned architecture that is probably needed for a near real time warehouse. Moreover, the scientific community is not in possession of a cost model that can relate the extent of parallelism and partitioning required to sustain a near real time ETL process along with on-line querying.
A second concern, with a clear research challenge involves the design of the data warehouse. For the moment, a warehouse is built on the basis of a star or snowflake schema, combined with bitmap of B+ tree indexes for performance reasons. On top of these constructs, data marts, reports, web pages and materialized views are also maintained by the refreshment process, once data have been loaded to the fact tables. Remember that so far, in the context of our deliberations, we have abstracted all these constructs as materialized views. The research question that arises asks whether new kind of schema structures, along with novel kinds of indexes and even, novel kinds of materialized views are necessary for the implementation of near real time data warehousing.
首先得考虑数据仓库的硬件配置问题。虽然仓库不像数据源那样硬件环境死板,但要把现有仓库迁移到准实时所需的并行分区架构也不简单。更麻烦的是,学术界现在连个靠谱的成本模型都没有,根本算不清到底需要多少并行和分区资源,才能既撑住准实时ETL又兼顾在线查询。
第二个明显带研究挑战的问题是仓库设计。目前仓库都是基于星型或雪花模型搭建的,为了性能还会配上位图或B+树索引。数据加载到事实表后,刷新流程还要顺带维护数据集市、报表、网页和物化视图这些上层结构。注意啊,在我们之前的讨论中,这些都是统一抽象成物化视图来处理的。现在的研究难题是:要实现准实时数据仓库,是不是得搞新型的模型结构、新型索引,甚至全新的物化视图?
A third topic involves a scheduling problem: in what order do we schedule the loading of the data warehouse tables, and what time windows are allowed for each table, every time we activate a loading process? Naturally, this problem spans both traditional relations and data structures as well as the aforementioned novel structures that might prove useful for near real time warehousing. A parameter of the problem that might possibly complicate the solution has to do with the refreshment strategy: different choices have to be made in the case where the strategy is periodic as opposed to the case where the strategy is pull-based. Obviously, in the case of partitioning and parallelism, the problem is complicated (although the available computing power is much more) since we need to schedule the simultaneous loading of different constructs or parts of them.
A final topic of research concerns both the warehouse and the flow regulation towards it and has to do with the monitoring of the current workload and the forecast of the forthcoming workload in the very near future. Monitoring is a problem per se, since it has to trade off accuracy with simplicity and a small footprint. Predicting the near future in terms of user load is even harder, to a large extent due to the irregular nature of the behavior of the users. Still, any form of scheduling for the ETL process requires some estimation for the forthcoming user load and thus, a reasonable accurate such estimation is valuable.
第三个问题是调度策略:每次启动加载流程时,该怎么安排各个仓库表的加载顺序?每个表又该分配多长时间窗口?这个问题不仅涉及传统的关系型数据结构,也包括前面说的那些可能适用于准实时仓库的新型结构。有个让问题变复杂的参数是刷新策略——周期性的刷新策略和被动拉取的刷新策略,需要的调度方案完全不同。如果再加上分区和并行处理,问题就更棘手了(虽然计算资源变多了),因为得同时调度不同结构甚至同一结构不同分区的并行加载。
最后一个研究课题同时涉及仓库本身和流向仓库的数据流,就是要监控当前工作负载,并预测接下来很短时间内的负载情况。监控本身就有难度,毕竟要在精度、简易性和资源占用之间找平衡。而预测用户负载就更难了,主要因为用户行为本来就没什么规律。但不管用什么方式调度ETL流程,都得预估接下来的用户负载,所以这个预估还是得尽量靠谱才行。
翻译说明:
- \"找平衡\"准确传达\"trade off\"含义
- \"更棘手\"替代\"complicated\"
- \"没什么规律\"形容用户行为
- \"靠谱\"替代\"reasonable accurate\"更口语化
3.6 Summing up
Coming back to the big picture, the main idea is that the data processing area has to be regulated so that the user request for fresh and complete data is balanced against the computational needs of (a) the source systems (due to their regular workload and the non-negotiable nature of their configuration and performance), (b) the data processing area (due to the rigid necessity for transforming, and sometimes, cleaning source data to a schema and value set that is acceptable for the warehouse), and (c) the warehouse (due to the on-line support of user queries.) Both the user requests and the computational needs of the involved systems can be formulated in part as non-negotiable hard constraints and in part as negotiable soft constraints, accompanied by a satisfaction function that has to be maximized. This formulation can lead to a quality of service problem, provided that we have the means to relate the involved parameters and measures via a realistic cost model. The overall problem for near real time data warehousing is reduced to the design and regulation of such an environment in terms of (a) hardware configuration, (b) software architecture and, given the above, (c) resource allocation so that an acceptable compromise can be found for all these antagonizing goals.
3.6 总结
回到整体架构,核心思路是要对数据处理区进行调控,在用户对数据新鲜度和完整性的需求与以下系统的计算需求之间找到平衡:(a) 源系统(因为它们有常规业务要跑,而且配置和性能都不能随便改);(b) 数据处理区(必须把源数据转换清洗成仓库能接受的格式和值集);(c) 数据仓库(要实时支持用户查询)。用户需求和系统计算需求可以部分表述为不可妥协的硬性约束,部分表述为可协商的软性约束,并配上一个需要最大化的满意度函数。只要我们能通过实际成本模型把相关参数和指标关联起来,这个问题就能转化为服务质量(QoS)问题。
准实时数据仓库的总体问题,归根结底就是要从(a)硬件配置、(b)软件架构以及(c)资源分配这三个方面来设计和调控这样一个环境,在这些互相冲突的目标之间达成可接受的折中方案。
翻译说明:
- \"non-negotiable\"译为\"不能随便改\"
- \"归根结底\"整合原文\"reduced to\"的语义


