Automatic generation of Brazilian sign language windows for digital TV systems
© The Brazilian Computer Society 2012
Received: 26 August 2011
Accepted: 7 August 2012
Published: 11 September 2012
Deaf people have serious difficulties accessing information. The support for sign language (their primary means of communication) is rarely addressed in information and communication technologies. Furthermore, there is a lack of works related to machine translation for sign language in real-time and open-domain scenarios, such as TV. To minimize these problems, in this paper, we propose an architecture for machine translation to Brazilian sign language (LIBRAS) and its integration, implementation and evaluation for digital TV systems, a real-time and open-domain scenario. The system, called LibrasTV, allows the LIBRAS windows to be generated and displayed automatically from a closed caption input stream in Brazilian Portuguese. LibrasTV also uses some strategies, such as low time consuming, text-to-gloss machine translation and LIBRAS dictionaries to minimize the computational resources needed to generate the LIBRAS windows in real-time. As a case study, we implemented a prototype of LibrasTV for the Brazilian digital TV system and performed some tests with Brazilian deaf users to evaluate it. Our preliminary evaluation indicated that the proposal is efficient, as long as its delays and bandwidth are low. In addition, as previously mentioned in the literature, avatar-based approaches are not the first choice for the majority of deaf users, who prefer human translation. However, when human interpreters are not available, our proposal is presented as a practical and feasible alternative to fill this gap.
Information and communication technologies (ICTs) are rarely developed taking into account the specific requirements and needs of deaf people . Their primary means of communication are the sign languages , but support for them is rarely addressed in the design of these technologies. In consequence, deaf people have much difficulty communicating and accessing information.
In scientific literature, some works were developed to help deaf people communicate and access information [23, 24, 39].These works offer technological solutions for daily activities which enable them to watch and understand television, to interact with other people, to write a letter, among others. Examples of these solutions are the use of emotive captioning in movies and television programs  and the development of games for training deaf children .
Other works specify formal representations , dictionaries  and recognition  in sign languages (SL). Fusco  proposes a formal representation of signs in Brazilian sign language (LIBRAS) based on XML. The author also developed a 3D virtual animated agent to represent these signs. Buttussi et al.  proposed a multilingual dictionary for storage, visualization and search of signs in SLs. The authors also defined an authoring tool which allows different communities to extend the dictionary for their own SL. Starner et al.  specify a system based on hidden Markov models (HMM) for real-time recognition of sentences in American sign language (ASL) by using a camera to monitor the user’s hands. Kaneko et al.  defined a text-based computer language (TVML) to generate graphic animations. These animations are created by mapping human movements to a predefined graphic skeleton. Human movements are captured by using an optical sensing technology, and are mapped to a 3D model skeleton to generate graphic content for TV.
There are also other works related to machine translation for sign languages (SLs) [17, 31, 34, 35, 42, 44]. Veale et al. , for example, described a multilingual translation system for translating English texts into Japanese sign language (JSL), ASL and Irish sign language (ISL). The translator is based on a blackboard control architecture with a set of demons that cooperate to generate the translated contents. The work explores and extends some artificial intelligence (AI) concepts to SL , such as, knowledge representation, metaphorical reasoning, and blackboard system architecture , but there is no testing or experimentation to evaluate the solution. Therefore, it is not possible to draw conclusions about the feasibility of the solution, the quality and speed of translation, among others.
Zhao et al.  developed an interlanguage-based approach for translating English text into ASL. Input data are analyzed and an intermediate representation (IR) is generated from their syntactic and morphological information. Then, a sign synthesizer uses the IR information to generate the signs. However, as in Veale et al.’s work , no test was conducted to evaluate the solution. Othman and Jemni  proposed a strategy for word alignment based on Jaro-distance and included it into a statistical machine translator for English to ASL. However, only the word alignment has been evaluated. No test or experiment was conducted to evaluate the quality and speed of translation, the application domain, among others.
Morrissey  proposed an example-based machine translation (EBMT) system for translating text into ISL. To do this task, an EBMT approach needs a bilingual corpora. However, due to the lack of a formally adopted or recognized writing system for SL, it is hard to find corpora in SL. Thus, the authors construct their bilingual corpora by using annotated video data. However, the data set was developed from a set of “children’s stories”, which restricts the translation for that particular domain.
San-Segundo et al. [17, 34, 35] proposed an architecture for translating speech into Spanish sign language (LSE) focused on helping deaf people when they want to renew their identity card or driver’s license. The idea of the system is to make the dialogue between deaf people and public officials easier in this kind of service. This translation system consists of three modules: a speech recognizer, a natural language translator and an animation module. The speech recognizer is used for decoding the spoken utterance into a word sequence. The natural language translation module converts the word sequence into a sequence of signs in LSE and the animation module plays the sign sequence. However, this solution is also restricted to a particular (or specific) domain (public services) and the time needed for translating speech into LSE (speech recognition, translation and signing) is around 8 s per sentence, which makes the solution unfeasible for real time domains (e.g., television).
Thus, these works have some limitations. Some of them do not have an assessment of the feasibility and quality of the solution [32, 42, 44], others are only applied to specific domains [17, 31, 34, 35] or are not efficient considering signing and translation speed [17, 34, 35]. These limitations reduce their application to real-time and open-domain scenarios, such as TV. In addition, there are few papers related to support this topic for Brazilian sign language (LIBRAS) in ICTs [2, 38]. These works focus on the synthesis of LIBRAS signs  or in the presentation of signs in the SignWriting1 language , but there is no proposal for investigating strategies for machine translation to LIBRAS.
Therefore, the research question of the current work is “how can we address deaf people’s communication problems in real-time and open-domain scenarios (e.g., TV), especially when human interpreters are not available?” To answer this question, in this paper, we proposed an architecture for machine translation to Brazilian sign language (LIBRAS) in real-time and open-domain scenarios. This architecture, called LibrasTV, was integrated, implemented and evaluated in a digital TV (DTV) environment (real-time and open-domain) and its components allow the LIBRAS window to be generated and displayed automatically from a closed caption input stream in Brazilian Portuguese (BP).
LibrasTV also uses strategies, such as a low time consuming text-to-gloss machine translation strategy and SL dictionaries to minimize the computational resources needed to generate the sign language window in real-time. The text-to-gloss machine translation strategy combines the use of syntactic transfer rules, defined by human specialists (also used in other works), with new strategies of machine translation for SL, such as the use of (1) a statistical data compression method to classify the input tokens and (2) simplification strategies to reduce the complexity of the input text, before the application of translation rules. In addition, SL (LIBRAS) dictionaries store a visual representation of signs and can be stored in clients or loaded from the network channel, allowing SL regional aspects to be respected. Finally, LibrasTV also allows the LIBRAS window to be enabled, disabled, resized or repositioned in the display, allowing users to configure the display of SL windows according to their preferences.
As also mentioned by Kennaway et al. , it is important to point out that LibrasTV does not intend to replace human interpreters, since the quality of machine translation and virtual signing are still not close to the quality of human translation, and signing, especially considering the difficulties of machine translation approaches to explore semantic and contextual ambiguities , and the difficulties of virtual signing approaches to express emotions and represent movements in a more natural way . Thus, the idea is to develop a complementary, practical, high speed and low cost solution that can be used, for example, to provide information for the deaf when human interpreters are not available.
In order to validate these aspects, we have implemented a prototype of LibrasTV for the Brazilian digital TV system (SBTVD) and performed a set of preliminary tests. The objective tests evaluated the speed and cost of the solution in terms of speed of translation, signing and bandwidth. The subjective tests involved Brazilian deaf users to evaluate the proposal in practice and its feasibility for them. The description of LibrasTV, the implementation of the prototype for SBTVD and the set of tests with this prototype will be described in the next sections.
This paper is organized as follows. In Sect. 2, we review the main concepts about LIBRAS. In Sect. 3, we describe LibrasTV. In Sect. 4 we integrate the components of LibrasTV in DTV systems. In Sect. 5, we describe an implementation of LibrasTV for the SBTVD. Some tests to evaluate the proposed solution are described in Sect. 6. Final remarks are given in Sect. 7.
2 LIBRAS linguistic issues
Sign languages are visual languages used by deaf people as their primary means of communication . According to Brito , they are considered natural languages, because they came from the interactions between deaf people and they can express any descriptive, concrete, rational, literal, metaphorical, emotional or abstract concept.
Like spoken languages, sign languages have their own grammars that consist of several linguistic levels, such as morphology, syntax and semantics . They also have lexical items that are called signs . The main difference is the visual–spatial mode. Another difference is related to the language structure. While spoken languages have a sequential structure, i.e., phonemes are produced sequentially in time, sign languages have a parallel structure and can produce signs using several body parts simultaneously.
handshape: finger positions and their movements;
location: the part of the body where the sign begins;
hand movements and facial and/or body expressions—non-manual features (NMF);
In BP: O urso (S) matou (V) o leão (O). (The bear killed the lion.)
In LIBRAS: URSO (T), LEÃO MATAR (C). (Bear, lion to kill.)
In BP: Paulo enviou o livro ao amigo. (Paul sent the book to a friend.)
In LIBRAS: LIVRO AMIGO P-A-U-L-O ENVIAR. (Book friend P-A-U-L-O to send.)
In addition, LIBRAS has some regional differences. The signs could be represented differently according to each region. For example, some signs in LIBRAS are represented differently in the northeast, southeast and south of Brazil.
Currently, the support for LIBRAS in TV is restricted to manual devices, where a window with a LIBRAS interpreter is transmitted over the video program. This solution has high operational costs (cameras, studio, staff, etc.) and needs a full-time interpreter, which restricts its use to a small portion of TV programming. In addition, according to the scientific literature, there is a lack of sign language machine translation approaches developed for real-time and open-domain scenarios, such as TV, as well as a lack of machine translation solutions developed for LIBRAS. These problems motivate the development of LibrasTV solution presented in the next section.
3 LibrasTV architecture
In this section, we describe the LibrasTV architecture. As mentioned in Sect. 1, the proposal of LibrasTV aims to address deaf people’s communication problems in real-time and open-domain scenarios, such as DTV, especially when human interpreters are not available. To address this problem, the LibrasTV architecture is composed of a set of components that allow automatic generation of a LIBRAS windows from closed caption input stream in BP.
These components include a low time consuming text-to-gloss machine translation strategy and LIBRAS dictionaries to minimize the computational resources needed to generate the LIBRAS window in real-time. As mentioned above, the text-to-gloss machine translation strategy was planned for open-domain scenarios and it combines morpho-syntactic transfer rules, defined by human specialists, with new strategies, such as a statistical data compression method to classify the input tokens (words) and a simplification strategy to reduce the complexity of the input before the application of these translation rules, increasing the speed of the translation. In addition, LIBRAS dictionaries are also used to avoid the signs rendering in real-time, which is a very time consuming task. LIBRAS dictionaries store visual representations of signs in LIBRAS (pre-rendered) and each sign has a code (e.g., a textual representation) associated with its representation. Thus, it is possible to generate a video of LIBRAS from the combination of LIBRAS dictionary signs. These dictionaries could be stored in clients or loaded from the network channel. This feature allows the preservation of regional aspects of language.
Several meetings with deaf and LIBRAS interpreter researchers2 were performed to discuss the key aspects of the problem, such as issues related to the translation into LIBRAS, the low acceptance of solutions based on virtual signing (i.e., avatars) [10, 13, 22, 35], size, position and synchronization of the LIBRAS window, among others. These meetings are part of the adopted process model and occurred both at the project design and implementation stages. Thus, it was possible to specify and explore their main requirements before and during the proposal development. A summary of these meetings are given in the Appendix.
Finally, the Exhibitor component receives the encoded LIBRAS stream from the channel. The Exhibitor component also decodes, synchronizes and displays the signs to generate the LIBRAS window. This component is composed of two modules: decoding and exhibition. The decoding module extracts the sequence of glosses and synchronization information from the encoded LIBRAS stream. The exhibition module associates each gloss with its visual representation stored in the LIBRAS dictionary. Thus, the sequence of glosses is converted to a sequence of visual representations that are synchronized to generate the LIBRAS window.
Synchronization between closed caption input in BP and the LIBRAS window output is performed by using the axis-based synchronization model . This model defines synchronization points that are inserted in the stream using timestamps based on a global timer. In this case, the global timer is the reference clock of the closed caption input stream. This clock is extracted from closed caption and is used to generate the presentation timestamps for the signs in the LIBRAS window.
In the next subsections we will detail the LibrasTV components.
3.1 LIBRAS translator component
The LIBRAS translator is responsible for translating the source input stream (i.e., the closed caption stream) into a textual representation in LIBRAS (sequence of glosses) and for encoding this representation along with synchronization information to be transmitted in a communication channel. According to Fig. 1, it is composed of four basic modules: closed caption extraction, machine translation, synchronization and encoding.
The closed caption extraction modules are used to convert the closed caption input streams into a sequence of words in BP. The synchronization module is used to synchronize the closed caption input and the LIBRAS window output. As mentioned earlier, it extracts the reference clock from the input and uses it to generate the timestamps for the signs presentation. The machine translation and encoding modules will be detailed in the next subsections.
3.1.1 Machine translation module
In the first step (i.e., the Tokenizer step), the text in BP is split into a sequence of words (or tokens). Afterwards, the tokens are classified into morphological-syntactic categories (the morphological-syntactic classifier step). To do this task, we use the PPM-C , a variant of prediction by partial matching (PPM) . PPM is a statistical data compression method based on N-order Markov models. It was chosen due to its ability to build accurate statistical models  and to its previous use in other classification problems [6, 27, 28].
PPM builds a statistical model and uses it to store the frequency of different sequences of elements found. After the model is built, the next element of the sequence can be predicted according to its previous N elements. Since higher values for N increase the time needed to compute the algorithm, the order of the algorithm must be chosen so that it will have good results while maintaining acceptable completion times. The PPM-C variant is more efficient than the original implementation in terms of running time and data space exchange for marginally inferior compression.
The morphological-syntactic classifier models morphological-syntactic classes as elements in PPM-C. This model stores sequences of morphological-syntactic classes taken from a corpus of morphological-syntactic classified BP texts. Once a sentence is received for classification, the most likely morphological-syntactic class of each token is selected according to its context in PPM model.
After the tokens are classified, we apply some translation rules (defined by LIBRAS specialists) to translate these tokens (or words) for a representation in gloss notation. Initially, we simplify the text by removing some tokens (the Remove Tokens step). We chose this step because LIBRAS does not define prepositions and articles. Thus, these classes of tokens can be removed. Afterwards, some tokens (or words) are replaced (the Lexical Replacement step) in order to adapt the meaning of the sentence rewritten to LIBRAS, since the LIBRAS vocabulary is smaller than BP’s . For example, the words HOME, HOUSE, HABITATION in BP have the same sign (i.e., the same visual representation) in LIBRAS, the HOME sign. Furthermore, while the BP verbs have a high degree of inflection, the LIBRAS verbs do not inflect. In this case, the BP verbs are replaced by non-inflected gloss verbs (i.e., the LIBRAS verbs). To do this replacement, we use a set of BP to LIBRAS synonyms (BP-LIBRAS dictionary). Finally, proper names and technical terms are spelled in LIBRAS [by handshapes that represent the letters of the token (word)]. Thus, we also apply a dactylology replacement to spell proper names and technical terms. The output generated is a representation in LIBRAS gloss notation.
3.1.2 Encoding module
The encoding module is responsible for encoding the sequence of glosses and the synchronization information generated by machine translation and synchronization modules, respectively. The output of this module, the encoded LIBRAS stream, is used by the Exhibitor component to display and synchronize the signs and, therefore, to generate the LIBRAS window. In this subsection, we describe the encoding protocol used to produce this stream.3
Syntax of SCM message
Syntax of SDM message
According to Tables 1 and 2, the SCM and SDM messages begin with their identification and length fields (sign_control_id and sign_control_length for SCM and sign_data_id and sign_data_length for SDM). These fields are used to identify the type of message (SDM or SCM) and the message length in bytes, respectively.
Values of resolution field
1,920 × 1,080
1,280 × 720
640 × 480
960 × 540
720 × 480
320 × 240
Reserved for future use
On SDM, the gloss_data_bytes fields transport the glosses (used to reference signs on LIBRAS Dictionary) that are being encoded. Since this field is inside a loop, several signs (glosses) can be transmitted in the same message. The field number_of_signs specifies the number of signs encoded in each SDM.
The Exhibitor component is responsible for extracting the data (sequence of glosses and synchronization information) from the encoded LIBRAS stream. It is also responsible for decoding, synchronizing and displaying the signs synchronously. According to Fig. 1, the Exhibitor is composed of two main modules: decoding and exhibition. The decoding module receives the encoded LIBRAS stream and extracts the sequence of glosses and the synchronization information from this stream. The exhibition module gets the sequence of glosses, associates each gloss (sign) with its visual representation in the LIBRAS dictionary and displays the visual representation of the signs. To synchronize the signs with the closed caption, the Exhibitor component uses the synchronization information.
Thus, the sequence of glosses is converted into a sequence of visual representations that are synchronized to generate the LIBRAS window. We also define a neutral configuration (position, background color, brightness) to start and finish the representation of each sign. Thus, the exhibition module can smooth the transition between consecutive signs. The exhibition module can also offer additional features, such as to enable, to disable, to resize or to reposition the LIBRAS window. This feature allows users to choose their display settings.
3.2.1 LIBRAS dictionary
As each sign has a fixed code, the visual representation of signs can be customized. For example, the representation could be an animation with a 3D virtual animated agent (3D avatar) or a recorded video with a sign language interpreter. The 3D avatar could also be customized, for example, as a cartoon character for children. Moreover, regional specificities of LIBRAS language are respected as different dictionaries can be used to represent the same sign.
In the next section, we will describe how to integrate the components of LibrasTV into a DTV system.
4 Integration of LibrasTV into DTV systems
The LIBRAS translator component is integrated with TV stations (or content providers). It receives the closed caption input stream, translates to a sequence of glosses in LIBRAS and encodes them with synchronization information in messages of the encoded LIBRAS stream. These messages will be multiplexed in MPEG-2 TS along with audio, video and data;
The Exhibitor component is coded as an interactive application, called the LibrasTV application. It will run on DTV receivers. The LibrasTV application extracts the data from the encoded LIBRAS stream, decodes, synchronizes and displays the LIBRAS window with the of the LIBRAS dictionary;
LIBRAS dictionary may be stored in an extended memory device (e.g., a USB storage device) which will be plugged into a DTV receiver. In this case, we suppose that DTV receiver supports extended memory devices. Alternatively, the LIBRAS dictionary can be loaded from the interactive channel;
The SCM and SDM messages of encoded LIBRAS stream (defined in Sect. 3.1.2) are transported in MPEG-2 TS. One interesting alternative is to encapsulate these messages in events defined by Digital Storage Media-Command and Control (DSM-CC) specification .
The DSM-CC stream events are transmitted in structures called Stream Event Descriptors allowing synchronization points that are defined at the application level. This structure has a field called eventNPT, which carries a timestamp related to the reference clock of the MPEG-2 TS stream, the Program Clock Reference (PCR). It enables applications to receive events and synchronize their actions with other media, such as video or audio streams. This structure also has a privateDataBytes field to transport private data. Thus, the timestamps for synchronization information and the (SCM and SDM) messages can be encapsulated in this structure (in eventNPT and privateDataBytes fields, respectively).
According to Fig. 3, the LIBRAS translator component (highlighted), which is located at the TV station, receives the closed caption input stream. Then, a process of closed caption extraction is executed, followed by a process of translation of the BP text to a sequence of glosses in LIBRAS. In the next step, the sequence of glosses is encoded in SDM messages and encapsulated in DSM-CC stream events. SCM messages are also generated periodically and encapsulated in DSM-CC stream events. The DSM-CC stream events are packaged, multiplexed in MPEG-2 TS and transmitted in the DTV signal along with the audio and video streams. The TV station also transmits the LibrasTV application (highlighted), which is encoded using DSM-CC Object Carousel standard .
At the DTV receiver side, the LibrasTV application receives the DSM-CC stream events, decodes, synchronizes and displays the signs, according to the LIBRAS dictionary (highlighted) and the initial settings defined in SCM messages. This dictionary may be stored in an extended memory device (e.g., a USB device) or loaded by the return channel (interaction network).
Integrating all components with the TV station—the LIBRAS window would be generated at the TV station and transmitted in MPEG-2 TS as a secondary video.
Integrating all components with DTV receivers.
The first solution consumes high bandwidth, since a secondary video is transmitted in MPEG-2 TS, and does not preserve regional differences because the same LIBRAS window will be generated for all users. The second solution also has some limitations, since it requires a lot of processing by DTV receivers to translate, encode, decode, synchronize and display the LIBRAS window. These arguments justify our decision to integrate LibrasTV using the approach shown in Fig. 3 (i.e., with the LIBRAS translator component integrated with TV stations, the Exhibitor component coded as an interactive application and running on DTV receivers and the LIBRAS dictionary stored in an extended memory device).
LibrasTV could also be integrated with other platforms, such as Web and Digital Cinema. On the Web, for example, the LIBRAS translator component could run on a Web server, generating and streaming the encoded LIBRAS stream from an input video with closed captioning, and the Exhibitor along with the LIBRAS dictionary could run on the client side, generating and displaying the LIBRAS window in the client Web page. Another alternative would be to run all LibrasTV components on a Web server, which would receive the video input and generate the LIBRAS video to be streamed to the client Web page. It would also be possible to adapt LibrasTV input to raw text, allowing the generation of LIBRAS windows from Web texts. In the Digital Cinema, the LibrasTV input could be adapted to receive subtitles based on the Digital Cinema Package (DCP) format.5 These subtitles could be translated into a LIBRAS video that could be streamed to mobile devices, such as a smartphone or a tablet, allowing a deaf person to see the LIBRAS translation on these devices. However, the implementation and validation of LibrasTV in these platforms (Web and Digital Cinema) is outside the scope of this work.
5 Brazilian digital TV system case study
The case study in this work focuses on the SBTVD. To validate the LibrasTV proposal, we implemented a prototype for SBTVD. The next subsections will detail the developed case study.
5.1 LIBRAS translator
The developed prototype of the LIBRAS translator component was implemented considering the Sect. 3.1 requirements and using C++ programming language.
The closed caption extraction module was developed based on definitions of ABNT NBR 15606-1 . This module receives an MPEG-2 TS streaming and extracts BP sentences and synchronization information (i.e., timestamps) from closed captions packets. These timestamps are inserted into DSM-CC stream events.
The morphological-syntactic classification is done based on a Portuguese language corpus, called “Bosque”7 . This corpus was developed by Floresta Sintá(c)tica (syntactic forest) project  and has 9,368 sentences and 186,000 words. These sentences were obtained from “Folha de São Paulo8”, a Brazilian newspaper, and also from “Público9”, a Portuguese newspaper. The entire corpus was morphologically and syntactically classified and fully reviewed by linguists. In our case, we use only the Brazilian Portuguese part of this corpus.
The MorphologicSyntacticAnalyzer class has an instance of the PPMManager class. This class builds a statistic model based on Bosque’s sentences. After that, the PPM-C algorithm  is applied to classify the tokens (morphologically and syntactically). The Markov order defined empirically for the PPM model was 5. This value was chosen in order to maintain a good threshold between accuracy and run time.
The RuleAnalyzer class applies the translation rules, developed by a human specialist, to the sequence of tokens and uses a BP to LIBRAS dictionary to do the lexical replacement step. By excluding the rules from the Remove Tokens, Lexical Replacement and Dactylology Replacement steps, nine high-level translation rules developed by the human specialists were taken into account. The translation rules are loaded from a file and are specified using an XML representation that allows LIBRAS specialists to easily add new rules and also modify or remove previously defined ones. Thus, it is possible to extend the set of translation rules in a simple way, just editing that file.
In this XML representation, each rule has a count property which indicates its number of linguistic elements. For each element, there is a title property to identify it morphologically or syntactically and a newpos property to indicate the new positioning of the element after the rule is applied with the value of “-1” meaning that the word must be removed. There is also an optional tag newproperty to indicate property changes in the element (e.g., every verb in LIBRAS must be in the infinitive form, so the newproperty tag is used to specify this). The elements are defined inside the rule in the order that they must appear in the original text for the rule to be applied. Each rule also has an active property and only rules that are active will be applied by the system, making it easier to test different sets of rules without having to necessarily remove a rule.
The BP to LIBRAS dictionary was developed in two parts. The first part was extracted from the “LIBRAS Illustrated Dictionary of Sao Paulo” , a LIBRAS dictionary which has 43,606 entries, 3,340 images and 3,585 videos, where an interpreter represents the LIBRAS signs. The other one was generated by a human specialist from the verbal inflection variation, where each inflected verb has its translation to its infinitive form. The full dictionary consists of 295,451 entries.
Finally, the list (sequence) of glosses generated by the machine translation module is used by the encoding module to generate the encoded messages, i.e., the SCM and SDM messages, according to module described in Sect. 3.1.2. The SCM are transmitted at periodic intervals of 100 ms. Those messages are then encapsulated in DSM-CC stream events, packaged and sent via User Datagram Protocol (UDP) for multiplexing.
5.2 LibrasTV application
According to Fig. 7, the main class is the LIBRASController. It has instances of LIBRASProcessor and LIBRASPlayer classes which implement the functionalities of the Decoding and Exhibition modules, respectively. It also has instances of other classes of Ginga (e.g., javax.tv.xlet.Xlet, com.sun.dtv.ui.event.UserInputEventListener) to control the user input, to manage the application life cycle, among others.
We had several discussions about the LIBRAS dictionary in the meetings with deaf and LIBRAS interpreter researchers. The main issue discussed was the low acceptance of avatar-based solutions12 by deaf users, also mentioned in other works [10, 13, 22, 35]. According to these authors, a solution based on videos recorded by LIBRAS interpreters, if technically feasible, would be more appropriate, because it would be probably more natural. Thus, we have initially used a LIBRAS dictionary available on the Internet . In this dictionary, each sign of LIBRAS is a video represented by a human interpreter. However, we observed some problems for generating a LIBRAS window in the preliminary tests. For example, the transitions between two consecutive signs were not smooth. The final position of a sign was usually different from the initial position of the next sign. Other problems were the position and configuration of hands, the distance to the camera and the differences in lighting, which disturbed the clarity of the generated signs.
Another problem is related to the dictionary’s update. As LIBRAS is a living language and new signs may arise, it would be always necessary to record new videos for new signs with the same interpreter and under the same conditions of the other signs.
Thus, the deaf and LIBRAS researchers considered that use of avatar is a valid alternative, in the sense that the use of avatars would allow access to a TV program’s audible information, especially when a human interpreter is not available. In Sect. 6, we will discuss some results obtained from tests with users to verify this fact.
After implementing the prototype, some tests with the prototype were performed to evaluate the proposed solution. These tests include quantitative measures and qualitative evaluation with deaf users. Section 6.1 describes the test environment. Sections 6.2 and 6.3 describe the tests and discuss the results obtained.
6.1 Test environment
To perform the tests with the prototype, we used two mini-computers with an Intel Dual Core T3200 2 GHz processor and 4 GB of RAM. One of these computers was used to run the LIBRAS translator prototype (described in Sect. 5.1) and the other to run the OpenGinga with the LibrasTV application prototype (described in Sect. 5.2). The operating system used in both was the Linux Ubuntu 10.0.4 kernel 2.6.32.
The carousel generator generates a data stream with the LibrasTV application;
The video streamer transmits via streaming some test (MPEG-2 TS) videos with BP closed captioning to the LIBRAS translator and multiplexer;
The LIBRAS translator receives the video stream and generates the encoded LIBRAS stream;
The multiplexer receives the video, the encoded LIBRAS stream and the data stream. Afterwards, it multiplexes them into a single stream (MPEG-2 TS) and transmits it via streaming to the other computer that runs OpenGinga;
OpenGinga receives this stream, plays the video, decodes and runs the LibrasTV application that will display the LIBRAS window synchronously.
Videos with closed captioning used in the evaluation
This video is a part of a news program presented on 14 October 2008 on TV Globo, a Brazilian TV station
This video is a part of a news program presented on 31 January 2007 on TV Globo, a Brazilian TV station
This video is a part of a movie produced by UFPB TV (the TV of Federal University of Paraiba-UFPB), developed with academic purposes, and consists of a dialogue between two characters
6.2 Objective measures
In this section, we describe some objective tests performed to evaluate quantitative measures as: the machine translation error, the delay of the LibrasTV, the bandwidth used by the encoded LIBRAS stream, among others.
6.2.1 Machine translation
Initially, we applied the multiple cross-validation technique to evaluate the performance of the morphological-syntactic classifier. In the cross-validation technique, the data set is partitioned into K equal subsets. Then the model is built (or trained) with all the subsets except the one that is used to compute the validation error. The procedure is repeated K times and each cycle uses a different subset for validation.
Measures of the percentage of correct morphological-syntactic classifications over the Bosque corpus
Correct classification (%)
As can be seen, the classifier had an average accuracy of 81.88 % in the classification of the validation sets, i.e, the average rate of misclassification was <20%. Since prepositions and articles, for example, are later removed by the machine translation module (see Sect. 3.1.1), the prepositions and articles misclassifications probably have no effect on the quality of translation. In practice, this implies that the impact of this misclassification rate may be even lower. A further analysis to evaluate the impact of this misclassification rate in the quality of translation is a proposal of future work.
According to Table 6, in these tests, LibrasTV measurements were better than SBP measures for all n-grams precisions. The values of the 4-gram for BLEU was 12 % and for WER was 75.3 %, which helps to evaluate how difficult this task is in an open scenario such as DTV. However, this result is not sufficient to conclude if the proposed translation is good or not. According to Su and Wu , objective evaluation based on objective measures is insufficient to evaluate the quality of translation for SLs, since SLs are visualized and gestural languages. Thus, it is also necessary to perform subjective tests with users. In Sect. 6.3, we will describe some tests performed with Brazilian deaf users to evaluate it.
6.2.2 Delay and bandwidth
We also performed some tests to evaluate the delay of each LibrasTV module and the bandwidth used by the encoded LIBRAS stream.
Measure of the average delay of each module of LibrasTV
According to Table 7, the average delay to run all LibrasTV modules (i.e., the sum of CC Extraction, Machine Translation, Coding, Decoding and Exhibition delays) was <43 ms. The maximum delay obtained (considering the maximum delay of each module) was 142.26 ms, whereas the minimum delay was 20.509 ms. Considering that the test was conducted with an open and representative vocabulary18 and in a real scenario, the LIBRAS windows can be generated in real-time and with an average time probably much smaller than the time taken for a human translation (although with a lower quality of translation too). Furthermore, this average delay time is smaller than the time used in other related works, such as the solution proposed by San-Segundo et al. [17, 34, 35] which reported an average delay for translating speech into LSE of around 8 s per sentence.
According to Fig. 10, the absolute bit rate used to transmit the encoded LIBRAS stream was always <50 Kbps. The average bit rate was 5.37 Kbps for Video1 and 5.57 Kbps for Video2. As seen in Fig. 10, the bandwidth used by the encoded LIBRAS stream was very low. Thus, it may also be possible to transmit this stream in other network platforms, such as the Web. Furthermore, this bandwidth was significantly lower than the bandwidth used if we choose to transmit LIBRAS window as a video instead of encoded LIBRAS stream.
6.3 Evaluation with users
The subjective measures were collected from questionnaires answered by deaf users. The purpose of these tests was to provide qualitative measures about some aspects of the solution, such as the translation quality, the ease of understanding of the LIBRAS window generated by the solution and its naturalness, among others.
User’s profile in terms of level of education and knowledge of LIBRAS and BP
Level of education
Knowledge in LIBRAS (1-to-5)
Knowledge in BP (1-to-5)
Complete elementary school
Incomplete high school
Complete high school
Users were invited to watch the Video3 (Table 4) with the LIBRAS window automatically generated by the LibrasTV and to complete a questionnaire about some aspects of the solution. To perform the test, we generated a SL Dictionary with 119 signs in LIBRAS.20
Sample of questionnaire
(1) Easy to understand? (5 clear, 1 confusing)
5 4 3 2 1
(2) Good LIBRAS grammar? (5 perfect, 1 bad)
5 4 3 2 1
(3) The signing is natural? (5 move likes a person, 1 like a robot)
5 4 3 2 1
(4) Hand and arm movements are correct? (5 perfect, 1 bad)
5 4 3 2 1
(5) Facial expressions are correct? (5 perfect, 1 bad)
5 4 3 2 1
(6) Which choice on the right matches with the content?
(What did the couple discuss?)
B: financial problems
C: problems with children
Average scores for the questions
Quality of movements
Quality of facial expressions
According to Table 10, the clarity had the highest score (3.6). This result is probably compatible with the match-success test, since 80 % of users chose the correct answer to the question. However, this measure would be statistically more significant if more questions (with more contents and more users) were used to assess it. Further analysis with a greater number of questions, contents and users is a proposal of future work.
The quality of movements and facial expressions also had a moderate score (3.4 and 3.6, respectively). However, these measures had the highest SD (1.51 and 1.67, respectively), which shows that the opinions of users in these aspects were more divergent. Grammatical correctness and the naturalness of signs, on the other hand, had the lowest scores (2.1 and 2.8, respectively). As in , we observed some probable causes for this outcome during the tests. For example, during the tests, there were discrepancies between users about the structure of the same sentences in LIBRAS. Like other sign languages (e.g., LSE )), LIBRAS has an important level of flexibility in the structure of sentences. This flexibility is sometimes not well understood and some of the possibilities were considered as wrong sentences.
Another probable cause observed during the tests was that avatar signing naturalness is not comparable to a human signing. As mentioned in previous works [10, 13, 22, 35], avatar-based approaches are not the first choice for the majority of deaf users, who prefer human translation and signing. One of the reasons for this preference, according to Kipp et al. , is the difficulty of virtual signing approaches to represent emotions and movements with less rigidity. Thus, we believe that it is necessary to keep investing more effort to increase flexibility and naturalness of avatar-based solutions.
Finally, there were also discrepancies between users about the correct signing of some signs. For example, users disagreed about the correct signing of the CAFÉ (coffee) and MERCADO (market) signs. One alternative to reduce these discrepancies would be to use custom LIBRAS dictionaries in the users’ DTV receivers. However, the development of custom LIBRAS dictionaries is a very time consuming task. Another alternative would be to invest more effort to standardize LIBRAS. In this case, a wider dissemination of LIBRAS in ICTs (e.g., in TV and the Web), would help to standardize it as has also happened in other languages (e.g., in other minority languages in Spain ).
7 Conclusions and future works
In this paper, we proposed an architecture for automatic generation of LIBRAS windows and we also implemented and evaluated it for DTV systems. The idea is to improve the access of deaf users to digital contents, especially DTV contents, when a human interpreter is not available or the cost of using human interpreters is not feasible.
Some alternatives to integrate LibrasTV in a DTV were discussed and a case study was developed for the SBTVD. This case study includes the implementation of a prototype of the proposed solution and some tests with deaf users to measure some aspects of the proposal for a real DTV system. This initial evaluation indicates that the proposal is efficient in the sense that its delay and bandwidth are low. In addition, as shown in previous works [10, 13, 22, 35], avatar-based approaches are not the first choice for the majority of deaf users, who prefer human translation. However, when human interpreters are not available, our proposal is presented as a practical, complementary and viable alternative to fill this gap.
Despite the focus of this work being the translation for LIBRAS and its evaluation for DTV, we believe that the proposal would be adaptable to other platforms (e.g., Web, Mobile, Digital Cinema) with minor modifications (although we have not evaluated it yet). For example, we believe that we could adapt the solution for the Web platform just by running all LibrasTV components on a Web server. It is also interesting to investigate the use of audio inputs for machine translation to LIBRAS from BP speech. Thus, a proposal of future work is to adapt the proposed solution for audio inputs and other platforms.
Another proposal for future work is the use motion capture tools (e.g., Microsoft Kinect22) to build the signs of the SL dictionary, making it more natural, as well as the development of collaborative strategies to allow deaf and LIBRAS experts to improve the quality of translation and presentation of the LIBRAS window (e.g., by editing or adding translation rules or the translation) over time. The language developed for describing rules (presented in Sect. 5.1) is a first step to perform this task.
SignWriting is a writing system for sign languages, but little known by the deaf.
The deaf and interpreter researchers are members of a LIBRAS research group in the education department of the Federal University of Paraiba in Brazil.
This protocol was submitted as a candidate and is being evaluated by the bodies responsible for defining the standards used in the digital television Brazilian system.
The middleware is a software layer responsible for abstracting the specific characteristics of each receiver, allowing the same interactive application to be executed on receivers from different manufacturers .
DCP is a collection of digital files used to store and convey Digital Cinema (DC) audio, image, and data streams.
Ginga-J is the procedural part of the Ginga middleware, the middleware defined by SBTVD. The Ginga-J APIs are based on the Java programming language .
OpenGinga is an open-source reference implementation of Ginga middleware available on http://gingacdn.lavid.ufpb.br/projects/openginga.
I.e., solutions that use virtual animated agents to represent the signs.
The carousel generator and multiplexer equipment used in these tests are compliant with SBTVD and manufactured by Linear Equipamentos Eletronicos (http://www.linear.com.br).
One LIBRAS interpreter was responsible for translating and the other for reviewing.
To support the implementation of this test, a “MPEG-2 TS IP retransmitter” equipment was used to tune the DTV channel and stream it MPEG-2 TS to LIBRAS translator and multiplexer
According to a survey conducted by Fundação Getúlio Vargas (FGV) and Brazilian Association of Radio and Television (ABERT) , Brazilian DTV channels have, in general, diverse programming, which consists of movies, series and soap operas (35.3 %), news programs (20.3 %), children’s programs (14.1 %), variety shows (12.5 %), sports programs (5.0 %), educational programs (2.6 %), comedy shows (2.5 %), religious programs (2.0 %), reality shows (1.4 %), among others.
We tried to find a larger set of users, but, unfortunately, we could not find more volunteers.
We know that one sample video that lasts 65 s is a small sample, but, unfortunately, we do not have enough 3D designers in our laboratory to generate signs to a larger set of samples. We are working on some alternatives, such as the development of a collaborative Web solution, that will allow users to generate signs in a semi-automatic way. It will make the generation of signs faster.
A 1-to-5 scale was chosen because it was also used in the subjective evaluation of other sign language systems [22, 41] and it is widely used in other subjective tests, such as tests of audio and video quality.
We would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) of Brazil and Linear Equipamentos Eletrônicos for the financial support.
- ABNT NBR 15606-1 Specification (2007) Televisão digital terrestre—codificação de dados e especificações de transmissão para radiodifusão digital—parte 1: codificação de dados (Terrestrial digital television—data coding and specifications of transmission for digital broadcasting. Part 1: data coding)Google Scholar
- Amorim MLC, Assad R, Lóscio BF, Ferraz FS, Meira S (2010) RybenáTV: solução para acessibilidade de surdos para TV Digital (RybenáTV: solution to accessibility for deaf in digital TV). In: Proceedings of the XVI Brazilian symposium on multimedia and the web, vol 1, pp 243–248Google Scholar
- Batista CECF, Araujo TMU et al (2007) TVGrid: a grid architecture to use the idle resources on a digital TV network. In: Proceedings of the 7th IEEE international symposium on cluster computing and the grid, vol 1, pp. 823–828Google Scholar
- Batista LV, Meira MM (2004) Texture classification using the Lempel–Ziv–Welch algorithm. Lect Notes Comput Sci 3171: 444–453Google Scholar
- Blakowiski G, Steinmetz R (1996) A media synchronization survey: reference model, specification and case studies. IEEE J Sel Areas Commun 14:5–35View ArticleGoogle Scholar
- Bratko A et al (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698MATHMathSciNetGoogle Scholar
- Brito LF (1995) Por uma gramática de língua de sinais (For a sign language grammar). Editora Tempo Bras, Rio de JaneiroGoogle Scholar
- Buttussi F, Chittaro L, Coppo M (2007) Using Web3D technologies for visualization and search of signs in an international sign language dictionary. In: Proceedings of the 12th international conference on 3D web technology, vol 1, pp 61–70Google Scholar
- Cleary JG, Witten IH (1984) Data compression using adaptive coding and partial string matching. IEEE Trans Commun 32: 396–402Google Scholar
- Cox S et al (2002) Tessa, a system to aid communication with deaf people. In: Proceedings of the 5th international ACM conference on assistive technologies, vol 1, pp 205–212Google Scholar
- Dicionário Ilustrado de Libras do Estado de São Paulo (LIBRAS Illustrated Dictionary of São Paulo) (2006) http://www.acessasp.sp.gov.br/. Accessed 3 Nov 2011
- Dorr BJ, Jordan PW, Benoit JW (2008) A survey of current paradigms in machine translation. Adv Comput 49:1–68View ArticleGoogle Scholar
- Ferreira FLS, Lemos FH, Araujo TMU (2011) Providing support for sign languages in middlewares compliant with ITU J.202. In: Proceedings of the international symposium on multimedia, vol 1, pp 1–8Google Scholar
- FGV/ABERT (2012) Pesquisa sobre TV digital no Brasil (Research on Brazilian digital TV). http://www.abert.org.br/site/images/stories/pdf/TV_Programacao.pdf. Accessed 15 May 2012
- Freitas C, Rocha P, Bick E (2008) Floresta sintá(c)tica: bigger, thicker and easier. In: Texeira A, de Lima VLS, Oliveira LCO, Quaresma P (eds) Conf comput process port lang. Lect notes comput soc. Springer, Aveiro, pp 216–219Google Scholar
- Fusco E (2004) X-LIBRAS: Um ambiente virtual para a língua Brasileira de sinais (A virtual environment for Brazilian sign language). Dissertation, University of Eurípedes de MaríliaGoogle Scholar
- Gallo B et al (2009) Speech into sign language statistical translation system for deaf people. IEEE Latin Am Trans 7:400–404View ArticleGoogle Scholar
- Haddon L, Paul G (2001) Design in the ICT industry: the role of users. In: Coombs R, Green K, Richards A, Walsh V (eds) Technology and the market: demand, users and innovation. Edward Elgar Publishing, London, pp 201–215Google Scholar
- ISO/IEC 13818-6 TR (1996) Information technology—generic coding of moving pictures and associated information: part 6: extension for digital storage media command and controlGoogle Scholar
- Kaneko H, Hamaguchi N, Doke M, Inoue S (2010) Sign language animation using TVML. In: Proceedings of the 9th ACM SIGGRAPH conference on virtual-reality continuum and its applications in industry (VRCAI ’10), vol 1, pp 289–292Google Scholar
- Kennaway JR, Glauert RW, Zwitserlood I (2007) Providing signed content on the Internet by synthesized animation. ACM Trans Comput Hum Interact 14:15–29View ArticleGoogle Scholar
- Kipp M et al (2011) Assessing the deaf user perspective on sign language avatars. In: Proceedings of the 13th international ACM conference on assistive technologies, vol 1, pp 1–8Google Scholar
- Lee DG, Fels DI, Udo JP (2007) Emotive captioning. Comput Entertain 5:3–15Google Scholar
- Lee S, Henderson V et al (2005) A gesture based American sign language game for deaf children. In: Proceedings of the conference on human factors in computing systems (CHI 2005), vol 1, pp 1589–1592Google Scholar
- Libras (2008) Libras—Dicionário da língua brasileira de sinais (Libras—Brazilian Sign Language Dictionary). http://www.acessobrasil.org.br/libras. Accessed 10 Jan 2010
- Lopez A (2008) Statistical machine translation. ACM Comput Surv 40:1–49View ArticleGoogle Scholar
- Mahoui M, Teahan WJ, Sekhar WJT, Chilukuri S (2008) Identification of gene function using prediction by partial matching (PPM) language models. In: Proceedings of the 17th ACM conference on information and knowledge management, vol 1, pp 779–786Google Scholar
- Medeiros TFL, Cavalvanti AB et al (2011) Heart arrhythmia classification using the PPM algorithm. Proc Biosignals Biorobotics Conf 1:1–5Google Scholar
- Moffat A (1990) Implementing the PPM data compression scheme. IEEE Trans Commun 38:1917–1921View ArticleGoogle Scholar
- Morris S, Smith-Chaigneau A (2005) Interactive TV standards: a guide to MHP, OCAP and Java TV. Elsevier/Focal Press, AmsterdamGoogle Scholar
- Morrissey S (2008) Data-driven machine translation for sign languages. PhD thesis, Dublin City UniversityGoogle Scholar
- Othman A, Jemni M (2011) Statistical sign language machine translation: from English written text to American sign language gloss. Int J Comput Sci Issues 8:65–73Google Scholar
- Papineni K, Roukos S, Ward T, Zhu W (2001) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, vol 1, pp 311–318Google Scholar
- San-Segundo R et al (2008) Speech to sign language translation system for spanish. Speech Commun 50:1009–1020View ArticleGoogle Scholar
- San-Segundo R et al (2011) Development and field evaluation of a Spanish into sign language translation system. Pattern Anal Appl. doi:10.1007/s10044-011-0243-9
- Santos GS, Silveira MS, Aluisio SM (2009) Produção de textos paralelos em língua portuguesa e uma interlíngua de LIBRAS (Production of parallel texts in Portuguese and a LIBRAS interlanguage). In: Proc XXXVI semin integr softw hardw (SEMISH), vol 1, pp 371–385Google Scholar
- Sousa Filho GL, Leite LEC, Batista CECF (2007) Ginga-J: the procedural middleware for the Brazilian digital TV system. J Braz Comput Soc 12:47–56View ArticleGoogle Scholar
- Souza VC, Vieira R (2006) Uma proposta para tradução automática entre Libras e português no sign webmessage (An proposal for automatic translation from Libras to portuguese in Sign Webmessage). In: Proceedings of the 19th Brazilian symposium on artificial intelligence, vol 1, pp 1–10Google Scholar
- Starner T, Pentland A, Weaver J (1998) Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans Pattern Anal Mach Intell 20:1371–1375View ArticleGoogle Scholar
- Stokoe WC (1980) Sign language structure. Annu Rev Anthropol 9:365–390View ArticleGoogle Scholar
- Su HY, Wu CH (2009) Improving structural statistical machine translation for sign language with small corpus using thematic role templates as translation memory. IEEE Trans Audio Speech Lang Process 17:1305–1315View ArticleGoogle Scholar
- Veale T, Conway A, Collins B (1998) The challenges of cross-modal translation: English to sign language translation in the Zardoz system. Mach Transl 13:81–106View ArticleGoogle Scholar
- Wang Q, Chen X, Zhang LG, Wang C, Gao W (2007) Viewpoint invariant sign language recognition. J Comput Vis Image Underst 108:87–97View ArticleGoogle Scholar
- Zhao L et al (2000) Machine translation system from English to American sign language. In: Proceedings of the fourth conference of the association for machine translation, vol 1, pp 293–300Google Scholar