- Open Access
Performance evaluation of single- and multi-hop wireless networks-on-chip with NAS Parallel Benchmarks
Journal of the Brazilian Computer Society volume 21, Article number: 6 (2015)
Parallel processing in the era of many-core processors demands high-performance networks-on-chip and parallel communication based on intra-chip message passing. In this context, wireless networks-on-chip (WiNoCs) emerge to improve inter-core communication, bringing high bandwidth and low power consumption. In order to increase application performance, WiNoCs can support single-hop or multi-hop communication. The main issue is related to the performance of each communication architecture in the face of different parallel workloads. For this reason, the goal of this work is to evaluate and compare single- and multi-hop WiNoC architectures using parallel applications.
The methodology is based on simulations done using the well-known simulator Network Simulator 2 (NS-2) and applications from NAS Parallel Benchmarks (NPB). The WiNoC architectures are based on 2-D mesh topology and ultra wide band (UWB) radio technology. The inter-core transmission is evaluated concerning unicast communication (1:1 and N:1) and broadcast communication (1:N and N:N). This work has, as contribution to the state-of-the-art, the evaluation of both WiNoC designs (single- and multi-hop architectures) with parallel applications.
Based on our results, the single-hop architecture has lower communication delay than the multi-hop version, and for some workloads, there were no packet losses. However, to achieve high-performance communication, the single-hop architecture consumed 63.12 J for the 256-node network, versus 0.22 J consumed by the multi-hop version.
Although single-hop WiNoCs reduce network bottlenecks and increase communication parallelism, they are recommended when energy consumption is not a critical factor.
The search for better performance in computers and the limits of single-core architectures, such as power consumption and restrictions in instruction-level parallelism, favored the emergence of multi- and many-core architectures [1–3]. In this type of architecture, a processor is composed of several cores, and each core is able to process more than one instruction flow (thread) belonging to an application. Therefore, parallel applications make better use of the performance capabilities of multi- and many-core architectures, since concurrent parts of a program can be executed simultaneously, reducing the program’s total execution time.
In order for results to be generated during parallel processing, information exchange between threads is often needed. Messages can be exchanged in the following communication patterns: 1 to 1, 1 to N, N to 1, and N to N [4, 5]. An interconnection between cores must exist for communication to occur.
Buses and crossbar switches are widely used to connect the cores in a multi-core processor [6, 7]. These solutions are not applicable to many-core architectures (processors with a very high number of cores), because the increased number of cores demands an increase in wire length. Longer wires add latency and electrical resistance and make routing packets between cores difficult, making the use of buffers and repeaters necessary.
As a means of providing a more efficient core interconnection, networks-on-chip (known as NoCs) were proposed [9, 10]. NoCs are responsible for establishing packet transmission-based communication between cores. In this approach, each core is associated with a router, which is connected to other routers by wires through which packets travel to their destination. Networks-on-chip mitigate wire-related problems and provide communication-level parallelism.
The introduction of NoCs has reduced message delivery delay and network power consumption. However, since wires were not completely eliminated, they can still cause problems. Due to these wire-related problems, wireless networks-on-chip (WiNoCs) were proposed. In WiNoCs [11, 12], routers are associated with cores, and radio signals are emitted by antennas (transmitter/receiver) to provide communication. Thus, high-bandwidth message exchange between cores is possible, with reductions in power and energy consumption [13, 14].
In parallel processing, performance is expected to increase in the same proportion as the increase in number of processor cores. However, this is not always possible in practice, due to scalability limitations in applications, communication limitations generated by programs during execution, and network restrictions.
It is believed that single-hop WiNoCs may favor some applications by reducing packet delivery latency, as a result of a reduction in router workloads for packet hops in communication between non-adjacent nodes. Additionally, network overloads caused by unnecessary retransmissions so broadcasts can reach every node are also reduced. On the other hand, by using multi-hop communication, transmitters, receivers, and signals can use less power, reducing total energy consumption.
Our previous work described the performance evaluation of single-hop WiNoCs. This article extends that evaluation with a comparison against multi-hop WiNoCs. In this regard, the goal of this work is to evaluate the performance of single-hop and multi-hop WiNoCs by simulating the execution of parallel workloads in these architectures. Another goal is to compare the results for both architectures, pointing out pros and cons for each one. The contribution of this article, therefore, is the proposition of single-hop and multi-hop WiNoCs as an intra-chip communication alternative in the context of parallel applications and the WiNoC design method and evaluation for the related state-of-the-art.
NoCs are communication networks based on packet exchange, used to interconnect the cores of a many-core chip [9, 16]. The main components of a NoC are: a) routers – responsible for packet forwarding, according to a chosen routing protocol, until packets reach their destination; b) network interfaces – responsible for connecting each core to a router; c) links – short wires between routers that consist of an interconnection medium.
Thread-level parallel applications make the best use of a multi-core architecture’s performance. When executed, these applications are split into many instruction flows (threads) that will be processed by the architecture’s cores. During execution, threads communicate using the network-on-chip according to collective communication patterns (from an application behavior point of view) defined by Duato. Figure 1 presents the main communication patterns, which are: a) 1:1 (one-to-one) – a core in the network sends a message to a single other core (unicast); b) 1:N (one-to-all) – a core sends a message to every other core in the network (broadcast); c) N:1 (all-to-one) – all of the cores send a message to a designated core (every core sends a unicast message); d) N:N (all-to-all) – all of the cores send a message to all the other cores (every core sends a broadcast message).
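The four collective patterns above can be made concrete by listing the (source, destination) pairs each one produces. The following is a minimal sketch, assuming cores are numbered from 0 to n - 1; the function names are illustrative and do not come from the paper or from NS-2.

```python
from typing import List, Tuple

Transfer = Tuple[int, int]  # (source core, destination core); numbering is an assumption


def one_to_one(src: int, dst: int) -> List[Transfer]:
    """1:1, a single unicast message."""
    return [(src, dst)]


def one_to_all(src: int, n: int) -> List[Transfer]:
    """1:N, one core reaches every other core (broadcast)."""
    return [(src, d) for d in range(n) if d != src]


def all_to_one(dst: int, n: int) -> List[Transfer]:
    """N:1, every other core sends a unicast message to one designated core."""
    return [(s, dst) for s in range(n) if s != dst]


def all_to_all(n: int) -> List[Transfer]:
    """N:N, every core broadcasts to every other core."""
    return [(s, d) for s in range(n) for d in range(n) if s != d]
```

For a network of n cores, N:N therefore implies n(n - 1) deliveries, which is why broadcast-heavy workloads stress the network far more than 1:1 traffic.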
Freitas et al. highlight the importance of knowing how parallel applications exploit the collective communication patterns. This knowledge is regarded as important in the design of general- or specific-purpose NoCs, so that programs can obtain the best possible performance in a given architecture.
Ganguly et al. show the increasing level of integration between cores in a NoC, which gives rise to many-core architectures. This makes the limitations that arise from the use of wires relevant again. These restrictions represent a hindering factor for NoCs, because communication between distant cores occurs in multiple hops, which naturally use more wires in message exchange, increasing latency and energy dissipation. WiNoCs were proposed in an attempt to solve the latency and energy problems, eliminating wire-related obstacles and increasing bandwidth at the same time.
Like NoCs, WiNoCs have routers physically connected to cores. As shown in Fig. 2, the difference is the network interface between routers, which in WiNoCs is an antenna (transmitter/receiver) that emits radio frequency signals to provide communication, reducing energy dissipation and packet transmission latency.
WiNoCs were created with the goal of solving the high-energy dissipation and high communication latency problems that emerged from the evolution of multi-core architectures into many-core architectures. Because the establishment of communication between cores in a chip is still a recent paradigm, many research works with distinct purposes are being conducted by the scientific community, targeting the development of more efficient and scalable WiNoC architectures.
Our previous works [15, 18, 19] conducted WiNoC studies using Network Simulator 2 (NS-2). Oliveira et al. [18] executed simulations based on synthetic workloads to generate communication between cores. In the other aforementioned works [15, 19], network traffic was generated from the communication present in parallel applications.
Li reports some of the existing differences in, for example, WiNoC protocols, hardware, and topology. Concerning topologies, according to the author, both “pure” and hybrid (mixing wired and wireless connections) topologies can be found.
According to Ganguly et al., if some or all wired connections in a NoC are replaced with wireless high-bandwidth single-hop connections, there will be a reduction in energy dissipation and latency, since these are caused by the multiple hops necessary to accomplish message exchange between distant cores. To keep the distance between cores from growing excessively, the authors suggest splitting the network into smaller sub-networks as it grows. The cores in a sub-network can be interconnected via wires, since they will be physically close, and the communication between sub-networks is wireless. Also, Ganguly et al. present a performance evaluation comparing NoC architectures with two different hybrid architectures (one was designed based on mesh topology and the other had a ring-star topology). For the comparisons, a 64-core architecture was simulated in a cycle-accurate network simulator. The workload was generated by packet injection, with uniformly distributed spatial traffic. The authors concluded that the proposed hybrid networks can achieve higher performance than the compared NoC architectures.
Carloni et al. describe the opportunities and challenges of three emerging core interconnection technologies. The discussed technologies are 3-D topology, nanophotonic (optical) communications, and wireless connections. Regarding wireless connections, the advantages of hybrid solutions were emphasized. The authors finished by stating that the three studied technologies are promising solutions to the traditional NoC problems, but more research is necessary to overcome multiple challenges, such as system architecture and device manufacturing.
Pande et al. also analyzed hybrid NoC architectures. The simulated networks had 128, 256, and 512 cores; in all of them, the number of cores in each sub-network was set to 16, and only the number of sub-networks varied. Sub-network cores were connected in a mesh topology using wires. Wired and wireless connections were used to connect sub-networks. The number of wireless connections varied between 1, 6, and 24. Initially, the wireless links were randomly distributed and the distributions that obtained the best results were later determined. The conclusion was that inserting long-range wireless connections in a NoC significantly improves its performance. Additionally, the performance gains are more significant as the system grows in size.
A comparative study between wireless and optical NoCs is presented by Deb et al. The experiments were conducted in architectures of three sizes: 128, 256, and 512 cores; and with two workloads: uniform random traffic and application-specific traffic. The latter is composed of four distributions: HotSpot, Transpose, Fast Fourier Transform, and Matrix. The authors concluded that WiNoCs have higher bandwidth for all evaluated traffic patterns.
Another work proposing long-distance single-hop wireless communications in a NoC is presented by Wang et al. To analyze the proposed network, an architecture with 100 cores split into 4 sub-networks was simulated in a SystemC-based simulator. Aiming to apply the temporal and spatial distribution behaviors of practical applications, the authors used three traffic-generating techniques: burstiness, injection with Gaussian distribution, and hop distance. Simulation analysis indicated that the addition of wireless routers between the sub-networks generated throughput improvements, in addition to reducing energy dissipation and latency.
Zhao et al. propose an architecture they named McWiNoC (multi-channel wireless network-on-chip), consisting of a WiNoC with ultra-short connections using multi-channel UWB radio technology. In this approach, communication between adjacent cores is multi-hop. The authors also developed a routing algorithm based on core location and a method to prevent deadlocks. In order to evaluate the designed architecture, the authors developed a simulator. In the simulations, synthetic workloads were used. Results showed that McWiNoC had better performance in comparison with traditional NoCs.
A network split into sub-networks and interconnected using wired and wireless connections is also shown by Ganguly et al. The proposal consists of sub-networks using mesh topology in which the cores are connected to each other and to a central hub by wires. In the second level of the network, all of the sub-networks are also connected by wires in a ring topology. Initially, the available wireless connections are probabilistically distributed among pairs of hubs, based on the distance between them, measured in number of hops. After network start-up, the simulated annealing heuristic is used to reassign wireless connections for performance optimization. Tests were conducted with 128, 256, and 512 simulated cores, varying the number of sub-networks and the number of cores per sub-network. Results showed that the proposed network achieved better transfer rates and lower energy dissipation and delay.
No studies were found in the literature that evaluated multi-hop and exclusively single-hop WiNoC architectures using workloads generated by parallel applications. Only synthetic workload-based evaluations were found, and the single-hop WiNoC proposals are part of hybrid architectures. Thus, this work has, as state-of-the-art contributions, the two proposed WiNoC architectures (multi-hop and single-hop) with parallel workloads from NAS Parallel Benchmarks (NPB), along with their design method and evaluation. Besides, the results indicate which approach to use for prototyping WiNoCs.
The performance evaluation methodology for single- and multi-hop WiNoCs was based on the simulation model and divided into three stages. The first stage involved the selection and preparation of workloads, while the second stage consisted in configuring an environment capable of simulating WiNoC architectures. In the third and final stage, the performance evaluation of simulated WiNoCs with parallel workloads was made.
Selection and preparation of workloads
To evaluate the behavior of the proposed WiNoCs regarding communication in existing parallel applications, we decided to use some of the NPB kernels. This choice was made because these applications are intended for evaluating the performance of parallel supercomputers.
The chosen kernels were the following: Conjugate Gradient (CG), Embarrassingly Parallel (EP), Fast Fourier Transform (FT), Integer Sort (IS), and Multi-Grid (MG), configured with class A problem sizes and the Message Passing Interface (MPI) programming model.
These applications accurately represent the traffic pattern in a WiNoC, since they cover all collective communication patterns. As shown in Table 1, the 1:1 communication pattern is well represented by the CG and MG workloads. The N:N communication pattern is significantly present in the EP and IS workloads. N:1 communication makes up 42.38 % of the FT workload, which is the most balanced between unicast (N:1) and broadcast (N:N) communication. Although the 1:N communication pattern is not abundant in the workloads, it is part of N:N communication. In addition to the aforementioned kernels, NPB also includes, for instance, the Scalar Penta-diagonal (SP), Lower Upper Gauss-Seidel (LU), and Block Tridiagonal (BT) applications, which have similar characteristics and distributions of collective communication. They were not used because their size made them impractical to run on the simulator for comparisons between single-hop and multi-hop WiNoCs.
This stage had the goal of looking for alternatives to simulate WiNoC architectures. Because this interconnection model was proposed only recently, there are no simulators available that focus exclusively on this type of architecture. Some authors who conducted simulations using WiNoCs opted to develop their own simulation tools, which were designed to do very specific tasks and therefore are not applicable to this work.
For this reason, the alternatives were either to use NS-2 (a widely used network simulation tool) and adapt it to the intra-chip context or to adapt an existing NoC simulator to work with wireless communication. Although NS-2 has been used in other works in the last few years to simulate NoCs [27, 28], we found no studies that used it in WiNoC simulations. In spite of that, NS-2 (version 2.29) proved to be the more favorable option, given the difficulty of implementing, in a short period of time, all the layers and protocols that constitute a wireless network in a NoC-specific simulator.
NS-2 is a network simulator based on discrete events, popular among academics for being free and open source. The project started at the University of California, Berkeley, and has received contributions from many researchers. It is based on the OSI model network layers, and several technologies were implemented on it, such as queuing policies, routing protocols, transport agents (TCP - Transmission Control Protocol and UDP - User Datagram Protocol), wireless network protocols, and traffic-generating applications. Furthermore, it also has a graphical interface for network visualization, called Network Animator (NAM).
NS-2 was implemented in two programming languages: C++ and OTcl. Its core was written in C++, because it is a more robust and dependable language, allowing for efficient, lower-level code to be developed. The simulation scenario configuration module was written in OTcl, which is an object-oriented, interactive, interpreted Tcl scripting language. With it, configuration parameters can be altered for new simulations without recompiling the entire simulator code.
To configure an NS-2 simulation, a Tcl script is used. In this script, users define the number of nodes in the network, network topology, link type, transport agent, routing protocols, and type of traffic, among others.
NS-2 generates a network trace file as output, which is like a log. In this file, information about every packet that traveled through the network is stored. This information encompasses the type of event (packets being sent, received, or lost), timestamps, origin and destination addresses, energy consumption by the node involved in the event, identifier, packet size and type, an identifier for the message flow to which the packet belongs, among others.
The WiNoC simulations were evaluated according to the well-known, established metrics in network research: packets sent (percentage of sent packets), packets lost (percentage of lost packets), injection rate (average number of bits injected in the network per second), throughput (average number of bits received by the network nodes per second), delay (packet delay from the origin to the destination in milliseconds), and energy consumption (per node and for the entire network, in joules).
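As an illustration of how these metrics relate to trace events, the sketch below computes loss percentage, injection rate, throughput, and mean delay from a list of simplified event records. The `Event` record and its `"s"`/`"r"`/`"d"` codes are a hypothetical simplification introduced here; they are not the actual NS-2 trace format, and defining the loss percentage as dropped packets over sent packets is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    kind: str        # hypothetical codes: "s" = sent, "r" = received, "d" = dropped
    time_s: float    # timestamp in seconds
    packet_id: int
    size_bytes: int


def summarize(events: List[Event], duration_s: float) -> Dict[str, float]:
    """Derive the evaluation metrics from simplified trace events."""
    sent = {e.packet_id: e for e in events if e.kind == "s"}
    received = {e.packet_id: e for e in events if e.kind == "r"}
    dropped = [e for e in events if e.kind == "d"]
    # Delay is measured from origin to destination, in milliseconds.
    delays_ms = [
        (received[p].time_s - sent[p].time_s) * 1000.0
        for p in received
        if p in sent
    ]
    return {
        "loss_pct": 100.0 * len(dropped) / len(sent) if sent else 0.0,
        "injection_bps": 8 * sum(e.size_bytes for e in sent.values()) / duration_s,
        "throughput_bps": 8 * sum(e.size_bytes for e in received.values()) / duration_s,
        "mean_delay_ms": sum(delays_ms) / len(delays_ms) if delays_ms else 0.0,
    }
```

A real analysis would first parse the NS-2 trace file into such records; that parsing step is omitted here because it depends on the trace format in use.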
Preparation of parallel workloads
The kernels selected from NPB were set up and executed with 4, 8, 16, 32, 64, 128, and 256 processes in a multiprocessor cluster in order to record communication information during execution (sender node, destination node, timestamp, type of transmission, etc.). These traces were processed to generate the input traffic files for NS-2 simulations.
The draft of the multi-hop WiNoC architecture was conceived by Oliveira et al. To design the single-hop WiNoC architecture, we opted to change only what was necessary so that communications stopped occurring in multiple hops and were carried out in a single hop. The performance of both architectures can then be compared based only on this difference. Table 2 presents the main features of both architectures and their differences.
Both the single- and multi-hop architectures use mesh network topologies. The radio technology for the WiNoCs is ultra wide band (UWB). Buffer size was fixed at 10 packets, and each packet has a maximum size of 38 bytes, according to the state-of-the-art. What differentiates the single-hop WiNoC from the multi-hop network is the fact that the single-hop architecture does not use routing protocols, while the multi-hop architecture uses the XY protocol. Power expenditure was also altered, since communication range in the single-hop network must be larger due to the fact that communication must occur in only one hop, with no intermediate nodes. Thus, the power used in transmitters, receivers, and signals is higher in the single-hop WiNoC.
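To make the multi-hop side of this comparison concrete, the following is a minimal sketch of dimension-ordered XY routing on a 2-D mesh: a packet first travels along the X dimension until it reaches the destination column, then along Y. The row-major node numbering and the `cols` parameter are assumptions made for illustration; the paper does not describe its node numbering.

```python
from typing import List


def xy_route(src: int, dst: int, cols: int) -> List[int]:
    """Return the node sequence visited under XY routing (row-major node ids assumed)."""
    x, y = src % cols, src // cols
    dst_x, dst_y = dst % cols, dst // cols
    path = [src]
    # First correct the X coordinate, one hop at a time.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * cols + x)
    # Then correct the Y coordinate.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * cols + x)
    return path
```

Under these assumptions, a corner-to-corner transfer in a 4 x 4 mesh takes `len(xy_route(0, 15, 4)) - 1` hops, that is 6, whereas the single-hop architecture delivers the same packet in one transmission at the cost of a longer radio range.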
Power usage in multi-hop and single-hop communication
After some test simulations with the multi-hop WiNoC, the values of 0.9 mW for transmitter and signal power and 1.6 mW for receivers were shown to be viable for simulations in NS-2. These values were enough for the emitted signal to reach neighboring nodes in a mesh topology network with a 1-mm distance between nodes.
For communication between nodes in a WiNoC to be single hop, the employed transmitting signal (Impulse Radio UWB) must be propagated to the entire network, reaching every node. This way, communication between all of the nodes can be achieved in a single hop. For such, we identified three WiNoC configuration parameters that needed to be altered: transmitter power, receiver power, and signal power. We decided to maintain the proportion between the previously established values (multi-hop version), so we multiplied them by the same constant when increasing them.
Aiming to find the lowest possible power values so that the signal could still reach every node in the network, no matter where the origin and destination nodes were located, tests were conducted using synthetic workloads to test packet transmission between nodes, especially unicast communication between the nodes that were the farthest apart. Broadcast communication was also tested. These tests were conducted for all of the different network sizes, and power values were altered until the desired values were determined according to the aforementioned criterion. Table 3 presents the final values obtained through the simulations for each network size.
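The scaling rule described above can be sketched as follows. The base values (0.9 mW for transmitter and signal power, 1.6 mW for receivers) and the 1-mm node spacing come from the multi-hop configuration in the text; the constant `k`, the straight-line distance model, and the function names are assumptions for illustration only, and the values actually used are the ones reported in Table 3.

```python
import math
from typing import Dict

# Multi-hop power values taken from the text, in milliwatts.
MULTI_HOP_MW = {"transmitter": 0.9, "signal": 0.9, "receiver": 1.6}


def scale_power(k: float) -> Dict[str, float]:
    """Multiply all three power parameters by the same constant k (hypothetical value)."""
    return {name: k * mw for name, mw in MULTI_HOP_MW.items()}


def farthest_distance_mm(rows: int, cols: int, spacing_mm: float = 1.0) -> float:
    """Straight-line distance between opposite corners of a rows x cols mesh (assumed model)."""
    return spacing_mm * math.hypot(rows - 1, cols - 1)
```

Because every parameter is multiplied by the same constant, the receiver-to-transmitter power ratio stays fixed while the reachable distance grows; the corner-to-corner distance gives an idea of the range that a single-hop signal has to cover in the worst case.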
The following adaptations were made to NS-2 in order to better support WiNoC simulations:
a) Integration of a UWB radio technology implementation
We integrated into NS-2 (version 2.29) the only available implementation of Impulse Radio UWB (IR-UWB) from the simulator’s contributed code website [30, 31]. This implementation is an NS-2 plug-in. It adds the Dynamic Channel Coding MAC (DCC-MAC) layer and the IR-UWB physical layer (InterferencePhy - PHY) to the simulator. The physical layer uses time hopping to enable simultaneous transmissions in sub-channels, pulse-position modulation (PPM) for signal modulation, convolutional code for channel encoding, and a calculation based on bit error rate (BER) to obtain the error rate for each packet. A propagation model for UWB channels was also incorporated [32, 33].
b) Addition of a Fast Broadcast module
For broadcast communication to occur in an ad hoc network with multi-hop routing, each node must, upon receiving a broadcast packet, forward it to its neighbors so every node in the network will receive the packet. This strategy causes network overloads due to the high number of packets generated in every broadcast forwarding. This is why the Fast Broadcast module was added to the simulator. The module is an NS-2 extension and implements an algorithm designed to reduce the number of forwardings. To initiate communication, Fast Broadcast uses a module-specific application to generate network traffic. This was not viable in WiNoC simulations, due to the use of an application (TrafficTrace) that schedules packet transmissions by reading trace files. As a solution, a modification was made so that Fast Broadcast was triggered directly from TrafficTrace by an agent.
c) Modification of the file-based traffic generation mechanism
The TrafficTrace application generates network traffic by reading binary input files containing two fields to represent transmissions. The first field specifies the time interval, in milliseconds, after which transmissions can occur; the second field defines the size of packets that will be sent. The application proved to be adequate to generate traffic compatible with NPB application workloads. To accomplish this, some modifications needed to be made. The first one consisted in altering the first field of the file so that it would represent the exact moment in which the packet would be sent. The application was also altered so it would generate transmissions from the files only once instead of for the entire duration of the simulation, as was done previously. Further adaptations included the following: the field representing the time in which to send a packet has its unit changed to seconds; the input file type was altered to text from binary, so larger time moments could be represented; a third field was added to the file to determine what kind of agent must be used when sending a packet according to communication type (0 for unicast and 1 for broadcast).
d) Adjustments in the physical layer
Some adaptations were made to the class Phy/WirelessPhy/InterferencePhy. In its original state, the class randomly picks a packet to be received from the synchronization list. This is not viable when there is intense traffic in the network, because rescheduling problems may happen if the predicted acquisition time of a packet expires, which generates a fatal error causing the simulation to crash. To avoid this problem, the class was altered to always choose the last packet to be inserted into the synchronization list, since it has a smaller probability of having an expired predicted reception time. Another necessary modification was the value of the sync_thresh variable. According to Merz et al., this variable sets the sensitivity level for the network, relative to the energy needed to detect reception signals. Its value was changed to 10 dBm from the original −84 dBm, because it is the maximum value recommended by the IEEE 802.15.4a standard document.
e) Adaptations to the Fast Broadcast module to simulate broadcast transmissions in single-hop WiNoCs
The Fast Broadcast module that was integrated into the simulator is responsible for making broadcast transmissions in multi-hop networks. Using an optimization algorithm, some routers are selected to retransmit received packets that correspond to broadcast messages to their neighbors, making packets get to every node in the network. For the module to also be used in single-hop WiNoC simulations with broadcast communication with no packet retransmissions, we studied the module and found out that retransmissions were generated in the process_data_BroadcastMsg method, called after receiving a packet with the recv method of the BroadcastAgent class. The recv method was altered to no longer call the process_data_BroadcastMsg method, so retransmissions would stop. Synthetic workloads were used in simulations in order to test the modifications. Results showed that the modifications did not affect broadcast communication and that packets were no longer being retransmitted. The Fast Broadcast module could then remain being used to generate 1 to N communications without packet routing in single-hop WiNoC simulations. It is important to emphasize that the modifications were only made for single-hop WiNoC simulations. For multi-hop WiNoCs, the unaltered Fast Broadcast module was used.
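The modified traffic file described in item c) has three fields per transmission: the send time in seconds, the packet size, and the agent type (0 for unicast, 1 for broadcast). The sketch below parses such a file under the assumption that each transmission occupies one line with whitespace-separated fields; the text does not specify the separator or the packet size unit, and this parser is an illustration rather than the modified TrafficTrace code.

```python
from typing import List, NamedTuple


class Transmission(NamedTuple):
    send_time_s: float   # exact moment at which the packet is sent, in seconds
    packet_size: int     # packet size (unit not stated in the text)
    broadcast: bool      # True for agent type 1 (broadcast), False for 0 (unicast)


def parse_traffic(text: str) -> List[Transmission]:
    """Parse a text traffic file, one 'time size type' record per line (assumed layout)."""
    transmissions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 or fields[2] not in ("0", "1"):
            raise ValueError(f"malformed traffic record: {line!r}")
        time_s, size, kind = fields
        transmissions.append(Transmission(float(time_s), int(size), kind == "1"))
    return transmissions
```

For example, `parse_traffic("0.5 38 0\n1.25 38 1\n")` yields one unicast transmission scheduled at 0.5 s and one broadcast transmission scheduled at 1.25 s.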
Results and discussion
Our performance evaluation of single-hop WiNoCs is based on unicast communication used for 1:1 and N:1 communication patterns and broadcast communication for 1:N and N:N patterns for the different workloads.
The results presented in the following subsections depict single-hop and multi-hop WiNoC simulations with 4, 8, 16, 32, 64, 128, and 256 cores. These results were generated according to the previously defined evaluation metrics and evaluated taking into consideration the characteristics of the simulated architectures, along with the behavior of each workload. The proposed WiNoCs were evaluated with different sizes to investigate the architectures’ scalability. Multi-hop simulations have a non-deterministic behavior due to the Fast Broadcast module. For this reason, each scenario (architecture/workload) was simulated 33 times, and the results for all scenarios are reported with a 99 % confidence interval.
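One common way to obtain such an interval from 33 repetitions is a Student-t interval around the sample mean. The sketch below assumes that method; the text does not state how the interval was computed. The critical value 2.738 is the approximate two-sided 99 % table value for 32 degrees of freedom.

```python
import math
import statistics
from typing import Sequence, Tuple

# Approximate two-sided 99 % Student-t critical value for 32 degrees of freedom (table value).
T_99_DF32 = 2.738


def ci99(samples: Sequence[float]) -> Tuple[float, float]:
    """99 % confidence interval for the mean of exactly 33 runs (t-interval assumed)."""
    n = len(samples)
    if n != 33:
        raise ValueError("the hard-coded critical value only applies to 33 samples")
    mean = statistics.mean(samples)
    half_width = T_99_DF32 * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

The interval narrows as the run-to-run variation shrinks, so a deterministic scenario (identical results in all 33 runs) collapses to a single point.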
Packets sent and lost
Figure 3 allows us to infer, from the high percentage of unicast packets sent for the CG workload, that point-to-point communications are dominant. Even though the figure shows that, for networks with eight or more nodes and single-hop architecture, communication is 100 % unicast, broadcast communication (Fig. 4) is also present in these scenarios, but its percentages were rounded down to zero for being too small compared to the total. As expected for this workload, the percentage of sent packets of both types is very close for both architectures (single- and multi-hop), which can be explained by the almost complete lack of broadcast communication.
Communication in the CG workload occurs predominantly between adjacent network nodes. Figure 5 shows a maximum packet loss of 0.02 % for single-hop WiNoCs, which shows that even though communication was concentrated around a few node pairs, no bottlenecks were generated for the workload. For the multi-hop architectures with non-square mesh topologies (8, 32, and 128 cores), loss rates of up to 50 % were observed, which occur because nodes are busier forwarding packets.
Although the amount of communication in the EP workload is not very significant when compared to the remaining studied workloads, it is important to analyze it because communication is 100 % broadcast, as shown in Fig. 4. The single-hop WiNoC architecture is shown to be extremely favorable, relative to packet losses, to workloads that operate exclusively with broadcast communication, since no losses were detected in simulations of these types of workload (Fig. 6). In contrast, due to routers being burdened with retransmissions and greater network overload, 9.92 % of the packets were lost in the 256-core multi-hop architecture. Still in accordance with Fig. 6, the multi-hop WiNoC architecture is not scalable regarding packet losses.
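Network sizes of 8, 32, and 128 cores cannot form a square mesh because they are not perfect squares. The sketch below picks the most nearly square rows x columns arrangement for a given core count; this selection rule is an assumption for illustration, since the text does not state the mesh dimensions used in the simulations.

```python
import math
from typing import Tuple


def mesh_dims(n_cores: int) -> Tuple[int, int]:
    """Return (rows, cols) with rows <= cols and rows * cols == n_cores, as square as possible."""
    rows = math.isqrt(n_cores)
    # Walk down from the integer square root until an exact divisor is found.
    while n_cores % rows:
        rows -= 1
    return rows, n_cores // rows


def is_square_mesh(n_cores: int) -> bool:
    rows, cols = mesh_dims(n_cores)
    return rows == cols
```

Under this rule, 16, 64, and 256 cores give 4 x 4, 8 x 8, and 16 x 16 meshes, while 8, 32, and 128 cores give 2 x 4, 4 x 8, and 8 x 16 meshes, whose longer dimension increases the number of hops along one axis.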
The FT workload is composed by over 55 % of broadcast transmissions. Unicast transmissions are generated by the N:1 collective communication pattern. In this pattern, N unicast transmissions are carried on at the same time for a single destination. As can be seen in Figs. 3 and 4, for single-hop WiNoC scenarios, the percentage of sent packets of each type is more balanced than in the multi-hop architecture, in which each broadcast communication generates many packet transmissions, vastly increasing the percentage of broadcast packets sent.
As with the EP workload, single-hop WiNoCs have excellent broadcast performance with the FT workload, since there were no packet losses, as seen in Fig. 6. However, unicast communication (Fig. 5) suffered from high and increasing losses. Such losses derive from the nodes competing to send packets to the destination node. A node cannot receive more than one packet at a time, so in an N:1 communication only one of the N packets arriving at a given moment is received. Even then, the single-hop WiNoC had fewer lost packets than the multi-hop WiNoC, which has increased competition for routers and more network traffic due to packets being resent.
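The reception rule described above can be sketched as an idealized slot model: packets that reach the same destination in the same time slot compete, and only one of them is accepted. The slot abstraction and the function name are assumptions made for illustration; the simulated MAC layer, with its negotiation between nodes, is more elaborate.

```python
from collections import Counter

def n_to_one_losses(arrival_slots):
    """Count received and lost packets when a receiver accepts one packet per slot.

    arrival_slots holds, for each packet, the time slot in which it reaches
    the destination node.
    """
    per_slot = Counter(arrival_slots)
    received = len(per_slot)              # one winner per occupied slot
    lost = len(arrival_slots) - received  # every other packet is discarded
    return received, lost

# Eight senders transmitting at the same moment to a single destination.
received, lost = n_to_one_losses([0] * 8)
print(received, lost)  # 1 7
```

In this model, N simultaneous senders lose N - 1 packets, whereas the same eight packets spread over eight distinct slots would all be received.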
In the IS workload, communication is predominantly of the broadcast type. As Fig. 4 indicates, this predominance is greater than 90.98 % for the single-hop WiNoC scenarios. The workload’s unicast communication (Fig. 3) belongs to the 1:1 and N:1 collective communication patterns, the latter being the most common. Broadcast communication rates are even higher for the multi-hop WiNoC, with a minimum rate of 96.54 %.
The IS workload also suffers from high packet losses, as a consequence of transmissions pertaining to the N:1 collective pattern, as evidenced by Fig. 5. For this workload, the increase in communication incurs a loss, albeit slight, of broadcast-derived packets. Packet losses for the single-hop WiNoCs were smaller for both unicast and broadcast communication when compared to the multi-hop networks.
Also according to Figs. 3 and 4, and as opposed to the FT and IS workloads, the MG workload is predominantly composed of unicast communications, a small portion of which are N:1. Such predominance is more evident in the single-hop WiNoC, in which broadcast communications occur in a single hop.
Packet losses related to unicast communication for the MG workload, as seen in Fig. 5, come from N:1 transmissions. The broadcast-derived losses, on the other hand, are a consequence of the increased number of communications, which causes more packet collisions and increases competition for the transmission medium. For the MG workload, the single-hop WiNoC presented a smaller loss percentage than the multi-hop architecture. This can be explained by the less intense network overload caused by single-hop broadcasts.
Injection rate and throughput
For the CG workload in scenarios with more than 16 nodes, unicast communication is concentrated at the beginning of the simulations and becomes more uniform and sparse during the rest of the time. Because of this initial peak, communication is sparse for most of the simulation, which reduces the average injection rate in the network. This explains the drop in injection rate for both architectures observed in Fig. 7. Concerning broadcast communication (Fig. 8), the scenarios for the single-hop architecture have a smaller injection rate because they avoid the retransmissions that would cause more packets to be injected into the network.
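The effect of an initial burst on the average injection rate can be sketched as follows. The timestamps, packet size, and window length are hypothetical values, and the function is an illustration rather than the measurement procedure used in the simulations.

```python
from collections import Counter

def injection_rates(timestamps, packet_bits, duration, window):
    """Return (average, peak) injection rate in bits per second.

    The average is taken over the whole run; the peak is the busiest
    fixed-length window.
    """
    average = len(timestamps) * packet_bits / duration
    n_windows = max(1, round(duration / window))
    bins = Counter(min(int(t / window), n_windows - 1) for t in timestamps)
    peak = max(bins.values()) * packet_bits / window if bins else 0.0
    return average, peak

# Hypothetical trace: a burst of 10 packets at the start, then sparse traffic.
burst = [i * 0.001 for i in range(10)]
sparse = [0.5, 0.9]
average, peak = injection_rates(burst + sparse, packet_bits=1000,
                                duration=1.0, window=0.1)
print(f"average = {average:.0f} bps, peak = {peak:.0f} bps")
```

The longer the sparse phase lasts relative to the burst, the further the average falls below the peak, which is the behavior attributed to the larger CG scenarios.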
Figure 9 shows that, concerning network throughput for the CG workload, the single-hop WiNoC had excellent performance by keeping throughput rates equal to the injection rates for unicast communication. For broadcast communication (Fig. 10), the throughput is naturally higher because each transmitted message is received by every node in the network. In the scenarios with 4, 8, 16, 32, 64, and 128 routers, throughput for the single-hop architecture is smaller than for multi-hop because there are no retransmissions for packets to reach every node. The network with 256 routers, however, has higher throughput in the single-hop architecture due to the higher number of packets from broadcast communication being received at the same time by more nodes.
Figure 8 shows that, for the EP workload, the increase in communication as the number of network nodes grows causes the injection rate to grow as well. The reduction in injection rate for the 256-node single-hop WiNoC occurs due to slight peaks in the distribution of communication over simulation time, which mildly reduce the average injection rate. As also evidenced by Fig. 8, packet retransmissions generated by the multiple hops in multi-hop architectures cause the injection rates to be much higher in the corresponding scenarios in comparison with single-hop architectures.
Throughput deriving from broadcast communication for the EP workload also grows as the number of nodes increases. In the single-hop WiNoC, throughput is up to 200 times higher than the injection rate. This behavior is explained by every node receiving every packet sent in the network: all of the routers in a single-hop WiNoC simultaneously receive a broadcast message when it is transmitted. Therefore, throughput is higher than in a multi-hop network for sizes starting from 64 nodes, and it increases considerably as more nodes are added. For the scenarios simulating networks with 4–32 nodes, packet retransmissions give the multi-hop WiNoC higher throughput.
Figure 7 shows, additionally, that injection rates stemming from unicast communication for the FT workload are extremely low, reaching a maximum of 3.5 kbps in the single-hop WiNoC. This occurs due to the small amount of communication of this type. Since broadcast communication is also not abundant in this workload, the injection rates derived from it (Fig. 8) are also not high, although greater, reaching 12.69 kbps. For the multi-hop WiNoC, injection rates for broadcast communication are notably higher due to the multiple hops.
Throughputs for unicast communication in the FT workload, as seen in Fig. 9, were naturally smaller relative to injection rate, due to high packet losses. Despite low unicast throughput rates, broadcast communication achieved a high growth for the scenarios with the most nodes. The increase is drastic when compared to injection rates. On the 256-node single-hop WiNoC, it was over 237 times higher, which represents an increase by a factor of almost the number of existing nodes. This corresponds to the expected behavior for broadcast communication. As was the case with the EP workload, throughput for the FT kernel in single-hop networks with 64 or more routers is higher than in corresponding multi-hop networks, because of broadcast communication.
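The relation between the observed factor and the number of nodes can be checked with a short calculation. It assumes that, in a single-hop network, each broadcast packet is received by every node except the sender, so the ideal ratio of aggregate throughput to injection rate is N - 1; the function names are illustrative, and the value 237 is used as a lower bound, since the text reports a factor of over 237.

```python
def ideal_broadcast_gain(n_nodes):
    """Receivers per single-hop broadcast: every node except the sender."""
    return n_nodes - 1

def fraction_of_ideal(observed_gain, n_nodes):
    """Observed throughput/injection ratio relative to the ideal gain."""
    return observed_gain / ideal_broadcast_gain(n_nodes)

print(ideal_broadcast_gain(256))              # 255
print(round(fraction_of_ideal(237, 256), 3))  # 0.929
```

The reported factor therefore corresponds to roughly 93 % of the ideal gain for 256 nodes, which is consistent with the statement that it is almost the number of existing nodes.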
Figure 7 indicates that injection rates derived from unicast communication for the IS kernel are low, but are higher than the ones obtained with FT, which is justified by the presence of 1:1 communication. Broadcast communication obtained higher rates due to being predominant in this workload. Multi-hop WiNoCs have higher injection rates with IS because of the multiple hops.
Throughput for the IS workload has the same behavior as for FT. The reasons behind this are the aforementioned N:1- and broadcast-related characteristics. For this workload, the single-hop WiNoC obtained better throughput rates in scenarios with more than 16 routers. Such a result is natural, given the increase in broadcast packet reception caused by the greater number of network nodes.
For MG workloads, communication was more concentrated during the beginning of the simulations and tended to be more sparse in larger, longer-lasting network simulations. This is the reason why injection rates for unicast communication, shown in Fig. 7, are irregular throughout the various network sizes. The same pattern can be observed in broadcast communication (Fig. 8). Single- and multi-hop WiNoCs had practically the same injection rate for unicast communication, due to low interference from broadcast communication. Regarding broadcasts, injection rates were lower in the single-hop WiNoC than in the multi-hop network.
In the single-hop WiNoC, unicast communication from the MG workload had very close throughput and injection rate values. Throughput is the smaller of the two, due to packet losses in N:1 communication. Concerning broadcast, throughput was similar to what was observed for other workloads, that is, it grows as the network grows, for the same aforementioned reasons. For the MG workload, throughput for networks with more routers is also higher in the single-hop architecture, since more packets are received simultaneously.
Communication delay and energy consumption
Figure 11 shows average delay values for unicast communication in both single- and multi-hop WiNoCs. We can see both architectures presented a non-uniform behavior pertaining to average delay values for unicast communication in the CG workloads, relative to node quantity. This happens because communication occurs among just a few nodes in the network, and the distance between these nodes can vary according to the way in which they were distributed in the architecture. It can also be noted that the single-hop WiNoC had better performance, because it relays messages directly to the destination nodes.
For the FT and IS workloads, delay behavior in unicast communication is also non-uniform. This can be explained by the characteristics of N:1 communication. During packet exchanges, if more than one node wishes to send a packet to a destination node at the same time, a selection is carried out to decide which packet will be received, since the router can only receive one packet at a time. The other packets are then discarded, which affects packet loss percentages, as previously stated. When the selected packet originates from the receiving router itself, the delay is practically null; otherwise, a negotiation between the involved nodes takes place, increasing communication delay.
Smaller delay values were obtained by the multi-hop WiNoC in unicast communication for these workloads. The higher delay in the single-hop WiNoC for this type of communication reflects the larger number of transmissions that were actually delivered, since the multi-hop architecture had higher packet loss rates. Besides, delay also increases with high medium utilization, caused by unicast messages being emitted to the entire network, independently of the positions of the origin and destination routers.
Delay values for unicast communication using the IS workloads were higher in the networks of sizes 8 and 128, which occurs due to negotiations to select which requesting node will be granted permission to send a packet in N:1 communication. Possibly, since communications of this type were lost in networks of other sizes, delay values were lower, pertaining only to 1:1 communication.
For the MG workload, unicast delay values for the single-hop WiNoC increase with network size in networks with 8 or more routers. This increase is expected, due to the high number of packet exchanges. The same is observed in the multi-hop WiNoC, although it is less pronounced in the square mesh topology (4, 16, 64, and 256 routers) scenarios. Even though the CG workload has more communication, delays for the MG workload are larger because it is executed in less time, which increases competition for the transmission medium. As in the FT and IS workloads, delays in the single-hop architecture surpassed those in the multi-hop WiNoCs.
Figure 12 shows the average delay values for both single- and multi-hop architectures in broadcast communication. Delay values for every workload increased with network size and were proportional to the amount of communication that took place. Delays for the CG workload are significantly lower than others in the single-hop WiNoC, due to the small number of broadcast transmissions. The single-hop WiNoC had considerably smaller delay values from broadcast transmissions for all workloads, as a consequence of carrying them out in a single hop.
Because transmitter, receiver, and signal power must be increased to make single-hop communication viable, energy consumption values are considerably higher in single-hop WiNoCs when compared to multi-hop networks. Figure 13 shows average energy consumption values for each node in the single- and multi-hop WiNoCs. Energy consumption in the simulations increases with network size, and consumption per workload is associated with the number of packet transmissions in each of them. The increase with network size occurs because the same power values are used regardless of the distance between nodes, so they must be set to accommodate the largest distance.
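The need to size the power for the largest distance can be illustrated with a geometric sketch. It assumes routers placed on a uniform grid with unit spacing, a straight-line path between the two farthest corners, and received power that decays with distance raised to a path-loss exponent. The exponent value and the function names are assumptions; the propagation model and power settings actually used in the simulations are not reproduced here.

```python
import math

def farthest_to_nearest_ratio(rows, cols):
    """Corner-to-corner distance divided by the distance between neighbors."""
    return math.hypot(rows - 1, cols - 1)

def relative_tx_power(rows, cols, path_loss_exponent=2.0):
    """Transmit power needed to reach the farthest node, relative to the
    power needed to reach an adjacent node (assumed power-law path loss)."""
    return farthest_to_nearest_ratio(rows, cols) ** path_loss_exponent

for rows, cols in ((2, 2), (4, 4), (8, 8), (16, 16)):
    print(f"{rows}x{cols}: relative power = {relative_tx_power(rows, cols):.1f}")
```

Under these assumptions the required power grows with the square of the mesh diagonal, which gives an intuition for why a fixed, worst-case power setting becomes costly as the network grows.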
The single-hop WiNoC was more scalable relative to packet losses in 1:1 unicast communication, presenting no losses for the CG workload and slightly increased losses for the MG workload. In the multi-hop architecture, losses got bigger as the number of cores was increased, but scalability is more critical in nonsquare-shaped mesh networks, which have significantly larger packet loss rates. Pertaining to N:1 communication, losses grew significantly as networks got larger in both architectures, but the growth was steeper in multi-hop architectures, rendering them less scalable.
Scalability with respect to latency for predominantly 1:1 unicast communication (CG) is better in the single-hop WiNoC. Conversely, the MG workload scaled better in the multi-hop WiNoC. This is not regarded as an advantage, however, because packet loss rates were considerably smaller in the single-hop architecture: its increased latency is explained by delays in delivering packets that were lost in the multi-hop network.
In broadcast communication, the scalability of the single-hop WiNoC relative to packet losses is excellent, given the absence of packet losses for the EP workload, the low loss rates for the IS workload, and the decreasing loss rates for the MG workload as the number of network nodes increases. Loss rates on the multi-hop WiNoC, on the other hand, increase as the network grows. Scalability relative to broadcast communication in workloads that rely predominantly on this pattern (EP and IS) is better in the single-hop WiNoC, due to the gentler slope of its curve as the number of cores increases.
In addition to consuming less energy, the multi-hop WiNoC has an energy consumption curve with a gentler slope as the number of cores increases, which shows its better scalability compared to the single-hop network.
WiNoC simulation results for workloads using predominantly unicast communication (CG and MG) showed that the single-hop WiNoCs have excellent performance, with low packet loss and delay values. Even though the MG workload had higher delay for the single-hop architecture (Fig. 11), this was not considered a negative point compared to the multi-hop architecture, since delay values for broadcast packets (Fig. 12) still show a better performance for single-hop architectures.
For the FT and IS workloads, high packet losses in unicast communication were recorded for both architectures, being more accentuated in the multi-hop WiNoC. The registered delays were also high, with the single-hop architecture being the slowest of the two. The high latency values can be attributed to N:1 communication, whose packets arrive simultaneously, and the higher values observed in the single-hop architecture can be explained by its lower packet loss rates.
Regarding broadcast communication, the simulated single-hop WiNoC architecture was shown to be extremely favorable to it. For the EP workload, which works exclusively with broadcasts, not a single packet was lost, and for the MG workload, the maximum registered loss rate was 2.21 %. In all simulated scenarios, losses were lower for the single-hop architecture, considering equivalent scenarios for both. The single-hop WiNoC also had better broadcast delay results for every scenario and workload.
Energy consumption was observed to be a weak point of single-hop WiNoCs and demands further studies to find improvements. For the MG workload, energy consumption reached 63.12 J for the 256-node network (Fig. 14). Consumption was higher in the single-hop architecture in every scenario and for every workload. The higher energy consumption is explained by the use of higher power values in the transmitters, receivers, and signals, which is necessary to enable communication in a single hop. Besides, the same power values are used in every unicast communication, without taking the distance between nodes into account.
Given the desire to execute increasingly complex programs in feasible time, and since applications tend to exploit parallelism more and more, it is speculated that tens or hundreds of cores may be incorporated into the same processor in the future (many-core architecture). WiNoCs were conceived as a way of assuring better performance for many-core processors. This solution consists of attaching to the processor cores routers that communicate by means of radio antennas. Packets are then sent by routers to their neighbors until they reach their destination.
To improve the performance of traditional (multi-hop) WiNoC architectures, especially concerning delays in packet delivery, single-hop WiNoCs were proposed, that is, WiNoCs in which communications are conducted in a single hop. On the other hand, in this new approach, transmitters, receivers, and signals must use more power, increasing total power consumption. In this article, two WiNoC architectures (one single-hop architecture and one multi-hop architecture) were presented and evaluated.
To accomplish that, a simulation environment created with the NS-2 simulator for the intra-chip context was used. Simulations were conducted using NPB kernels (CG, EP, FT, IS, and MG). The 1:1 and N:1 collective communication patterns present in the workloads were simulated using unicast communication, and the 1:N and N:N patterns were simulated using broadcast communication.
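The mapping from collective communication patterns to transmission types can be sketched as a small expansion routine. The function name, the tuple representation, and the "*" marker for a broadcast destination are illustrative assumptions and do not correspond to the trace format used in the NS-2 environment.

```python
BROADCAST = "*"  # illustrative marker for "all nodes"

def expand_collective(pattern, senders, root=None):
    """Expand one collective operation into (source, destination) transmissions.

    1:1 and N:1 become unicast transmissions toward `root`;
    1:N and N:N become broadcast transmissions.
    """
    senders = list(senders)
    if pattern in ("1:1", "1:N") and len(senders) != 1:
        raise ValueError(f"{pattern} expects exactly one sender")
    if pattern in ("1:1", "N:1"):
        if root is None:
            raise ValueError(f"{pattern} needs a destination node")
        return [(src, root) for src in senders]
    if pattern in ("1:N", "N:N"):
        return [(src, BROADCAST) for src in senders]
    raise ValueError(f"unknown pattern: {pattern}")

print(expand_collective("N:1", [1, 2, 3], root=0))  # [(1, 0), (2, 0), (3, 0)]
print(expand_collective("1:N", [0]))                # [(0, '*')]
```

In this representation, an N:1 operation produces N simultaneous unicast transmissions toward one node, which is the situation in which the receiver can accept only one packet at a time.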
In the context of many-core processor architectures, WiNoCs have gained traction as support for parallel application threads. It is therefore important to highlight the correlation of performance and energy consumption with collective communication patterns, as discussed in the “Results and discussion” section, as a crucial factor for design decisions. Although both networks are scalable due to the nature of the architectural design, viability is linked to the scalability reachable by parallel applications. This is a very particular situation of the studied context, since a single parallel application uses the entire WiNoC, contrary to what happens in distributed systems, in which different applications communicate through the wireless network. In the current high-performance computing scenario, energy efficiency is a vital factor for a many-core processor design to be viable. Although single-hop and multi-hop WiNoCs have shown different behaviors for the same workloads, neither can be ruled out. Thus, design trade-offs can lead to a hybrid and reconfigurable WiNoC design or even to a lower-performance design in order to attain energy consumption gains.
As future work, ways of reducing single-hop WiNoC energy consumption must be investigated, including other topologies such as 3-D mesh, which allows for a reduction in power by diminishing the distance between nodes. Other options to be explored are hybrid WiNoCs, which mix wired and wireless transmissions, and higher-powered multi-hop WiNoCs that are able to reduce the number of hops in transmissions. Characteristics such as antenna size and frequency must also be taken into account as a way of reducing the power necessary to accomplish communication in a single hop, consequently reducing total energy consumption.
This work was partially supported by FIP PUC Minas, FAPEMIG, and CNPq. Our special thanks to Matheus Queiroz who helped us translate and review the article.
The authors declare that they have no competing interests.
All authors read and approved the final manuscript.
About this article
Cite this article
Amorim, A.M., Oliveira, P.A. & Freitas, H.C. Performance evaluation of single- and multi-hop wireless networks-on-chip with NAS Parallel Benchmarks. J Braz Comput Soc 21, 6 (2015). https://doi.org/10.1186/s13173-015-0027-y
- Performance evaluation
- Wireless networks-on-chip
- Single- and multi-hop architectures
- Parallel workloads