1. Introduction
There is an increasing demand in data-acquisition systems for neurophysiology to record simultaneously from many channels over long time periods [
1]. These experiments accumulate large amounts of data, which would be processed by spike sorting systems for analyzing the activities of neurons. A typical spike sorting system [
2,
3] usually involves complicated feature extraction and spike classification operations for separating spikes from background noise and clustering the detected spikes. A large amount of spike trains would impose heavy computational load for a software spike sorting system, resulting in a long processing time.
One approach to reduce the computation time is to implement a spike sorting system by hardware. Hardware systems offering dedicated circuits substantially outperform their software counterparts in terms of computational performance. Hardware solutions are beneficial for neurophysiological signal recordings and analysis where real-time computing is crucial. There are two hardware approaches: Field Programmable Gate Array (FPGA) [
4] and Application-Specific Integrated Circuit (ASIC) [
5]. Some FPGA circuits [
6,
7] possess high area complexities and/or power consumption, which may be suited only for offline processing. On the contrary, ASIC architectures may have lower area costs and power dissipation. Many ASIC-based spike sorting implementations are then proposed for in vivo applications, where area- and power-efficient design is desired.
Many ASIC architectures based on Principal Component Analysis (PCA) [
8,
9,
10] have been proposed for hardware spike sorting. Although they are effective for feature extraction, the inherent complexities for the computations of the covariance matrix and eigenvalue decomposition in the PCA algorithm may impose high hardware and power costs. The PCA variants, such as the Generalized Hebbian Algorithm (GHA) [
11], are able to reduce the hardware costs by lifting the requirements for covariance matrix computation. In the GHA, the principal components are updated incrementally based on a set of training data. Nevertheless, its iterative training procedure may still be a bottleneck for online applications.
Alternatives to PCA and GHA include techniques such as discrete derivatives [
12,
13], integral transform [
14] and zero-crossing [
15] for feature extraction. The techniques feature low computational costs without additional training procedures. Nevertheless, some algorithms may be prone to noises due to the simple feature extraction procedure without taking noise into consideration. In addition, some of the algorithms have not been implemented by hardware. The effectiveness of the algorithms for ASIC implementation as compared with other techniques may still need to be evaluated.
The objective of this paper is to present a novel ASIC implementation for spike sorting featuring high classification accuracy, low area costs and low power consumption. The architecture is based on a novel feature extraction algorithm with low computational complexities. In the feature extraction algorithm, the location of the global minimum (or maximum) of a spike is first identified. Based on the location, the spike is then separated into two portions. The area of each portion is then used as a feature. Similar to the algorithms in [
12,
13,
14,
15], the proposed algorithm is simple to implement. In addition, it may be less susceptible to noise interference. Observe that the variations in the location of the global minimum (or maximum) of a spike due to noises may be small provided that the other local minimal (or maximal) values are significantly larger (or smaller) than the global one. This may be the case for some spike waveforms. The algorithm may then provide high immunity to noise corruption. Furthermore, the area of each position may be divided by the distance between the global maximum and minimum to enhance the classification accuracy.
A novel VLSI architecture is also presented for the novel feature extraction algorithm. In the architecture, the search for the global minimum and the computation of the area of portions separated by the minimum value are carried out concurrently. This is beneficial for pipelining operation for enhancing the throughput of the feature extraction. To further expedite the process, the local peak search and area computation over smaller segments of the spike can be carried out first. The local peak values and areas are subsequently merged with the global ones by a simple hardware circuit. In addition, due to its simplicity, the architecture can be operated in conjunction with other spike detection circuits. In this work, the circuit based on the Non-linear Energy Operator (NEO) [
16] is employed for the multi-channel spike detection. To minimize the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. A Clock Gating (CG) technique [
17] is also employed to supply the system clock only to the active components of the circuit. A number of ASIC implementations are presented to demonstrate the effectiveness of the proposed architecture. Experimental results reveal that the proposed architecture is an effective alternative for in vivo multi-channel spike sorting with high classification accuracy, low power dissipation and low hardware area costs.
The remaining parts of this paper are organized as follows. The proposed feature extraction algorithm and the corresponding VLSI architecture are presented in
Section 2. The architecture of the multi-channel spike sorting system supporting both the NEO and the proposed algorithm is presented in
Section 3.
Section 4 then evaluates the performance of the spike sorting system. Finally,
Section 5 includes some concluding remarks.
3. Proposed Architecture for Multi-Channel Feature Extraction
The proposed architecture can be easily operated in conjunction with the multi-channel spike detection circuit for multi-channel feature extraction.
Figure 13 shows the architecture of a multi-channel spike sorting system based on the proposed feature extraction circuit. As depicted in the figure, there are three components: spike detection circuit, spike buffer and the proposed feature extraction circuit.
The spike detection circuit is able to carry out the detection for multiple channels by the NEO algorithm. Let M be the number of channels for spike sorting. Assume all of the channels are sampled with the same sampling rate . Let be the sampling period. All of the channels are assumed to be sampled and multiplexed by a mixed mode circuit using the round robin approach. The mixed-mode circuit then delivers the samples one at a time to the spike detection circuit. Therefore, the circuit receives M samples during a time interval of length . Different samples received during the interval are from different channels.
The spike detection circuit can be separated into two portions: the NEO buffer and the NEO detection unit. The NEO buffer is a ()-stage Serial-In-Serial-Out (SISO) shift register organized in a snake-like fashion. Each stage contains a sample of a spike train from a channel. It is therefore able to hold N consecutive samples of the spike trains from the M channels.
The bottom row of the buffer provides
N consecutive samples of the spike train from a channel (say, channel
h). It can be seen from
Figure 14 that the bottom row of the NEO buffer is used for both NEO detection and peak alignment. The NEO detection unit takes three consecutive samples of the bottom row to carry out the NEO computation. The computation result is then compared to a given threshold
γ. If the result is larger than the threshold, a hit event is issued. In addition, the entire last row is regarded as a detected spike and is copied to the spike buffer for spike alignment.
The spike detection circuit is able to perform spike detection one channel at a time. After the spike detection of the channel h is completed in the current clock cycle, all of the spike samples already in the NEO buffer are shifted to the next stage, and a new sample from the next channel (selected in a round robin fashion) enters the first stage of the NEO buffer. This allows the spike detection for the next channel to be carried out in the next clock cycle.
The goal of the spike buffer is to hold the detected spikes produced by the spike detection circuit, carry out the alignment and deliver the detected spikes to the proposed feature extraction circuit. As depicted in
Figure 15, there are two stages in the spike buffer: the input stage and the output stage. The input stage is an
N-input
N-output buffer holding the detected spikes provided by the spike detection circuit. The output stage is an
N-input
K-output buffer fetching data
N samples at a time from the input buffer. The buffer then delivers data
K samples at a time to the proposed feature extraction circuit.
The input stage contains M entries, where each entry i stores the detected spike for channel i. Given N, K, the maximum number of channels M can be evaluated. For any channel h in the circuit, a detected spike in that channel could be discarded when the spike is over-written in the spike buffer by the next detected spike from the same channel h before it can be further processed.
Recall from
Section 2.2.3 that the proposed architecture supporting parallel computation is able to accept
K samples at a time. Therefore, the proposed architecture is able to process a detected spike with
N samples in
clock cycles. To find the maximum number of channels
M, the worst case scenario is considered. In the scenario, the detected spikes from all of the
M channels are stored in the spike buffer and are not processed by the feature extraction circuit yet. In addition, the feature extraction circuit is currently busy. Assume that the newest detected spike is from channel
h. In this case, the spike buffer is able to accept another detected spike from channel
h without discarding the old one only after the feature extraction circuit has processed
M spikes. Because the latency for processing each detected spike is
clock cycles, it follows that the input buffer is able to accept another detected spike from channel
h after
clock cycles. Let
Q be the minimum number of samples between the peak of successive spikes detected by the NEO circuit from the same channel. Assume that
Q is the same for all of the channels. It then follows that a detected spike from channel
h is not overwritten and discarded when:
where
is the clock period and
is the clock rate. It is interesting to know that the NEO circuit imposes an additional limit on the number of channels
M. It is desired that the NEO circuit be able to receive one sample from each channel in a single sampling period
. Based on the round robin scheme for fetching samples for different channels, it is therefore necessary that:
Combining Equations (
33) and (
34), we see that the maximum number of channels, denoted by
, should then satisfy:
4. Experimental Results
This section presents the performance of the proposed architecture. The area complexities are first considered. Five types of area costs are evaluated: the number of comparators, adders/subtractors, multipliers/dividers, registers and multiplexers. All of the costs are expressed in terms of the asymptotic function (i.e., the big
O function).
Table 1 shows the area complexities of the proposed feature extraction circuit. It can be observed from
Table 1 that all of the area costs of the GMS units, ACC units, GMS merger units, ACC merger units and FS units are independent of the spike length
N and the number of channels
M. However, for the parallel computation, the number of comparators, adders/subtractors, registers and multiplexers of all of the leaf nodes will grow with the number of segments
K. This is because the total number of leaf nodes is dependent on
K, and each leaf node contains a GMS unit and an ACC unit.
Similarly, because the total number of intermediate nodes is dependent on K and each intermediate node contains a GMS merger unit and an ACC merger unit, the area complexities of comparators, adders/subtractors and multiplexers of all of the intermediate nodes are . On the contrary, there is only one root node for parallel computation, and the root node consists of only the FS unit. Among the units in the proposed architecture, the FS unit is the only unit containing multipliers/dividers. Therefore, the number of multipliers/dividers is fixed, and is independent of N, M and K. Summarizing the discussions stated above, we conclude that the complexities of the total number of comparators, adders/subtractors, registers and multiplexers of the proposed feature extraction circuit with K segments are . Moreover, the total number of multipliers/dividers is only .
Based on
Table 1, the overall area complexities of the proposed spike sorting circuit are summarized in
Table 2. To facilitate the evaluation, the area complexities of the NEO circuit and spike buffer are also included in the table. For the NEO circuit, it is necessary to store spike trains from all of the
M channels for detection. In the spike buffer, each channel needs to have its own memory unit to hold the detected spikes for the subsequent feature extraction. Therefore, it can be observed from
Table 2 that the number of registers in both the NEO circuit and spike buffer are dependent on the spike length
N and the number of channels
M. The total number of registers of the proposed spike sorting circuit is then
. In addition, because the NEO circuit has only a fixed number of adders and multipliers independent of
N and
M, the overall area complexities for the adders and multipliers in the spike sorting circuit are only
and
, respectively.
We next evaluate the actual hardware resource consumption of the proposed circuit. For the remaining evaluations of this section, the dimension of spikes is
. The circuit implementation is carried out by ASIC with the Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm technology and clock gating. The gate level design platform is the Synopsys Design Compiler.
Table 3 shows the area (
m
) of the proposed circuit for different numbers of channels
M and numbers of segments
K. From
Table 3, we see that the area of the proposed circuit grows with
M and
K, which is consistent with the results shown in
Table 2.
Table 4 shows the normalized area (
m
per channel) of the proposed architecture. The normalized area is defined as the area of the circuit divided by the number of channels
M. The normalized area can be viewed as the average area cost per channel. Note that all of the channels share the same computation cores in the NEO circuit (i.e., the NEO detection unit) and the proposed feature extraction circuit. Therefore, we can see from
Table 4 that the average area per channel decreases as
M increases. Consequently, because of the hardware resource sharing scheme, the proposed architecture shows a higher efficiency in area costs for a larger number of channels. Although the normalized area grows with
K for a fixed
M, the circuits with larger
K have the advantages of lower latency for feature extraction. Therefore, as shown in Equation (
35), the channel capacity
grows with
K.
In addition to the area costs, the power consumption of the proposed architecture is also evaluated.
Table 5 shows the normalized power dissipation (
W per channel) for different numbers of channels
M with and without clock gating. The normalized power dissipation is defined as the total power consumption of the circuit divided by the number of channels
M. In the experiment, the number of segment
, and the clock rate
is 1 MHz. The power consumption is measured numerically by Synopsis Prime Time. When the number of channels
M increases, because the normalized area decreases, we can see from
Table 5 that the normalized power consumption also lowers for the circuit without clock gating. In addition, the clock gating is able to further reduce the power consumption by not supplying the clock signal to the inactive components. However, when
M increases, the components of the proposed feature extraction circuit may become more busy because all of the
M channels share the circuit. As a result, when
M increases from 32–64, it can be observed from
Table 5 that the power consumption with clock gating is not lowered. Nevertheless, the the circuit with clock gating is still able to achieve a 33% power reduction as compared with its counterpart without clock gating.
We also compare the proposed architecture with other ASIC implementations [
8,
9,
10,
11] for spike sorting in
Table 6. It may be difficult to directly compare these architectures because they may be implemented by different technologies and may be based on different spike detection and/or feature extraction algorithms. Moreover, their operation clock rates may be different. Nevertheless, it can be observed from
Table 6 that, as compared with the existing architectures, the proposed architecture provides effective area-power performance while supporting both spike detection and feature extraction functions.
Note that, among these existing architectures, the proposed architecture and the one in [
11] are based on the same clock rate, technology and spike detection scheme. We can see from
Table 6 that the normalized area and normalized power of the proposed architecture are only 25.40% (i.e., 21,127 vs. 83,159) and 25.44% (i.e., 20.03 vs. 78.719) of those of the architecture in [
11], respectively. This is because the proposed feature extraction algorithm has significantly lower computational complexities than those of the GHA algorithm adopted by [
11].
Although the proposed feature extraction algorithm has simple computation, it is effective for spike classification.
Table 7,
Table 8,
Table 9 and
Table 10 show the Classification Success Rate (CSR) of the Fuzzy C-Means (FCM) [
19] algorithm by clustering the feature vectors produced by various feature extraction algorithms. The CSR is defined as the number of spikes that are correctly classified divided by the total number of spikes. The PCA, GHA and zero-crossing [
15] algorithms considered in the tables are implemented by MATLAB with double precision floating numbers. The proposed algorithm is implemented by hardware with 10-bit finite precision.
To see the robustness of the proposed algorithm, different types of noise interference are included in the experiments: background noises, interfering neurons and overlapping spikes. They are considered separately to facilitate our observation. The spike trains for the experiments in
Table 7,
Table 8 and
Table 9 are obtained from the simulator developed in [
18]. It also gives access to the ground truth about the spiking activity in the recording for quantitative assessment.
Table 7 shows the CSRs of various feature extraction algorithms for background noises with different SNR levels. In the experiments, the number of interfering neurons is set to be zero. In addition, 20% of the spikes are overlapping. It can be observed from
Table 7 that the CSRs of the proposed algorithm are only slightly degraded as the SNR values decrease. In addition, the proposed algorithm has CSR performance comparable to that of PCA and GHA for all of the SNR levels. Moreover, it outperforms the zero-crossing algorithm. When the SNR = 1 dB, the proposed algorithm is still able to achieve 96.47% CSR. As shown in
Figure 1, the proposed algorithm is robust to background noise because the variations of peak locations due to background noise corruption are small. This leads to successful identification of intervals
and
for the computation of features
and
.
The CSRs of various algorithms for different numbers of interfering neurons are revealed in
Table 8. The interfering neurons are the nearby neurons whose spike times are correlated with the original spike times of the target neurons. We set SNR = 4 dB for the background noises. The percentage of the overlapping spikes for the experiments is 20%. From
Table 8, we can see that only a small degradation is observed for the proposed algorithm as the number of interfering neurons grows. Its performance is also comparable to that of the PCA and GHA algorithms.
The influences of the overlapping spikes on CSRs are considered in
Table 9. In the experiments, the SNR level of background noises is 8 dB. In addition, there are no interfering neurons, so that the the impact of overlapping spikes can be easily observed. It appears from
Table 9 that the proposed algorithm is able to maintain CSRs above 95% even when the percentage of the overlapping spikes is 30%. Because the proposed algorithm may still be able to find the
or
correctly for the overlapping spikes as revealed in
Figure 3 and
Figure 4, successful feature extraction and classification may still be possible. As a result, the proposed algorithm may be able to maintain high CSRs in the presence of overlapping spikes.
In addition to the simulator developed in [
18], the spike trains in the wave_clus database are also considered in the experiments.
Table 10 shows the resulting comparisons. Similar to the results shown in
Table 7,
Table 8 and
Table 9, we can also see from
Table 10 that the CSR performances of the PCA, GHA and the proposed algorithm are comparable. In addition, the proposed algorithm outperforms the zero-crossing algorithm. Although PCA and GHA offer higher classification accuracy, the algorithms have higher computational complexities. We can see from
Table 6 that their hardware implementations may have large area costs and power consumption. The computational complexities of zero-crossing are low. However, the algorithm has inferior CSR values. In particular, for the spike train data “C_Difficult2_noise005” in
Table 10, the CSR of the zero-crossing is only 69.92%. By contrast, the proposed algorithm still maintain high accuracy (i.e., 82.13%) for the data. Therefore, the proposed algorithms have the advantages of low computational complexities for efficient hardware implementation and are capable of providing feature vectors for spike classification with high accuracy.