MATRIX TRANSFER ACCELERATOR SYSTEM AND METHOD

Document Type and Number:

WIPO Patent Application WO/2018/160773

Abstract:

A matrix transfer accelerator (MTA) (0111) system/method coordinates data transfers between an external data memory (EDM) (0130) and a local data memory (LDM) (0114) using matrix tiling and/or grouping. The system utilizes foreground/background buffering that overlaps compute and data transfer operations and permits data transfers with or without zero-pad peripheral matrix filling. The system may incorporate a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from the EDM (0130) to the LDM (0114) based on a set of DMA controller registers including a data width register (DWR), a transfer count register (TCR), a fill count register (FCR), an EDM source address register (ESR), and an LDM target address register (LTR). The ZDC transfers data from the EDM (0130) at the ESR address to the LDM (0114) at the LTR address, such that the matrix written to the LDM is automatically zero-filled around its periphery based on the FCR value.

Inventors:

REDFERN ARTHUR (US)
BHARDWAJ ASHEESH (US)

Application Number:

PCT/US2018/020334

Publication Date:

September 07, 2018

Filing Date:

February 28, 2018

Assignee:

TEXAS INSTRUMENTS INC (US)
TEXAS INSTRUMENTS JAPAN LTD (JP)

International Classes:

G06F17/16; G06F9/38

Foreign References:

US5099447A	1992-03-24
US20040136316A1	2004-07-15
US5870568A	1999-02-09
US5745793A	1998-04-28

Attorney, Agent or Firm:

DAVIS, Jr., Michael A. et al. (US)

Claims:

CLAIMS

What is claimed is:

1. A matrix transfer accelerator (MTA) system comprising:

(a) an external data memory (EDM);

(b) a local data memory (LDM); and

(c) a data transfer processor (DTP);

wherein:

said EDM includes one or more input feature map (IFM) storage elements;

said IFM include one or more large feature map (LFM) storage elements; and

said DTP is configured to transfer data between said EDM and said LDM by sequentially executing the following operations:

(1) initializing a column tile processing counter (C=0);

(2) transferring a column tile of LFM[*,C] from said EDM to said LDM;

(3) processing data in a first column tile of said LFM[*,C] stored in said LDM;

(4) transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(5) incrementing said column tile counter (C=C+1);

(6) concurrent with operation step (7), processing data in first half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(7) concurrent with operation step (6), transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(8) processing data in second half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]); and

(9) determining if all column tile processing is complete, and if not, proceeding to said step (5).
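
By way of illustration only, the operation ordering recited in Claim 1 can be sketched in Python as a list of events. The function name `ifm_no_pad_schedule`, the `num_tiles` parameter, and the event labels are hypothetical and do not appear in the claims. The overlap of steps (6) and (7) is represented by pairing the two events in one tuple rather than by real parallel hardware, and the sketch assumes that no prefetch is issued once the last column tile has been loaded.

```python
def ifm_no_pad_schedule(num_tiles):
    """Return the ordered events for num_tiles column tiles (sketch only)."""
    if num_tiles < 2:
        raise ValueError("this sketch assumes at least two column tiles")
    events = []
    c = 0                                       # step (1): C = 0
    events.append(("xfer", c))                  # step (2): EDM -> LDM, tile C
    events.append(("proc_first_tile", c))       # step (3)
    events.append(("xfer", c + 1))              # step (4): tile C+1
    while True:
        c += 1                                  # step (5): C = C + 1
        # Assumption: skip the prefetch when tile C+1 does not exist.
        prefetch = ("xfer", c + 1) if c + 1 < num_tiles else None
        # Steps (6) and (7) overlap: compute on tiles C-1 and C while
        # tile C+1 streams in from the EDM.
        events.append((("proc_first_half", c - 1, c), prefetch))
        events.append(("proc_second_half", c - 1, c))   # step (8)
        if c == num_tiles - 1:                  # step (9): all tiles done?
            break
    return events


if __name__ == "__main__":
    for event in ifm_no_pad_schedule(3):
        print(event)
```

In this sketch each column tile is transferred exactly once, and every transfer after the first two is paired with a processing step.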

2. The matrix transfer accelerator (MTA) system of Claim 1 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

3. The matrix transfer accelerator (MTA) system of Claim 1 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.
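
By way of illustration only, the three-stage arrangement recited in Claims 2 and 3 can be simulated sequentially in Python. The names `zdc_pipeline` and `fill` are hypothetical. The sketch assumes one matrix row per transfer, models the circular write buffer as a FIFO, and applies only left and right zero fill to each row; it shows the ping-pong use of the two read buffers and is not a model of the actual hardware timing.

```python
from collections import deque


def zdc_pipeline(edm_rows, fill):
    """Step-wise simulation of the three transfer stages (sketch only)."""
    buffers = [None, None]   # two read data buffers used in ping-pong fashion
    cwb = deque()            # circular write buffer, modeled as a FIFO
    ldm = []                 # rows that have reached local data memory
    n = len(edm_rows)
    for step in range(n + 2):            # two extra steps drain the pipeline
        # Third stage: move the oldest CWB entry into the LDM.
        if cwb:
            ldm.append(cwb.popleft())
        # Second stage: read the buffer filled on the previous step and
        # write it to the CWB with zeros added on both sides.
        if 1 <= step <= n:
            row = buffers[(step - 1) % 2]
            cwb.append([0] * fill + row + [0] * fill)
        # First stage: fetch the next EDM row into the other buffer.
        if step < n:
            buffers[step % 2] = list(edm_rows[step])
    return ldm


if __name__ == "__main__":
    print(zdc_pipeline([[1, 2], [3, 4]], fill=1))
```

Within a given step the first stage writes one buffer while the second stage reads the other, which is the alternation the claims describe as ping-pong operation.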

4. The matrix transfer accelerator (MTA) system of Claim 1 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

5. The matrix transfer accelerator (MTA) system of Claim 1 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.
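
By way of illustration only, a register-driven zero-fill transfer of the kind recited in Claims 4 and 5 can be sketched in Python. The class `ZdcRegisters` and the function `zdc_transfer` are hypothetical. The sketch assumes that the TCR holds the total number of elements to move, that the DWR holds the number of elements per matrix row, that the FCR gives a symmetric border width in elements, and that both memories are flat lists indexed by the ESR and LTR values; the claims do not fix these encodings.

```python
from dataclasses import dataclass


@dataclass
class ZdcRegisters:
    dwr: int  # data width register: elements per matrix row (assumed)
    tcr: int  # transfer count register: total elements to move (assumed)
    fcr: int  # fill count register: border width in elements (assumed)
    esr: int  # EDM source address, used as a list index here
    ltr: int  # LDM target address, used as a list index here


def zdc_transfer(edm, ldm, r):
    """Copy a matrix from edm to ldm, surrounding it with zeros."""
    rows = r.tcr // r.dwr
    padded_width = r.dwr + 2 * r.fcr
    out = [0] * (padded_width * r.fcr)          # top fill rows
    for i in range(rows):
        src = r.esr + i * r.dwr
        out.extend([0] * r.fcr)                 # left fill
        out.extend(edm[src:src + r.dwr])        # one row of matrix data
        out.extend([0] * r.fcr)                 # right fill
    out.extend([0] * (padded_width * r.fcr))    # bottom fill rows
    ldm[r.ltr:r.ltr + len(out)] = out           # write into local memory
    return len(out)


if __name__ == "__main__":
    ldm = [9] * 16
    zdc_transfer([1, 2, 3, 4], ldm, ZdcRegisters(2, 4, 1, 0, 0))
    print(ldm)
```

With a 2x2 source matrix and a fill count of one, the sketch writes a 4x4 block whose outer ring is zero.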

6. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said LDM includes one or more output feature map (OFM) storage elements;

said OFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM by sequentially executing the following operations:

(1) Initializing a column tile processing counter (C=0);

(2) Processing left padding (Lpad) and partial data in a first half of a first column tile of said LFM[*,C] stored in said LDM;

(3) Processing data in a second half of a first column tile of said LFM[*,C] stored in said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in a first half of a column tile of said LFM[*,C] stored in said LDM;

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C-1] from said LDM to said EDM;

(7) Processing data in a second half of a column tile of said LFM[*,C] stored in said LDM;

(8) Determining if all said LFM tile data in the said LDM has been processed (including partial tile data adjacent to right padding (Rpad) data), and if not, proceeding to step (10);

(9) Transferring a last column tile of LFM[*,C] from said LDM to said EDM; and

(10) Determining if all column tile processing is complete, and if not, proceeding to said step (4).
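
By way of illustration only, the output-side ordering recited in Claim 6 can be sketched in Python as a list of events. The function name `ofm_with_pad_schedule`, the `num_tiles` parameter, and the event labels are hypothetical. The overlap of steps (5) and (6) is represented by pairing the two events in one tuple, and the sketch assumes that the partial tile data adjacent to the right padding is handled as part of processing the last column tile.

```python
def ofm_with_pad_schedule(num_tiles):
    """Return the ordered events for writing num_tiles output tiles."""
    if num_tiles < 2:
        raise ValueError("this sketch assumes at least two column tiles")
    events = []
    c = 0                                            # step (1): C = 0
    events.append(("proc_lpad_and_first_half", c))   # step (2)
    events.append(("proc_second_half", c))           # step (3)
    while True:
        c += 1                                       # step (4): C = C + 1
        # Steps (5) and (6) overlap: compute on tile C while the finished
        # tile C-1 is written from the LDM back to the EDM.
        events.append((("proc_first_half", c), ("store", c - 1)))
        events.append(("proc_second_half", c))       # step (7)
        if c == num_tiles - 1:                       # step (8): all processed
            events.append(("store", c))              # step (9): last tile
            break                                    # step (10): complete
    return events


if __name__ == "__main__":
    for event in ofm_with_pad_schedule(3):
        print(event)
```

Every tile is stored exactly once in this sketch; only the final store has no processing step to overlap with.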

7. The matrix transfer accelerator (MTA) system of Claim 6 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

8. The matrix transfer accelerator (MTA) system of Claim 6 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

9. The matrix transfer accelerator (MTA) system of Claim 6 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

10. The matrix transfer accelerator (MTA) system of Claim 6 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

11. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more input feature map (IFM) storage elements;

said IFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM by sequentially executing the following operations:

(1) Initializing a column tile processing counter (C=0);

(2) Padding a left column tile (Lpad) of said LFM[*,C] stored in said LDM;

(3) Transferring a column tile of said LFM[*,C] from said EDM to said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in first half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(7) Processing data in second half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(8) Determining if all said LFM tile data has been transferred to said LDM, and if not, proceeding to step (10);

(9) Padding a right column tile (Rpad) of said LFM[*,C] stored in said LDM; and

(10) Determining if all column tile processing is complete, and if not, proceeding to said step (4).

12. The matrix transfer accelerator (MTA) system of Claim 11 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

13. The matrix transfer accelerator (MTA) system of Claim 11 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

14. The matrix transfer accelerator (MTA) system of Claim 11 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

15. The matrix transfer accelerator (MTA) system of Claim 11 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

16. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said LDM includes one or more output feature map (OFM) storage elements;

said OFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM by sequentially executing the following operations:

(1) Initializing a column tile processing counter (C=0);

(2) Processing data in a first half of a first column tile of said LFM[*,C] stored in said LDM;

(3) Processing data in a second half of said first column tile of said LFM[*,C] stored in said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in a first half of a column tile of said LFM[*,C] stored in said LDM;

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C-1] from said LDM to said EDM;

(7) Processing data in a second half of a column tile of said LFM[*,C] stored in said LDM; and

(8) Determining if all column tile processing is complete, and if not, proceeding to said step (4).

17. The matrix transfer accelerator (MTA) system of Claim 16 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

18. The matrix transfer accelerator (MTA) system of Claim 16 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

19. The matrix transfer accelerator (MTA) system of Claim 16 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

20. The matrix transfer accelerator (MTA) system of Claim 16 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

21. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements;

said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element;

said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

said LDM further includes a foreground input feature map (IFM-fore) storage element;

said DTP is configured to transfer small feature maps (SFM) with no pad insertion between said EDM and said LDM by sequentially:

(1) executing a 1D-to-1D data transfer of all said IFM from said EDM to said LDM;

(2) concurrent with steps (2)-(5), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(3) concurrent with steps (2)-(5), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(4) concurrent with steps (2)-(5), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(5) concurrent with steps (2)-(5), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(6) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.
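
By way of illustration only, the foreground/background buffering recited in Claim 21 can be sketched in Python. The helper names `matmul` and `run_small_feature_maps` are hypothetical, matrices are plain nested lists, and the operations that the claim describes as concurrent are executed one after another here. The sketch assumes that the first FCM is loaded into the foreground buffer before the loop starts and that the last result is written back after the loop ends; step (6), seam removal or output zero padding, is omitted.

```python
def matmul(a, b):
    """Plain nested-list matrix product a * b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def run_small_feature_maps(ifm, fcm_list):
    """Compute one OFM per FCM using fore/back buffer swapping."""
    if not fcm_list:
        return []
    edm_ofm = []                           # results written back to the EDM
    ifm_fore = [row[:] for row in ifm]     # step (1): one-time IFM transfer
    fcm_fore, fcm_back = fcm_list[0], None # assumption: first FCM preloaded
    ofm_fore, ofm_back = None, None
    n = len(fcm_list)
    for k in range(n):
        # Step (2): load the next FCM into the background buffer.
        fcm_back = fcm_list[k + 1] if k + 1 < n else None
        # Step (3): write the previously calculated OFM back to the EDM.
        if ofm_back is not None:
            edm_ofm.append(ofm_back)
        # Step (4): OFM-fore = FCM-fore * IFM-fore.
        ofm_fore = matmul(fcm_fore, ifm_fore)
        # Step (5): swap the foreground and background pointers.
        ofm_fore, ofm_back = ofm_back, ofm_fore
        fcm_fore, fcm_back = fcm_back, fcm_fore
    if ofm_back is not None:               # drain the final result
        edm_ofm.append(ofm_back)
    return edm_ofm


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    doubled = [[2, 0], [0, 2]]
    print(run_small_feature_maps([[1, 2], [3, 4]], [identity, doubled]))
```

Because only pointers are exchanged in step (5), no matrix data is copied between the foreground and background buffers.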

22. The matrix transfer accelerator (MTA) system of Claim 21 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

23. The matrix transfer accelerator (MTA) system of Claim 21 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

24. The matrix transfer accelerator (MTA) system of Claim 21 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

25. The matrix transfer accelerator (MTA) system of Claim 21 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

26. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements;

said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element;

said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

said LDM further includes a foreground input feature map (IFM-fore) storage element;

said DTP is configured to transfer small feature maps (SFM) with pad insertion between said EDM and said LDM by sequentially:

(1) executing a 2D-to-2D data transfer of all said IFM from said EDM to said LDM leaving space in said LDM for zero filling;

(2) executing a peripheral zero-fill operation on said 2D-to-2D data stored in said LDM;

(3) concurrent with steps (3)-(6), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(4) concurrent with steps (3)-(6), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(5) concurrent with steps (3)-(6), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(6) concurrent with steps (3)-(6), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(7) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.
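
By way of illustration only, steps (1) and (2) of Claim 26 can be sketched in Python as a strided copy followed by a separate border fill. The function names `transfer_2d_with_gap` and `zero_fill_periphery`, the `pad` parameter, and the row-major flat-list memory layout are hypothetical and are not specified by the claim.

```python
def transfer_2d_with_gap(edm, rows, cols, pad, ldm):
    """Copy a rows x cols matrix into ldm, leaving a pad-wide border."""
    stride = cols + 2 * pad              # row length of the padded matrix
    for r in range(rows):
        dst = (r + pad) * stride + pad   # skip top border rows and left pad
        ldm[dst:dst + cols] = edm[r * cols:(r + 1) * cols]
    return stride


def zero_fill_periphery(ldm, rows, cols, pad):
    """Write zeros to every border element around the copied matrix."""
    stride = cols + 2 * pad
    for r in range(rows + 2 * pad):
        for c in range(stride):
            interior = pad <= r < pad + rows and pad <= c < pad + cols
            if not interior:
                ldm[r * stride + c] = 0


if __name__ == "__main__":
    ldm = [-1] * 16                      # stale contents before the transfer
    transfer_2d_with_gap([1, 2, 3, 4], 2, 2, 1, ldm)
    zero_fill_periphery(ldm, 2, 2, 1)
    print(ldm)
```

After the first call the border still holds whatever the local memory contained before, which is why the separate zero-fill operation of step (2) is needed in this arrangement.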

27. The matrix transfer accelerator (MTA) system of Claim 26 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

28. The matrix transfer accelerator (MTA) system of Claim 26 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

29. The matrix transfer accelerator (MTA) system of Claim 26 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

30. The matrix transfer accelerator (MTA) system of Claim 26 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

31. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements; said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element; said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

(1) executing a 1D-to-1D data transfer of all said IFM from said EDM to said LDM;

(2) executing a 2D-to-2D data transfer of all input feature maps (IFM) from said LDM to said LDM leaving space in said LDM for zero filling;

(3) executing a peripheral zero-fill operation on said 2D-to-2D data stored in said LDM;

(4) concurrent with steps (4)-(7), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(5) concurrent with steps (4)-(7), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(6) concurrent with steps (4)-(7), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(7) concurrent with steps (4)-(7), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(8) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.

32. The matrix transfer accelerator (MTA) system of Claim 31 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

33. The matrix transfer accelerator (MTA) system of Claim 31 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.
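
By way of non-limiting illustration, the three-stage data path of Claims 32 and 33 may be modeled sequentially as shown below. This is a sketch under stated assumptions: the hardware stages operate in parallel, whereas this model runs them one after another within each step; the function name `zdc_pipeline`, the row-at-a-time granularity, and the use of a Python `deque` as a stand-in for the circular write buffer are hypothetical. The source matrix is assumed to contain at least one row.

```python
from collections import deque


def zdc_pipeline(edm_rows, fill):
    """Sequential model of FDP -> ping-pong buffers -> SDP -> CWB -> TDP.

    edm_rows: non-empty list of equal-length rows held in the EDM.
    fill:     number of zero elements added on every side (assumed meaning).
    Returns the zero-filled matrix as written to the LDM.
    """
    width = len(edm_rows[0]) + 2 * fill
    buffers = [None, None]  # two read data buffers whose roles alternate
    write_idx = 0           # buffer the FDP fills during this step
    cwb = deque()           # stand-in for the circular write buffer
    ldm = []

    def drain():
        # TDP: move everything queued in the CWB into the LDM.
        while cwb:
            ldm.append(cwb.popleft())

    # Top border rows are generated on the write side.
    for _ in range(fill):
        cwb.append([0] * width)

    for step in range(len(edm_rows) + 1):
        read_idx = 1 - write_idx
        # FDP: EDM -> buffer currently being filled.
        if step < len(edm_rows):
            buffers[write_idx] = list(edm_rows[step])
        # SDP: previously filled buffer -> CWB, adding side zeros.
        if step > 0:
            cwb.append([0] * fill + buffers[read_idx] + [0] * fill)
        drain()
        # Ping-pong: swap buffer roles after each FDP transfer completes.
        write_idx = read_idx

    # Bottom border rows.
    for _ in range(fill):
        cwb.append([0] * width)
    drain()
    return ldm
```

The extra loop iteration at the end lets the SDP consume the last buffer after the FDP has no more rows to fetch, which is the usual drain step of a two-buffer pipeline.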

34. The matrix transfer accelerator (MTA) system of Claim 31 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

35. The matrix transfer accelerator (MTA) system of Claim 31 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

36. A matrix transfer accelerator (MTA) system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements;

said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element;

said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

(1) executing a 1D-to-1D data transfer of all said IFM from said EDM to said LDM with peripheral zero filling of said LDM data;

(2) concurrent with steps (2)-(5), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(3) concurrent with steps (2)-(5), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(4) concurrent with steps (2)-(5), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(5) concurrent with steps (2)-(5), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(6) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.
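
By way of non-limiting illustration, the foreground/background (ping/pong) buffering of steps (2)-(5) of Claim 36 may be sketched as follows. This is a sequential model under stated assumptions: the transfers and the computation that the claim recites as concurrent are executed one after another here, seam removal and output padding of step (6) are omitted, and the names `matmul` and `process_filters` as well as the nested-list matrix representation are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product a * b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def process_filters(ifm_fore, fcm_edm):
    """Model of fore/back double buffering over a list of FCMs in the EDM.

    ifm_fore: input feature map already resident in the LDM.
    fcm_edm:  list of filter coefficient matrices stored in the EDM.
    Returns the OFMs in the order they are written back to the EDM.
    """
    edm_ofm = []
    if not fcm_edm:
        return edm_ofm

    fcm = {"fore": fcm_edm[0], "back": None}
    ofm = {"fore": None, "back": None}

    for i in range(len(fcm_edm)):
        # Load the next FCM into the background buffer (EDM -> LDM).
        fcm["back"] = fcm_edm[i + 1] if i + 1 < len(fcm_edm) else None
        # Write the previously calculated OFM-back to the EDM.
        if ofm["back"] is not None:
            edm_ofm.append(ofm["back"])
        # OFM-fore = FCM-fore * IFM-fore.
        ofm["fore"] = matmul(fcm["fore"], ifm_fore)
        # Swap the foreground/background pointers.
        fcm["fore"], fcm["back"] = fcm["back"], fcm["fore"]
        ofm["fore"], ofm["back"] = ofm["back"], ofm["fore"]

    # Drain: the last computed OFM is still held in the background buffer.
    if ofm["back"] is not None:
        edm_ofm.append(ofm["back"])
    return edm_ofm
```

Swapping pointers rather than copying data is what allows the next FCM load and the previous OFM write-back to overlap with the current matrix product in an actual implementation.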

37. The matrix transfer accelerator (MTA) system of Claim 36 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery pad-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

38. The matrix transfer accelerator (MTA) system of Claim 36 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that includes:

(a) first data transfer processor (FDP);

(b) second data transfer processor (SDP); and

(c) third data transfer processor (TDP);

wherein:

said FDP, said SDP, and said TDP operate in parallel;

said FDP transfers data from said EDM to a first read data buffer (FDB);

said SDP transfers data from a second read data buffer (SDB) to a circular write buffer (CWB) with additional matrix periphery zero-fill during said SDB-to-CWB data transfer;

said TDP path transfers data from said CWB to said LDM;

said data transfers to said FDB are alternated with said SDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB; and

said data transfers from said SDB are alternated with said FDB in a ping-pong fashion after every completion of said FDP transfer from said EDM to said FDB.

39. The matrix transfer accelerator (MTA) system of Claim 36 wherein said MTA further includes a pad-fill direct memory access (DMA) controller (PDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said PDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by a width value in said DWR;

said PDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

40. The matrix transfer accelerator (MTA) system of Claim 36 wherein said MTA further includes a zero-fill direct memory access (DMA) controller (ZDC) that transfers data from said EDM to said LDM based on the content of a set of DMA controller registers including:

(a) data width register (DWR);

(b) transfer count register (TCR);

(c) fill count register (FCR);

(d) EDM source address register (ESR); and

(e) LDM target address register (LTR);

wherein:

said ZDC transfers matrix data from said EDM at said ESR address to said LDM at said LTR address;

said EDM consists of matrix row data having a data width defined by said DWR;

said ZDC is configured to transfer data from said EDM to said LDM and automatically peripherally pad-fill matrix data written to said LDM based on a count value in said FCR.

41. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more input feature map (IFM) storage elements;

said IFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) Initializing a column tile processing counter (C=0);

(2) Transferring a column tile of LFM[*,C] from said EDM to said LDM;

(3) Processing data in a first column tile of said LFM[*,C] stored in said LDM;

(4) Transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(5) Incrementing said column tile counter (C=C+1);

(6) Concurrent with operation step (7), processing data in first half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(7) Concurrent with operation step (6), transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(8) Processing data in second half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]); and

(9) Determining if all column tile processing is complete, and if not, proceeding to said step (5).
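
By way of non-limiting illustration, the ordering of steps (1)-(9) of Claim 41 may be traced with the short sketch below. It only records which operation happens when and moves no data. The function name `ifm_column_tile_schedule`, the trace strings, and the termination rule are hypothetical; in particular, the sketch assumes that no tile is loaded once C+1 would fall outside the feature map and that the loop is skipped entirely when there is only one tile, neither of which the claim states.

```python
def ifm_column_tile_schedule(num_tiles):
    """Return the order of loads and processing for num_tiles column tiles.

    "a || b" in the trace marks operations the claim recites as concurrent.
    """
    trace = []
    c = 0                                      # step (1)
    trace.append(f"load {c}")                  # step (2)
    trace.append(f"process {c}")               # step (3)
    if num_tiles > 1:
        trace.append(f"load {c + 1}")          # step (4)

    while c + 1 < num_tiles:                   # step (9), assumed test
        c += 1                                 # step (5)
        first = f"process first half {c - 1},{c}"        # step (6)
        if c + 1 < num_tiles:                  # assumed bounds check
            first += f" || load {c + 1}"       # step (7)
        trace.append(first)
        trace.append(f"process second half {c - 1},{c}")  # step (8)
    return trace
```

For three tiles the trace shows the load of tile 2 overlapping the first-half processing of tiles 0 and 1, which is the compute/transfer overlap the claim describes.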

42. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said LDM includes one or more output feature map (OFM) storage elements;

said OFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) Initializing a column tile processing counter (C=0);

(2) Processing left padding (Lpad) and partial data in a first half of a first column tile of said LFM[*,C] stored in said LDM;

(3) Processing data in a second half of a first column tile of said LFM[*,C] stored in said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in a first half of a column tile of said LFM[*,C] stored in said LDM;

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C-1] from said LDM to said EDM;

(7) Processing data in a second half of a column tile of said LFM[*,C] stored in said LDM;

(8) Determining if all said LFM tile data in the said LDM has been processed (including partial tile data adjacent to right padding (Rpad) data), and if not, proceeding to step (10);

(9) Transferring a last column tile of LFM[*,C] from said LDM to said EDM;

(10) Determining if all column tile processing is complete, and if not, proceeding to said step (4).

43. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more input feature map (IFM) storage elements;

said IFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) Initializing a column tile processing counter (C=0);

(2) Padding a left column tile (Lpad) of said LFM[*,C] stored in said LDM;

(3) Transferring a column tile of said LFM[*,C] from said EDM to said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in first half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C+1] from said EDM to said LDM;

(7) Processing data in second half of adjacent column tiles of said LFM stored in said LDM (LDM[*,C-1] and LDM[*,C]);

(8) Determining if all said LFM tile data has been transferred to said LDM, and if not, proceeding to step (10);

(9) Padding a right column tile (Rpad) of said LFM[*,C] stored in said LDM;

(10) Determining if all column tile processing is complete, and if not, proceeding to said step (4).

44. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said LDM includes one or more output feature map (OFM) storage elements;

said OFM include one or more large feature map (LFM) storage elements;

said DTP is configured to transfer data between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) Initializing a column tile processing counter (C=0);

(2) Processing data in a first half of a first column tile of said LFM[*,C] stored in said LDM;

(3) Processing data in a second half of said first column tile of said LFM[*,C] stored in said LDM;

(4) Incrementing said column tile counter (C=C+1);

(5) Concurrent with operation step (6), processing data in a first half of a column tile of said LFM[*,C] stored in said LDM;

(6) Concurrent with operation step (5), transferring a column tile of said LFM[*,C-1] from said LDM to said EDM;

(7) Processing data in a second half of a column tile of said LFM[*,C] stored in said LDM;

(8) Determining if all column tile processing is complete, and if not, proceeding to said step (4).

45. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements;

said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element;

said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

said LDM further includes a foreground input feature map (IFM-fore) storage element;

said DTP is configured to transfer small feature maps (SFM) between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) executing a 1D-to-1D data transfer of all said IFM from said EDM to said LDM;

(2) concurrent with steps (2)-(5), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(3) concurrent with steps (2)-(5), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(4) concurrent with steps (2)-(5), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(5) concurrent with steps (2)-(5), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(6) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.

46. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising:

(a) external data memory (EDM);

(b) local data memory (LDM); and

(c) data transfer processor (DTP);

wherein:

said EDM includes one or more output feature map (OFM) storage elements;

said EDM includes one or more filter coefficient multiplier (FCM) storage elements;

said EDM includes one or more input feature map (IFM) storage elements;

said LDM further includes a foreground output feature map (OFM-fore) storage element;

said LDM further includes a background output feature map (OFM-back) storage element;

said LDM further includes a foreground filter coefficient multiplier (FCM-fore) storage element;

said LDM further includes a background filter coefficient multiplier (FCM-back) storage element;

said LDM further includes a foreground input feature map (IFM-fore) storage element;

said DTP is configured to transfer small feature maps (SFM) between said EDM and said LDM;

said method is executed on said DTP and includes the steps of:

(1) executing a 2D-to-2D data transfer of all said IFM from said EDM to said LDM leaving space in said LDM for zero filling;

(2) executing a peripheral zero-fill operation on said 2D-to-2D data stored in said LDM;

(3) concurrent with steps (3)-(6), executing a 1D-to-1D data transfer of said FCM to said FCM-back via a data transfer from said EDM to said LDM;

(4) concurrent with steps (3)-(6), transferring a previously calculated output feature matrix (OFM) (OFM-back) from said LDM to said EDM;

(5) concurrent with steps (3)-(6), calculating an output matrix product (OMP) and storing said OMP in said OFM-fore via the relation OFM-fore = (FCM-fore * IFM-fore);

(6) concurrent with steps (3)-(6), swapping foreground/background ping/pong memory pointers (fore/back) for OFM-fore/OFM-back and FCM-fore/FCM-back; and

(7) removing seams or inserting zero padding in said OMP based on whether output padding is enabled for said OMP.
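
By way of non-limiting illustration, steps (1) and (2) of Claim 46 differ from a fill-during-transfer approach in that the data is first placed with a gap around it and the gap is zeroed afterwards. The sketch below models that two-phase sequence. The function names `copy_2d_with_margin` and `zero_fill_periphery`, the flat-list memory model, the `base` offset, and the uniform pad width on all four sides are hypothetical; the destination list is assumed to be pre-sized for the padded matrix.

```python
def copy_2d_with_margin(src, rows, cols, pad, ldm, base=0):
    """Phase 1: strided 2D copy that leaves a pad-wide gap on every side."""
    stride = cols + 2 * pad
    for r in range(rows):
        dst = base + (r + pad) * stride + pad
        ldm[dst:dst + cols] = src[r * cols:(r + 1) * cols]
    return stride


def zero_fill_periphery(ldm, rows, cols, pad, base=0):
    """Phase 2: write zeros into the gap left around the copied data."""
    stride = cols + 2 * pad
    for r in range(rows + 2 * pad):
        start = base + r * stride
        if r < pad or r >= rows + pad:
            # Entire top or bottom border row.
            ldm[start:start + stride] = [0] * stride
        else:
            # Left and right border elements of an interior row.
            ldm[start:start + pad] = [0] * pad
            ldm[start + pad + cols:start + stride] = [0] * pad
```

Separating the two phases means the periphery only needs to be rewritten when stale data may be present in the gap, which is one possible reason to prefer this ordering; the claims themselves do not state a rationale.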

47. A matrix transfer accelerator (MTA) method operating on a matrix transfer accelerator (MTA) system, said system comprising: