Thursday, 2 July 2015

Bibliography on Multi-soft-core and Many-soft-core systems

By multi-soft-core and many-soft-core systems I mean multiprocessor systems implemented in FPGAs using soft-core processors.

Here you can find a list of multi-soft-core and many-soft-core designs. The list is ordered by year of publication. You can access each paper by following the link on the design keyword.

The design keyword is a combination of the name of the first author, the conference/journal where it was published, and the year of publication.

If your design is not listed, PLEASE fill this form!
Design Year Device # Soft-Cores
Martina_MWSCAS_2002 2002 XCV1000 8 Proprietary DSP
Li_FCCM_2003 2003 XCV1000E 95 Proprietary (PIC-based)
Hubner_PDP_2005 2005 XC2V3000 5 Microblaze
HUERTA_WSEAS Trans. Circuits Syst._2005 2005 XC2V6000 8 Microblaze
Hung_DATE_2005 2005 EP1S40 8 NIOS
Jin_CODES+ISSS_2005 2005 XC2VP50 12 Microblaze
Lehtoranta_FPL_2005 2005 EP1S40 4 NIOS
Ravindran_FPL_2005 2005 XC2VP50 14 Microblaze
Salminen_ISCAS_2005 2005 EP1S40 8 NIOS
Dykes_ICFPT_2007 2007 XC2VP30 2 Microblaze
Freitas_ISCAS_2007 2007 XC2VP20 4 Microblaze
Huerta_SPL_2007 2007 XC4VFX12 4 Microblaze
Krasnov_FPL_2007 2007 XC2VP70 12 Microblaze
Tumeo_GLSVLSI_2007 2007 XC2VP30 4 Microblaze
Mplemenos_FCCM_2008 2008 XC5VLX330T 80 Microblaze
Pitter_SIES_2008 2008 EP2C35 8 JOP
Wang_IDT_2008 2008 XC4VFX140 24 Microblaze
Khan_EURASIP JES_2009 2009 EP2S60 23 NIOS
Yan_ASICON_2009 2009 EP2S180 4 NIOS
Lee_JKSCI_2009 2009 EP2C70 4 NIOS
Bao_MCC_2009 2009 EP2C35 2 NIOS
Fernandez-Alonso_ICECS_2010 2010 EP2S180 16 NIOS
Giefers_FPL_2010 2010 XC5VLX110T 30 Microblaze
Lebedev_ReCONFIG_2010 2010 XC5VLX155T 49 C-Core, A-Core
Kornaros_JSA_2010 2010 XC4VFX20 4 Microblaze
Wang_FPT_2010 2010 XC5VL330 8 Proprietary
Tumeo_DATE_2010 2010 XC2VP30 8 Microblaze
Castells-Rufas_FPT_2011 2011 EP2S180 16 NIOS
Chen_ICECC_2011 2011 XC5VFX130T 5 Microblaze
Castells-Rufas_ReCONFIG_2012 2012 EP4CE22 4 NIOS
Stevens_TBioCAS_2012 2012 XC6VLX240T 8 LE1
Han_SPIE_2012 2012 EP3C40 2 NIOS
Jing_Radar Conference_2013 2013 XC5VFX130T 4 Microblaze
Kondo_ASP-DAC_2013 2013 XC6VLX240T 8 Geyser
Plumbridge_ReCONFIG_2013 2013 XC6VLX240T 9 Microblaze
Choi_FPT_2013 2013 EP4SGX530 120 Application Specific
Plumbridge_SIGARCH_2014 2014 XC6VLX240T 20 Microblaze
Rashtchi_J CIRCUIT SYST COMP_2014 2014 EP4CGX150 11 NIOS
Raza_CONECCT_2014 2014 XC7K325T 3 LWP
Véstias_Arxiv_2014 2014 XCZ7020 32 Proprietary
Baklouti_IJRC _2014 2014 5SGXEA7 32 NIOS
Podobas_MCSOC_2014 2014 5SGXEA7 120 Application Specific
Castells-Rufas_JCRA_2015 2015 EP4SGX530 128 NIOS
Jose_ANALOG INTEGR CIRC S_2015 2015 XC7Z020 16 Proprietary
Kiefer_Embedded World_2015 2015 XC5VLX110T 8 ParaNUT (OpenRISC ISA)

To contribute to this list, please fill in this form!


Monday, 11 May 2015

Notes about gcc Inline assembly and NIOSII custom instructions

The syntax for gcc inline assembly is

asm ( "assembler template"
      : output operands               /* optional */
      : input operands                /* optional */
      : list of clobbered registers   /* optional */ );


On the NIOS II, custom instructions are usually not invoked with inline assembly, but through macros, generated along with the BSP, that end up calling GCC built-in functions such as

int __builtin_custom_inii (int n, int dataa, int datab);

But if you want to invoke custom instructions directly with inline assembly, you can write, for instance,

int a;  // first input of the custom instruction
int b;  // second input of the custom instruction
int r;  // result
asm volatile ("custom 16, %0, %1, %2" : "=r" (r) : "r" (a), "r" (b) );

In the assembler template string, % followed by a number marks a placeholder that will be substituted by an input or output operand.

In the example above, %0, %1, and %2 are substituted by the elements specified in the output-operands and input-operands sections.

So %0 is substituted by what is specified by "=r" (r), which can be read as: "=r" -> a write-only register operand, (r) -> the C variable where the result will be stored.

%1 is substituted by what is specified by "r" (a), read as: "r" -> a register input operand, (a) -> the C variable the value is taken from.

%2 is substituted by what is specified by "r" (b), read the same way, with (b) as the source variable.

More details about inline assembly can be found at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

Tuesday, 5 May 2015

Scalability of OpenMP Linpack


After validating that the OpenMP version of (the "old") LINPACK was correct, I measured its scalability on two platforms.

It scales for a low number of threads, but it does not look like the application you would use to test machines with hundreds of cores.


Monday, 16 March 2015

Visualizing execution of parallel programs

In the last post I was surprised by the apparent slowdown of the OpenMP version of Linpack as the number of threads increased.

So I decided to visualize what was happening on a desktop multicore machine.

After trying (very hard) to install the tracing infrastructure in Cygwin, I finally gave up. Lots of problems, which I will detail later.

So I switched to my Ubuntu installation. Here it was fairly easy.

First

download, build and install OPARI2. OPARI2 is a source-to-source instrumenter for OpenMP applications, so that you can intercept interesting OpenMP events (like thread creation, etc.)

cd /usr/local/contrib
wget http://www.vi-hps.org/upload/packages/opari2/opari2-1.1.2.tar.gz
tar -xvf opari2-1.1.2.tar.gz
cd opari2-1.1.2
./configure --prefix=/opt/opari2
make
make install

then, download, build and install Score-P. Score-P is a collection of tools that let you easily instrument applications to generate profiling or tracing information. The same tool can generate OTF2 traces (for Vampir analysis) or profiling info (for Scalasca or TAU)

cd /usr/local/contrib
wget http://www.vi-hps.org/upload/packages/scorep/scorep-1.4.tar.gz
tar -xvf scorep-1.4.tar.gz
cd  scorep-1.4
./configure --prefix=/opt/scorep --with-opari2=/opt/opari2/ --with-shmem=no
make
make install



then I just compiled my OpenMP Linpack

export SCOREP_ENABLE_TRACING=true
/opt/scorep/bin/scorep --compiler gcc -fopenmp -DROLL -DDP -DCTimer -DOMPC -DORDER=100 clinpack.c -o linpack.omp

and executed it, obtaining the valuable trace files :-)

then I installed the Vampir tool (https://www.vampir.eu/) to visualize my traces...



...and realized that I was actually getting some speedup. The problem was the time-measuring function, which was collecting CPU time (aggregated across cores) instead of wall-clock time.

After compiling again with the correct flag, the FLOPS numbers were coherent:

/opt/scorep/bin/scorep --compiler gcc -fopenmp -DROLL -DDP -DGTODay -DOMPC -DORDER=100 clinpack.c -o linpack.omp

...as so often happens, stupid mistakes take some time to surface.


About the problems compiling in Cygwin: some C syntax errors, and many errors caused by the code assuming that the host file system is case sensitive, which is not the case for Cygwin.

Thursday, 26 February 2015

Benchmarking (Linpack)



Linpack is one of the most popular benchmarks because it is used to build the Top500 list.

This is one of the first versions, which you can run on a single-core machine:


mkdir linpack
cd linpack
mkdir base
cd base
wget http://www.netlib.org/benchmark/linpackc.new
mv linpackc.new linpack.new.c


The Top500 is built using a more complex version (HPL).

Note that a #define DP line is written in the source, so if you want to use single-precision floating point you should remove it.

To compile for double precision and single precision, simply type

sed -i '/#define DP/d' ./linpack.new.c
gcc -DDP -O4 -o linpack.new.dp linpack.new.c
gcc -DSP -O4 -o linpack.new.sp linpack.new.c

Fortunately, there is an OpenMP implementation of the benchmark, although it is not exactly functionally equivalent.


cd ..
mkdir omp_base 
cd omp_base
wget http://www.hpcs.cs.tsukuba.ac.jp/omni-compiler/xcalablemp/download/trunk/tests/clinpack/clinpack.c

To compile it, type

gcc -fopenmp -DUNROLL -DDP -DCTimer -DOMPC -DORDER=1000 -O4 -o linpack.omp clinpack.c

The numbers are a little bit puzzling, since it seems that the OpenMP version is slower than the original (?)

Wednesday, 25 February 2015

Benchmarking (ParBoil)

These days we are trying to find benchmarks to compare the performance we can get from a many-soft-core with alternative systems (CPUs, GPUs, etc.).

The requirements for such a benchmark are that it:
  • should be written in C/C++
  • would ideally have implementations for alternative target platforms (CUDA, OpenCL, etc.)
  • should allow different data sizes, so that we can use problem sizes that fit into the limited memory of the FPGA

Parboil looks like a good candidate

http://impact.crhc.illinois.edu/Parboil/parboil_download_page.aspx

You can also download it from the phoronix-test-suite:

wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5driver.tgz
wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5benchmarks.tgz
wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5datasets_standard.tgz

You have to decompress the files and organize them in your filesystem:

tar -xvf pb2.5driver.tgz
tar -xvf pb2.5benchmarks.tgz
tar -xvf pb2.5datasets_standard.tgz
mv benchmarks/ parboil/
mv datasets/ parboil/

You might be tempted to skip downloading the datasets (it is a big file), but then the Python script fails with a misleading error message ("benchmark directory not found").

So, at this point you can list the available benchmarks:

cd parboil
./parboil list

To compile the benchmarks you first need a Makefile.conf in the common directory; we create an empty one to start. Then we compile the base (simple sequential C) version of an example, e.g. bfs:

touch common/Makefile.conf
./parboil compile bfs base

After compilation, you just need to specify the dataset to run it:

./parboil run bfs base NY

We keep looking for other benchmarks.

Wednesday, 29 June 2011

Many-Soft-Core

We use the term many-soft-cores to refer to an FPGA containing dozens of soft-core processors.