Thursday, 2 July 2015

Bibliography on Multi-soft-core and Many-soft-core systems

By multi-soft-core and many-soft-core systems I mean multiprocessor systems implemented in FPGAs using soft-core processors.

Here you can find a list of multi-soft-core and many-soft-core designs. The list is ordered by year of publication. You can access each paper by following the link on the design keyword.

The design keyword is a combination of the name of the first author, the conference/journal where it was published, and the year of publication.

If your design is not listed, PLEASE fill this form!
Design Year Device # Soft-Cores
Martina_MWSCAS_2002 2002 XCV1000 8 Proprietary DSP
Li_FCCM_2003 2003 XCV1000E 95 Proprietary (PIC-based)
Hubner_PDP_2005 2005 XC2V3000 5 Microblaze
HUERTA_WSEAS Trans. Circuits Syst._2005 2005 XC2V6000 8 Microblaze
Hung_DATE_2005 2005 EP1S40 8 NIOS
Jin_CODES+ISSS_2005 2005 XC2VP50 12 Microblaze
Lehtoranta_FPL_2005 2005 EP1S40 4 NIOS
Ravindran_FPL_2005 2005 XC2VP50 14 Microblaze
Salminen_ISCAS_2005 2005 EP1S40 8 NIOS
Dykes_ICFPT_2007 2007 XC2VP30 2 Microblaze
Freitas_ISCAS_2007 2007 XC2VP20 4 Microblaze
Huerta_SPL_2007 2007 XC4VFX12 4 Microblaze
Krasnov_FPL_2007 2007 XC2VP70 12 Microblaze
Tumeo_GLSVLSI_2007 2007 XC2VP30 4 Microblaze
Mplemenos_FCCM_2008 2008 XC5VLX330T 80 Microblaze
Pitter_SIES_2008 2008 EP2C35 8 JOP
Wang_IDT_2008 2008 XC4VFX140 24 Microblaze
Khan_EURASIP JES_2009 2009 EP2S60 23 NIOS
Yan_ASICON_2009 2009 EP2S180 4 NIOS
Lee_JKSCI_2009 2009 EP2C70 4 NIOS
Bao_MCC_2009 2009 EP2C35 2 NIOS
Fernandez-Alonso_ICECS_2010 2010 EP2S180 16 NIOS
Giefers_FPL_2010 2010 XC5VLX110T 30 Microblaze
Lebedev_ReCONFIG_2010 2010 XC5VLX155T 49 C-Core, A-Core
Kornaros_JSA_2010 2010 XC4VFX20 4 Microblaze
Wang_FPT_2010 2010 XC5VL330 8 Proprietary
Tumeo_DATE_2010 2010 XC2VP30 8 Microblaze
Castells-Rufas_FPT_2011 2011 EP2S180 16 NIOS
Chen_ICECC_2011 2011 XC5VFX130T 5 Microblaze
Castells-Rufas_ReCONFIG_2012 2012 EP4CE22 4 NIOS
Stevens_TBioCAS_2012 2012 XC6VLX240T 8 LE1
Han_SPIE_2012 2012 EP3C40 2 NIOS
Jing_Radar Conference_2013 2013 XC5VFX130T 4 Microblaze
Kondo_ASP-DAC_2013 2013 XC6VLX240T 8 Geyser
Plumbridge_ReCONFIG_2013 2013 XC6VLX240T 9 Microblaze
Choi_FPT_2013 2013 EP4SGX530 120 Application Specific
Plumbridge_SIGARCH_2014 2014 XC6VLX240T 20 Microblaze
Rashtchi_J CIRCUIT SYST COMP_2014 2014 EP4CGX150 11 NIOS
Raza_CONECCT_2014 2014 XC7K325T 3 LWP
Véstias_Arxiv_2014 2014 XCZ7020 32 Proprietary
Baklouti_IJRC _2014 2014 5SGXEA7 32 NIOS
Podobas_MCSOC_2014 2014 5SGXEA7 120 Application Specific
Castells-Rufas_JCRA_2015 2015 EP4SGX530 128 NIOS
Jose_ANALOG INTEGR CIRC S_2015 2015 XC7Z020 16 Proprietary
Kiefer_Embedded World_2015 2015 XC5VLX110T 8 ParaNUT (OpenRISC ISA)

To contribute to this list, please fill in this form!


Monday, 11 May 2015

Notes about gcc Inline assembly and NIOSII custom instructions

The syntax for gcc inline assembly is

asm ( "assembler template"
      : output operands               /* optional */
      : input operands                /* optional */
      : list of clobbered registers   /* optional */ );


On the NIOS II, custom instructions are usually not invoked with inline assembly, but through macros, generated along with the BSP, that end up calling GCC built-in functions such as

int __builtin_custom_inii (int n, int dataa, int datab);

But if you want to invoke custom instructions directly with inline assembly, you can write, for instance,

int a;  // first input of the custom instruction
int b;  // second input of the custom instruction
int r;  // result
asm volatile ("custom 16, %0, %1, %2" : "=r" (r) : "r" (a), "r" (b) );

In the assembler template string, % followed by a number marks a placeholder that will be substituted by an input or output operand.

In the example above, %0, %1, and %2 are substituted by the elements specified in the output-operands and input-operands sections.

So %0 is substituted by what is specified by "=r" (r), which can be read as: "=r" -> a write-only register operand, (r) -> the C variable where the result will be stored.

%1 is substituted by what is specified by "r" (a), read as: "r" -> a register input operand, (a) -> the C variable the value is taken from.

%2 is substituted by what is specified by "r" (b), read the same way, with (b) as the source variable.

More details about inline assembly can be found at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

Tuesday, 5 May 2015

Scalability of OpenMP Linpack


After validating that the OpenMP version of (the "old") LINPACK was correct, I measured its scalability on two platforms.

It scales for a low number of threads, but it does not look like the application you would use to test machines with hundreds of cores.


Monday, 16 March 2015

Visualizing execution of parallel programs

In the last post I was surprised by the apparent slowdown of the OpenMP version of Linpack as the number of threads increased.

So I decided to visualize what was happening on a desktop multicore machine.

After trying (very hard) to install the tracing infrastructure in Cygwin, I finally gave up. Lots of problems, which I will detail later.

So I switched to my Ubuntu installation. Here it was fairly easy.

First

download, build and install OPARI2. OPARI2 is a source-to-source instrumenter for OpenMP applications, so that you can intercept interesting OpenMP events (like thread creation, etc.)

cd /usr/local/contrib
wget http://www.vi-hps.org/upload/packages/opari2/opari2-1.1.2.tar.gz
tar -xvf opari2-1.1.2.tar.gz
cd opari2-1.1.2
./configure --prefix=/opt/opari2
make
make install

then, download, build and install Score-P. Score-P is a collection of tools that let you easily instrument applications to generate profiling or tracing information. The same tool can generate OTF2 traces (for Vampir analysis) or profiling info (for Scalasca or TAU)

cd /usr/local/contrib
wget http://www.vi-hps.org/upload/packages/scorep/scorep-1.4.tar.gz
tar -xvf scorep-1.4.tar.gz
cd  scorep-1.4
./configure --prefix=/opt/scorep --with-opari2=/opt/opari2/ --with-shmem=no
make
make install



then I just compiled my OpenMP Linpack

export SCOREP_ENABLE_TRACING=true
/opt/scorep/bin/scorep --compiler gcc -fopenmp -DROLL -DDP -DCTimer -DOMPC -DORDER=100 clinpack.c -o linpack.omp

and executed it, obtaining the valuable trace files :-)

then I installed the Vampir tool (https://www.vampir.eu/) to visualize my traces...



...and realized that I was actually getting some speedup. The problem was the time-measuring function, which was collecting CPU time (aggregated across cores) instead of wall-clock time.

After compiling again with the correct flag, the FLOPS numbers were coherent:

/opt/scorep/bin/scorep --compiler gcc -fopenmp -DROLL -DDP -DGTODay -DOMPC -DORDER=100 clinpack.c -o linpack.omp

...as so often happens, stupid mistakes take some time to surface.


About the problems compiling in Cygwin: some C syntax errors, and many errors caused by the code assuming that the host file system is case sensitive, which is not the case for Cygwin.

Thursday, 26 February 2015

Benchmarking (Linpack)



Linpack is one of the most popular benchmarks because it is used to build the Top500 list.

This is one of the first versions, which you can run on a single-core machine:


mkdir linpack
cd linpack
mkdir base
cd base
wget http://www.netlib.org/benchmark/linpackc.new
mv linpackc.new linpack.new.c


The Top500 is built using a more complex version (HPL).

Note that a #define DP line is written in the source, so if you want to use single-precision floating point you should remove it.

To compile for double precision and single precision, simply type

sed -i '/#define DP/d' ./linpack.new.c
gcc -DDP -O4 -o linpack.new.dp linpack.new.c
gcc -DSP -O4 -o linpack.new.sp linpack.new.c

Fortunately, there is an OpenMP implementation of the benchmark, although it is not exactly functionally equivalent.


cd ..
mkdir omp_base 
cd omp_base
wget http://www.hpcs.cs.tsukuba.ac.jp/omni-compiler/xcalablemp/download/trunk/tests/clinpack/clinpack.c

To compile it, type

gcc -fopenmp -DUNROLL -DDP -DCTimer -DOMPC -DORDER=1000 -O4 -o linpack.omp clinpack.c

The numbers are a little bit puzzling, since it seems that the OpenMP version is slower than the original (?)

Wednesday, 25 February 2015

Benchmarking (ParBoil)

These days we are trying to find benchmarks to compare the performance we can get from a many-soft-core with alternative systems (CPUs, GPUs, etc.).

The requirements for such a benchmark are that it:
  • should be written in C/C++
  • would ideally have implementations for alternative target platforms (CUDA, OpenCL, etc.)
  • should allow different data sizes, so that we can use problem sizes that fit into the limited memory of the FPGA

Parboil looks like a good candidate

http://impact.crhc.illinois.edu/Parboil/parboil_download_page.aspx

You can also download it from the phoronix-test-suite:

wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5driver.tgz
wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5benchmarks.tgz
wget http://www.phoronix-test-suite.com/benchmark-files/pb2.5datasets_standard.tgz

You have to decompress the files and organize them in your filesystem:

tar -xvf pb2.5driver.tgz
tar -xvf pb2.5benchmarks.tgz
tar -xvf pb2.5datasets_standard.tgz
mv benchmarks/ parboil/
mv datasets/ parboil/

You might be tempted to skip downloading the datasets (it is a big file), but then the Python script fails with a misleading error message ("benchmark directory not found").

So, at this point you can list the available benchmarks:

cd parboil
./parboil list

To compile the benchmarks you first need a Makefile.conf in the common directory; we create an empty one to start. Then we compile the base (simple sequential C) version of an example, e.g. bfs:

touch common/Makefile.conf
./parboil compile bfs base

After compilation, you just need to specify the dataset to run it:

./parboil run bfs base NY

We keep looking for other benchmarks.

Wednesday, 29 June 2011

Many-Soft-Core

We use the term many-soft-cores to refer to an FPGA containing dozens of soft-core processors.