
Commit f19e149

initial
1 parent 7346438 commit f19e149

13,822 files changed: +1182790 -1 lines changed


CONTRIBUTING.md

+16
## Contributing to Spark

*Before opening a pull request*, review the
[Contributing to Spark guide](http://spark.apache.org/contributing.html).
It lists steps that are required before creating a PR. In particular, consider:

- Is the change important and ready enough to ask the community to spend time reviewing?
- Have you searched for existing, related JIRAs and pull requests?
- Is this a new feature that can stand alone as a [third party project](http://spark.apache.org/third-party-projects.html)?
- Is the change being proposed clearly explained and motivated?

When you contribute code, you affirm that the contribution is your original work and that you
license the work to the project under the project's open source license. Whether or not you
state this explicitly, by submitting any copyrighted material via pull request, email, or
other means you agree to license the material under the project's open source license and
warrant that you have the legal authority to do so.

LICENSE

+299
Large diffs are not rendered by default.

NOTICE

+661
Large diffs are not rendered by default.

R/.gitignore

+8
*.o
*.so
*.Rd
lib
pkg/man
pkg/html
SparkR.Rcheck/
SparkR_*.tar.gz

R/CRAN_RELEASE.md

+91
# SparkR CRAN Release

To release SparkR as a package to CRAN, we use the `devtools` package. Please work with the
`dev@spark.apache.org` community and the R package maintainer on this.

### Release

First, check that the `Version:` field in the `pkg/DESCRIPTION` file is updated. Also, check for stale files not under source control.

Note that while `run-tests.sh` runs `check-cran.sh` (which runs `R CMD check`), it does so with `--no-manual --no-vignettes`, which skips the manual and vignette/PDF checks. It is therefore preferable to run `R CMD check` on a manually built source package before uploading a release. Also note that for the CRAN checks of PDF vignettes to succeed, the `qpdf` tool must be installed (to install it, e.g. `yum -q -y install qpdf`).
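As a rough sketch of that manual check (a sketch only: it assumes the working directory is `SPARK_HOME/R`, that `qpdf` is on the `PATH`, and uses "2.1.0" as a placeholder version):

```bash
# Build the source package from pkg/ and run the full CRAN check, including the
# PDF manual and vignettes (SPARK_HOME must be set so the vignette code can run).
R CMD build pkg
SPARK_HOME="$(cd ..; pwd)" R CMD check --as-cran SparkR_2.1.0.tar.gz
```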
To upload a release, update `cran-comments.md`. This should generally contain the results from running the `check-cran.sh` script, along with comments on the status of any `WARNING` (there should not be any) or `NOTE` entries. As part of `check-cran.sh` and the release process, the vignettes are built - make sure `SPARK_HOME` is set and the Spark jars are accessible.

Once everything is in place, run in R under the `SPARK_HOME/R` directory:
```R
paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::release(); .libPaths(paths)
```

For more information, please refer to http://r-pkgs.had.co.nz/release.html#release-check
### Testing: build package manually

To build the package manually, for example to inspect the resulting `.tar.gz` file content, we also use the `devtools` package.

The source package is what gets released to CRAN; CRAN then builds platform-specific binary packages from it.
#### Build source package

To build the source package locally without releasing to CRAN, run in R under the `SPARK_HOME/R` directory:

```R
paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::build("pkg"); .libPaths(paths)
```

(http://r-pkgs.had.co.nz/vignettes.html#vignette-workflow-2)

Similarly, the source package is also created by `check-cran.sh` with `R CMD build pkg`.

For example, this should be the content of the source package:

```sh
DESCRIPTION     R       inst    tests
NAMESPACE       build   man     vignettes

inst/doc/
sparkr-vignettes.html
sparkr-vignettes.Rmd
sparkr-vignettes.Rman

build/
vignette.rds

man/
*.Rd files...

vignettes/
sparkr-vignettes.Rmd
```
#### Test source package

To install, run this:

```sh
R CMD INSTALL SparkR_2.1.0.tar.gz
```

where "2.1.0" is replaced with the version of SparkR.

This command installs SparkR to the default libPaths. Once that is done, you should be able to start R and run:

```R
library(SparkR)
vignette("sparkr-vignettes", package="SparkR")
```
#### Build binary package

To build a binary package locally, run in R under the `SPARK_HOME/R` directory:

```R
paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::build("pkg", binary = TRUE); .libPaths(paths)
```

For example, this should be the content of the binary package:

```sh
DESCRIPTION   Meta        R      html     tests
INDEX         NAMESPACE   help   profile  worker
```

R/DOCUMENTATION.md

+12
# SparkR Documentation

SparkR documentation is generated from in-source comments annotated using
[`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html). After making changes to the documentation and generating man pages,
you can run the following from an R console in the SparkR home directory:
```R
library(devtools)
devtools::document(pkg="./pkg", roclets=c("rd"))
```
You can verify your changes by running

    R CMD check pkg/

R/README.md

+81
# R on Spark

SparkR is an R package that provides a lightweight frontend to use Spark from R.

### Installing sparkR

The SparkR libraries need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
By default the above script uses the system-wide installation of R. However, this can be changed to any user-installed location of R by setting the environment variable `R_HOME` to the full path of the base directory where R is installed, before running the install-dev.sh script.
Example:
```bash
# where /home/username/R is where R is installed and /home/username/R/bin contains the files R and RScript
export R_HOME=/home/username/R
./install-dev.sh
```
### SparkR development

#### Build Spark

Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions you can run

```bash
build/mvn -DskipTests -Psparkr package
```
#### Running sparkR

You can start using SparkR by launching the SparkR shell with

    ./bin/sparkR

The `sparkR` script automatically creates a SparkContext with Spark in local mode by default.
To specify the Spark master of a cluster for the automatically created
SparkContext, you can run

    ./bin/sparkR --master "local[2]"

To set other options, like driver memory or executor memory, you can pass the [spark-submit](http://spark.apache.org/docs/latest/submitting-applications.html) arguments to `./bin/sparkR`.
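For instance (the memory value here is only illustrative), spark-submit options are appended directly to the invocation:

```bash
# Start the SparkR shell on a local master with 2 threads and 2g of driver memory.
./bin/sparkR --master "local[2]" --driver-memory 2g
```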
#### Using SparkR from RStudio

If you wish to use SparkR from RStudio or other R frontends, you will need to set some environment variables that point SparkR to your Spark installation. For example:
```R
# Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/username/spark")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session()
```
#### Making changes to SparkR

The [instructions](http://spark.apache.org/contributing.html) for making contributions to Spark also apply to SparkR.
If you only make R file changes (i.e. no Scala changes), you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run the existing unit tests using the `R/run-tests.sh` script as described below.

#### Generating documentation

The SparkR documentation (Rd files and HTML files) is not a part of the source repository. To generate it, run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, and these packages need to be installed on the machine before using the script. Also, you may need to install these [prerequisites](https://github.com/apache/spark/tree/master/docs#prerequisites). See also `R/DOCUMENTATION.md`.
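A minimal sketch of that workflow (assuming Spark has already been built and the R packages are not yet installed):

```bash
# Install the documentation dependencies, then generate the Rd and HTML docs.
R -e 'install.packages(c("devtools", "knitr"), repos="http://cran.us.r-project.org")'
./R/create-docs.sh
```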
### Examples, Unit tests

SparkR comes with several sample programs in the `examples/src/main/r` directory.
To run one of them, use `./bin/spark-submit <filename> <args>`. For example:
```bash
./bin/spark-submit examples/src/main/r/dataframe.R
```
You can also run the unit tests for SparkR. You need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first:
```bash
R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
./R/run-tests.sh
```

### Running on YARN

The `./bin/spark-submit` script can also be used to submit jobs to YARN clusters. You will need to set the YARN conf dir before doing so. For example, on CDH you can run
```bash
export YARN_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
```

R/WINDOWS.md

+43
## Building SparkR on Windows

To build SparkR on Windows, the following steps are required:

1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to
include Rtools and R in `PATH`.

2. Install
[JDK7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html) and set
`JAVA_HOME` in the system environment variables.

3. Download and install [Maven](http://maven.apache.org/download.html). Also include the `bin`
directory of Maven in `PATH`.

4. Set `MAVEN_OPTS` as described in [Building Spark](http://spark.apache.org/docs/latest/building-spark.html).

5. Open a command shell (`cmd`) in the Spark directory and build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn), including the `-Psparkr` profile to build the R package. For example, to use the default Hadoop versions you can run

```bash
mvn.cmd -DskipTests -Psparkr package
```

`.\build\mvn` is a shell script, so `mvn.cmd` should be used directly on Windows.
## Unit tests

To run the SparkR unit tests on Windows, the following steps are required, assuming you are in the Spark root directory and do not already have Apache Hadoop installed:

1. Create a folder to download Hadoop-related files for Windows. For example, `cd ..` and `mkdir hadoop`.

2. Download the relevant Hadoop bin package from [steveloughran/winutils](https://github.com/steveloughran/winutils). While these are not official ASF artifacts, they are built from the ASF release git hashes by a Hadoop PMC member on a dedicated Windows VM. For further reading, consult [Windows Problems on the Hadoop wiki](https://wiki.apache.org/hadoop/WindowsProblems).

3. Install the files into `hadoop\bin`; make sure that `winutils.exe` and `hadoop.dll` are present.

4. Set the environment variable `HADOOP_HOME` to the full path of the newly created `hadoop` directory.

5. Run the unit tests for SparkR with the command below. You need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first:

```
R -e "install.packages('testthat', repos='http://cran.us.r-project.org')"
.\bin\spark-submit2.cmd --conf spark.hadoop.fs.default.name="file:///" R\pkg\tests\run-all.R
```

R/check-cran.sh

+102
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
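
# Usage sketch (comments added for illustration; derived from the variables read below):
#   NO_TESTS=1       skip running package tests during R CMD check
#   NO_MANUAL=1      skip building the manual and vignettes during R CMD check
#   CLEAN_INSTALL=1  reinstall the built source package into R/lib and zip it for YARN
# e.g. NO_MANUAL=1 NO_TESTS=1 ./check-cran.sh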

set -o pipefail
set -e

FWDIR="$(cd `dirname $0`; pwd)"
pushd $FWDIR > /dev/null

if [ ! -z "$R_HOME" ]
then
  R_SCRIPT_PATH="$R_HOME/bin"
else
  # if system wide R_HOME is not found, then exit
  if [ ! `command -v R` ]; then
    echo "Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly installed."
    exit 1
  fi
  R_SCRIPT_PATH="$(dirname $(which R))"
fi
echo "Using R_SCRIPT_PATH = ${R_SCRIPT_PATH}"

# Install the package (this is required for code in vignettes to run when building it later)
# Build the latest docs, but not vignettes, which are built with the package next
$FWDIR/create-docs.sh

# Build source package with vignettes
SPARK_HOME="$(cd "${FWDIR}"/..; pwd)"
. "${SPARK_HOME}"/bin/load-spark-env.sh
if [ -f "${SPARK_HOME}/RELEASE" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ -d "$SPARK_JARS_DIR" ]; then
  # Build a zip file containing the source package with vignettes
  SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD build $FWDIR/pkg

  find pkg/vignettes/. -not -name '.' -not -name '*.Rmd' -not -name '*.md' -not -name '*.pdf' -not -name '*.html' -delete
else
  echo "Error: Spark JARs not found in $SPARK_HOME"
  exit 1
fi

# Run check as-cran.
VERSION=`grep Version $FWDIR/pkg/DESCRIPTION | awk '{print $NF}'`

CRAN_CHECK_OPTIONS="--as-cran"

if [ -n "$NO_TESTS" ]
then
  CRAN_CHECK_OPTIONS=$CRAN_CHECK_OPTIONS" --no-tests"
fi

if [ -n "$NO_MANUAL" ]
then
  CRAN_CHECK_OPTIONS=$CRAN_CHECK_OPTIONS" --no-manual --no-vignettes"
fi

echo "Running CRAN check with $CRAN_CHECK_OPTIONS options"

if [ -n "$NO_TESTS" ] && [ -n "$NO_MANUAL" ]
then
  "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
else
  # This will run tests and/or build vignettes, and require SPARK_HOME
  SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
fi

# Install source package to get it to generate vignettes rds files, etc.
if [ -n "$CLEAN_INSTALL" ]
then
  echo "Removing lib path and installing from source package"
  LIB_DIR="$FWDIR/lib"
  rm -rf $LIB_DIR
  mkdir -p $LIB_DIR
  "$R_SCRIPT_PATH/"R CMD INSTALL SparkR_"$VERSION".tar.gz --library=$LIB_DIR

  # Zip the SparkR package so that it can be distributed to worker nodes on YARN
  pushd $LIB_DIR > /dev/null
  jar cfM "$LIB_DIR/sparkr.zip" SparkR
  popd > /dev/null
fi

popd > /dev/null
