Wednesday, January 26, 2011

Excel

Introduction

Microsoft Excel is a commercial spreadsheet application written and distributed by Microsoft for Microsoft Windows and Mac OS X. It features calculation, graphing tools, pivot tables and a macro programming language called Visual Basic for Applications. It has been a very widely applied spreadsheet for these platforms, especially since version 5 in 1993. Excel forms part of Microsoft Office. The current versions are 2010 for Windows and 2011 for Mac.

Regression

Definition

A statistical measure that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).

The word "regression" also has an unrelated meaning in psychology, where it involves taking the position of a child in some problematic situation rather than acting in a more adult way. This is usually in response to stressful situations, with greater levels of stress potentially leading to more overt regressive acts.

In that psychological sense, regressive behavior can be simple and harmless, such as a person sucking a pen (as a Freudian regression to oral fixation), or may be more dysfunctional, such as crying or using petulant arguments. The rest of this post is about the statistical meaning.

Linear Regression: Y = a + bX + u

Multiple Regression: Y = a + b₁X₁ + b₂X₂ + b₃X₃ + ... + bₜXₜ + u

Where:
Y= the variable that we are trying to predict
X= the variable that we are using to predict Y
a= the intercept
b= the slope
u= the regression residual.

In multiple regression the separate variables are differentiated by using subscripted numbers.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data points. Regression is often used to determine how much specific factors such as the price of a commodity, interest rates, particular industries or sectors influence the price movement of an asset.
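To make the formula Y = a + bX + u concrete, here is a minimal sketch of how the intercept a and slope b can be estimated by ordinary least squares. The function name `fit_line` and the data points are made up for illustration; this is not taken from Excel or from any particular library.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimise the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data, invented for this example only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
# The residual u is what is left over after the line has explained each point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print("intercept a =", round(a, 4), "slope b =", round(b, 4))
```

With an intercept in the model, the least squares residuals always sum to (numerically) zero, which is a quick sanity check on the fit.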

Investopedia explains Regression

The two basic types of regression:

1. Linear Regression

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental unit), you know both X and Y and you want to find the best straight line through the data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you use the linear regression line as a standard curve to find new values of X from Y, or Y from X.

In linear regression, data are modeled using linear functions, and the unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications of linear regression fall into one of the following two broad categories:

(i) If the goal is prediction, or forecasting, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

(ii) Given a variable y and a number of variables X₁, ..., X_p that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the X_j, to assess which X_j may have no relationship with y at all, and to identify which subsets of the X_j contain redundant information about y, so that once one of them is known, the others are no longer informative.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm, or by minimizing a penalized version of the least squares loss function, as in ridge regression. Conversely, the least squares approach can be used to fit models that are not linear models. Thus, while the terms “least squares” and “linear model” are closely linked, they are not synonymous.
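The difference between plain least squares and a penalized fit can be seen in a small sketch. It assumes a single predictor and a penalty on the slope only (the intercept is left unpenalized); the function name `fit_ridge_line`, the penalty value and the data are all made up for illustration.

```python
def fit_ridge_line(xs, ys, penalty):
    """Minimise sum((y - a - b*x)**2) + penalty * b**2 for one predictor.

    With penalty = 0 this reduces to ordinary least squares.
    """
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx + penalty == 0:
        raise ValueError("slope is undefined for identical x values without a penalty")
    b = sxy / (sxx + penalty)  # the penalty shrinks the slope towards zero
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, invented for this example only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ols_a, ols_b = fit_ridge_line(xs, ys, penalty=0.0)
ridge_a, ridge_b = fit_ridge_line(xs, ys, penalty=10.0)
print("least squares slope:", round(ols_b, 4))
print("ridge slope:        ", round(ridge_b, 4))
```

Both fits are linear models, but only the first is a pure least squares fit, which is the distinction the paragraph above is drawing.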

2. Quadratic Regression

Quadratic regression is a process by which the equation of a parabola of "best fit" is found for a set of data. The key presses below are for a TI-83/84-style graphing calculator.

Before performing the quadratic regression, first set an appropriate viewing rectangle. To calculate the quadratic regression, press STAT, then RIGHT ARROW to CALC. Now select 5:QuadReg. After QuadReg appears alone on the screen, press ENTER. The quadratic regression will then appear on the screen. Enter the regression equation into Y= while leaving PLOT1 on for the data values. Then press GRAPH to see how well the curve fits the data points.

NOTE: The regression results may be copied directly into Y= for graphing purposes by using the following procedure:

After the data values have been entered, press STAT, then RIGHT ARROW to CALC.

Now select 5:QuadReg.

After QuadReg appears alone on the screen, press VARS, then RIGHT ARROW to Y-VARS, noting that 1:Function is selected. Press ENTER to accept and note that 1:Y1 is already selected. Press ENTER to accept, then press ENTER to calculate. The result appears on the screen to several decimal places.

Now press Y= to see that the equation has already been entered for Y1 and is ready to graph.

This is the preferred method for entering the regression equation into Y=, since retyping rounded values by hand can introduce significant rounding errors.
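For readers without a calculator at hand, the same kind of best-fit parabola can be computed in a few lines of code. The sketch below solves the least squares normal equations for y = ax² + bx + c with plain Gaussian elimination; the function names and the data are made up for illustration, and this is not a description of how QuadReg is implemented internally.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: need at least three distinct x values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_parabola(xs, ys):
    """Return (a, b, c) of the least squares parabola y = a*x**2 + b*x + c."""
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    moment_sums = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    normal_matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    c, b, a = solve_linear_system(normal_matrix, moment_sums)
    return a, b, c


# Hypothetical data lying exactly on y = 2x^2 - 3x + 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [15.0, 6.0, 1.0, 0.0, 3.0, 10.0]
a, b, c = fit_parabola(xs, ys)
print("a =", round(a, 6), "b =", round(b, 6), "c =", round(c, 6))
```

Keeping the coefficients at full precision in variables, rather than retyping rounded values, mirrors the advice above about copying the result straight into Y=.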

Here are two examples of where regression is used:

Beer's Law states that there is a linear relationship between concentration of a colored compound in solution and the light absorption of the solution. This fact can be used to calculate the concentration of unknown solutions, given their absorption readings. First, a series of solutions of known concentration are tested for their absorption level. Next, a scatter plot is made of this empirical data and a linear regression line is fitted to the data. This regression line can be expressed as a formula and used to calculate the concentration of unknown solutions.
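The calibration procedure described for Beer's Law can be sketched as follows. The standard concentrations, the absorbance readings and the function names are all invented for illustration; real readings would come from the spectrophotometer.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Step 1: standards of known concentration (mol/L) and their measured
# absorbance. These numbers are hypothetical.
concentrations = [0.0, 0.1, 0.2, 0.3, 0.4]
absorbances = [0.00, 0.12, 0.25, 0.37, 0.49]

# Step 2: fit the calibration line absorbance = intercept + slope * concentration.
intercept, slope = least_squares_line(concentrations, absorbances)


# Step 3: invert the line to estimate the concentration of an unknown sample.
def concentration_from_absorbance(absorbance):
    return (absorbance - intercept) / slope


unknown_reading = 0.30  # hypothetical reading for an unknown solution
print("estimated concentration:",
      round(concentration_from_absorbance(unknown_reading), 4), "mol/L")
```

The estimate is only trustworthy for readings that fall inside the range covered by the standards.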

In a strong acid-strong base titration, a strong base (NaOH) of known concentration is added to a strong acid (also of known concentration, in this case). As the strong base is added to the solution, its OH- ions bind with the free H+ ions of the acid. The equivalence point is reached when neither excess OH- nor excess H+ ions remain in the solution. This equivalence point can be found with a color indicator in the solution or through a pH titration curve.

Linear Regression

Quadratic regression

For more information:
Regression

Tuesday, January 11, 2011

SMILES

Introduction to SMILES

The whole name is almost as catchy as "SMILES". I used to think it was a strange way of representing molecules for computers. But it actually seems like a more straightforward way than IUPAC nomenclature. It's also shorter, and it's possible to have unique names. (It's more difficult to pronounce though.)

SMILES stands for "Simplified Molecular Input Line Entry Specification". It is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules. The SMILES specification was developed by David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure the same SMILES is generated for a molecule regardless of the order of atoms in the structure.

There are two types of SMILES:

1. Canonical

2. Isomeric

Canonical SMILES refers to the version of the SMILES specification that includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it.

Isomeric SMILES refers to the version of the SMILES specification that includes extensions to support the specification of isotopes, chirality, and configuration about double bonds. These are structural features that cannot be specified by connectivity alone, and SMILES which encode this information are termed isomeric SMILES.

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree. Aliphatic (non-aromatic) carbon is written as an uppercase C, atoms in aromatic rings are written with lowercase letters, and ring closures are designated with pairs of matching digits.
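The depth-first traversal idea can be tried out on small molecules. The sketch below is a deliberately simplified toy: it assumes a connected, acyclic molecule with single bonds only, hydrogens already removed, and atoms that need no square brackets. The function name `write_smiles` and the input format are made up for illustration; real toolkits do far more (rings, bond orders, aromaticity, canonical ordering).

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string for an acyclic, single-bonded heavy-atom graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs
    start: index of the atom where the traversal begins
    """
    if len(bonds) != len(atoms) - 1:
        raise ValueError("this toy writer only handles connected acyclic graphs")
    neighbours = {index: [] for index in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def visit(atom, parent):
        children = [n for n in neighbours[atom] if n != parent]
        text = atoms[atom]
        # Every branch except the last one goes inside parentheses.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


# Ethanol: C0-C1-O2. Different starting atoms give different valid strings.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(1, 2), (0, 1)]
ethanol_strings = [write_smiles(ethanol_atoms, ethanol_bonds, start=s) for s in range(3)]
print(ethanol_strings)

# 2-Methylpentane: the methyl group on atom 1 becomes a parenthesised branch.
methylpentane = write_smiles(["C"] * 6, [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)])
print(methylpentane)
```

Starting the traversal from a different atom gives a different but equally valid string, which is why a canonicalisation step is needed to get one unique SMILES per molecule.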

Here are some structure images that I've drawn using ACD/ChemSketch, with the SMILES notation below each structure:

Branched Structures

Cyclic Structures

Aromatic Structures

Branched and Aromatic Structures

You can check for further examples of SMILES notation here.

You can try out SMILES strings at this page; it's kind of fun. How to do it is described on Wikipedia, for example.

Ethane is just CC.

Add double and triple bonds like this:
C#CC=C for butenyne.

Add a branch in parentheses:
CC(C)CCC for 2-Methyl-n-pentane

If you want a ring add a number after the two atoms to be joined together:
C1C(C)CCC1 for Methyl-cyclo-pentane

Add a pyridyl group to the C next to the methyl group (aromatic atoms are written in lower case, and you have to include a second ring closure)
C1(c2ncccc2)C(C)CCC1 for (2-Pyridyl-)-2-methyl-c-pentane

You can add an extra oxirane ring:
C1(c2ncccc2)C(C)CCC13OC3

You can mess with stereochemistry (using @ and @@)
C1(c2ncccc2)[C@@H](C)CC[C@]13OC3

If you still haven't had enough, you can add a double bond in E configuration to the pyridyl ring:
C1(c2nc(/C=C(Cl)\C)ccc2)[C@@H](C)CC[C@]13OC3

or Z configuration
C1(c2nc(/C=C(Cl)/C)ccc2)[C@@H](C)CC[C@]13OC3
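When strings get as long as the last few examples, it is easy to drop a parenthesis or forget the second ring-closure digit. Here is a rough sketch of a checker for just those two mistakes. The function name `check_smiles_structure` is made up, and this is not a real SMILES parser: it ignores chemistry entirely and does not handle two-digit ring closures written with %.

```python
def check_smiles_structure(smiles):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits seen once and still waiting for a partner
    in_bracket = False   # inside [...] digits are not ring closures
    for position, char in enumerate(smiles):
        if in_bracket:
            if char == "]":
                in_bracket = False
            continue
        if char == "[":
            in_bracket = True
        elif char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at position {position}")
                depth = 0
        elif char in "0123456789":
            # The first occurrence opens a ring, the second closes it.
            if char in open_rings:
                open_rings.remove(char)
            else:
                open_rings.add(char)
    if in_bracket:
        problems.append("unclosed '['")
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    for digit in sorted(open_rings):
        problems.append(f"ring closure {digit} opened but never closed")
    return problems


# Raw strings keep the backslash used for double-bond geometry intact.
examples = [
    "C1C(C)CCC1",
    "C1(c2ncccc2)[C@@H](C)CC[C@]13OC3",
    r"C1(c2nc(/C=C(Cl)\C)ccc2)[C@@H](C)CC[C@]13OC3",
    "C1CCCC",   # deliberately broken: ring 1 never closes
]
for text in examples:
    print(text, "->", check_smiles_structure(text) or "looks balanced")
```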

SMILES Bonds

Single*	-
Double	=
Triple	#
Aromatic*	:

*can be omitted

For more references:

Chemical Quantum Images

WordIQ

Go SMILES!

Wassalam..

Tuesday, January 4, 2011

Protein Data Bank (PDB)

Introduction to PDB

The PDB is the Protein Data Bank, a single worldwide repository for 3D structural data of biological molecules. A PDB file, typically with a "pdb" file extension, contains the 3D structural data of a particular biological molecule. In short, a PDB file is broken into two sections:

(i) a header that contains much background information on the molecule in question such as authors and experimental conditions,

(ii) 3D coordinate data that contain the vital experimental data in the form of 3D cartesian coordinates, B-factors, atom information, and more.
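The two sections can be seen by reading a file line by line. The sketch below builds a tiny made-up entry and splits it into header lines and atom records, using the fixed column positions of the legacy PDB text format for ATOM records (for example, x in columns 31-38 and the B-factor in columns 61-66). The helper names and the sample values are hypothetical, the header lines are simplified, and a real file has many more record types.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z,
                     occupancy, b_factor, element):
    """Build one fixed-width ATOM line so the sample below is correctly aligned."""
    return ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}").format(
        "ATOM", serial, name, "", res_name, chain, res_seq, "",
        x, y, z, occupancy, b_factor, element)


def parse_atom_line(line):
    """Read the main fields of an ATOM/HETATM line by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


def split_pdb(text):
    """Return (header_lines, atom_records); other lines after the header are ignored."""
    header_lines, atom_records = [], []
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atom_records.append(parse_atom_line(line))
        elif not atom_records and line.strip():
            header_lines.append(line)
    return header_lines, atom_records


# A made-up two-atom entry, for illustration only.
sample_text = "\n".join([
    "HEADER    HYDROLASE",
    "TITLE     HYPOTHETICAL EXAMPLE ENTRY",
    format_atom_line(1, " N", "ALA", "A", 1, 11.104, 6.134, -6.504, 1.00, 20.00, "N"),
    format_atom_line(2, " CA", "ALA", "A", 1, 11.639, 6.071, -5.147, 1.00, 18.50, "C"),
    "END",
])

header_lines, atom_records = split_pdb(sample_text)
print(len(header_lines), "header lines,", len(atom_records), "atoms")
print(atom_records[1])
```

Slicing by column rather than splitting on whitespace matters because neighbouring fields in this format can touch each other with no space in between.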

   The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins, viruses and nucleic acids. The data are typically obtained by X-ray crystallography (80%) or NMR spectroscopy (16%) and submitted by biologists and biochemists from around the world. They are available to the public and freely accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

   The PDB is a key resource in areas of structural biology, such as structural genomics. Most major scientific journals, and some funding agencies, such as the NIH in the USA, now require scientists to submit their structure data to the PDB. If the contents of the PDB are thought of as primary data, then there are hundreds of derived (i.e., secondary) databases that categorize the data differently. For example, both SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations; GO categorizes structures based on genes.

   The PDB was founded in 1971 at Brookhaven National Laboratory, New York. The first sets of data were entered on punched cards, and later on magnetic tapes. Management was then transferred to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998. Currently it holds 29,000 released structures, and it is an important resource for research in the academic, pharmaceutical, and biotechnology sectors, helping to answer questions such as: Can this molecule turn a cell cancerous? Can this combination of molecules cure the common cold? How does radiation affect RNA and DNA?

STRUCTURES:

1) Crystallographic analysis of counter-ion effect on Subtilisin enzymatic action in Acetonitrile

Display: Ball & Stick, Colour: CPK

DETAILS

Experiment method:

Many bacterial pathogens produce extracellular proteases that degrade the extracellular matrix of the host and therefore are involved in disease pathogenesis. Dichelobacter nodosus is the causative agent of ovine footrot, a highly contagious disease that is characterized by the separation of the hoof from the underlying tissue. D. nodosus secretes three subtilisin-like proteases whose analysis forms the basis of diagnostic tests that differentiate between virulent and benign strains and have been postulated to play a role in virulence. We have constructed protease mutants of D. nodosus; their analysis in a sheep virulence model revealed that one of these enzymes, AprV2, was required for virulence. These studies challenge the previous hypothesis that the elastase activity of AprV2 is important for disease progression, since aprV2 mutants were virulent when complemented with aprB2, which encodes a variant that has impaired elastase activity. We have determined the crystal structures of both AprV2 and AprB2 and characterized the biological activity of these enzymes. These data reveal that an unusual extended disulphide-tethered loop functions as an exosite, mediating effective enzyme-substrate interactions. The disulphide bond and Tyr92, which was located at the exposed end of the loop, were functionally important. Bioinformatic analyses suggested that other pathogenic bacteria may have proteases that utilize a similar mechanism. In conclusion, we have used an integrated multidisciplinary combination of bacterial genetics, whole animal virulence trials in the original host, biochemical studies, and comprehensive analysis of crystal structures to provide the first definitive evidence that the extracellular secreted proteases produced by D. nodosus are required for virulence and to elucidate the molecular mechanism by which these proteases bind to their natural substrates. We postulate that this exosite mechanism may be used by proteases produced by other bacterial pathogens of both humans and animals.

Experiment method	X-RAY DIFFRACTION with resolution of 2.24 Å
Authors	Cianci, M., Tomaszewski, B., Helliwell, J.R., Halling, P.J.
Classification	Hydrolase

2) Structure of human cytosolic X-prolyl aminopeptidase

Display: Ribbons, Colour: Temperature

DETAILS

Experiment methods:

The prolyl aminopeptidase complexes of Ala-TBODA [2-alanyl-5-tert-butyl-(1, 3, 4)-oxadiazole] and Sar-TBODA [2-sarcosyl-5-tert-butyl-(1, 3, 4)-oxadiazole] were analyzed by X-ray crystallography at 2.4 angstroms resolution. Frames of alanine and sarcosine residues were well superimposed on each other in the pyrrolidine ring of proline residue, suggesting that Ala and Sar are recognized as parts of this ring of proline residue by the presence of a hydrophobic proline pocket at the active site. Interestingly, there was an unusual extra space at the bottom of the hydrophobic pocket where proline residue is fixed in the prolyl aminopeptidase. Moreover, 4-acetyloxyproline-betaNA (4-acetyloxyproline beta-naphthylamide) was a better substrate than Pro-betaNA. Computer docking simulation well supports the idea that the 4-acetyloxyl group of the substrate fitted into that space. Alanine scanning mutagenesis of Phe139, Tyr149, Tyr150, Phe236, and Cys271, consisting of the hydrophobic pocket, revealed that all of these five residues are involved significantly in the formation of the hydrophobic proline pocket for the substrate. Tyr149 and Cys271 may be important for the extra space and may orient the acetyl derivative of hydroxyproline to a preferable position for hydrolysis. These findings imply that the efficient degradation of collagen fragment may be achieved through an acetylation process by the bacteria.

Experiment method	X-RAY DIFFRACTION with resolution of 1.60 Å
Authors	Li, X., Lou, Z., Rao, Z.
Classification	Hydrolase

3) Human START domain of Acyl-coenzyme A thioesterase 11 (ACOT11)

Display: Wireframe, Colour: Shapely

DETAILS

Experiment methods:

Escherichia coli shows a pleiotropic response (the SOS response) to treatments that damage DNA or inhibit DNA replication. Previous evidence has suggested that the product of the lexA gene is involved in regulating the SOS response, perhaps as a repressor, and that it is sensitive to the recA protease. We show here that lexA protein is a repressor of at least two genes, recA and lexA. Purified protein bound specifically to the regulatory regions of the two genes, as judged by DNase I protection experiments, and it specifically inhibited in vitro transcription of both genes. The binding sites in recA and lexA were found to be about 20 base pairs (bp) and 40 bp long, respectively. The 40-bp sequence in lexA was composed of two adjacent 20-bp sequences, which had considerable homology to one another and to the corresponding recA sequence. These 20-bp sequences, which we term "SOS boxes," show considerable inverted repeat structure as well. These features suggest that each box represents a single repressor binding site. Finally, we found that purified lexA protein was a substrate for the recA protease in a reaction requiring ATP or an analogue, adenosine 5'-[gamma-thio]triphosphate, and denatured DNA.

Experiment method	X-RAY DIFFRACTION with resolution of 2.00 Å
Authors	Siponen, M.I., Lehtio, L., Arrowsmith, C.H., Berglund, H., Bountra, C., Collins, R., Dahlgren, L.G., Edwards, A.M., Flodin, S., Flores, A., Graslund, S., Hammarstrom, M., Johansson, A., Johansson, I., Karlberg, T., Kotenyova, T., Moche, M., Nilsson, M.E., Nordlund, P., Nyman, T., Persson, C., Sagemark, J., Thorsell, A.G., Tresaugues, L., Van-Den-Berg, S., Weigelt, J., Welin, M., Wikstrom, M., Wisniewska, M., Shueler, H., Structural Genomics Consortium (SGC)
Classification	Lipid Transport

All the protein structure images above were produced using the RasWin software, with all the data taken from the link: RCSB

There are a lot of websites where you can find out more about PDB, such as:

1. Wikipedia

2. Molecules To Go

That's all for today's assignment. I hope you all will enjoy it. Tq.
Wassalam.